1 # $Id$
3 # BioPerl module for Bio::SearchIO::blast
5 # Please direct questions and support issues to <bioperl-l@bioperl.org>
7 # Cared for by Jason Stajich <jason@bioperl.org>
9 # Copyright Jason Stajich
11 # You may distribute this module under the same terms as perl itself
13 # POD documentation - main docs before the code
15 # 20030409 - sac
16 # PSI-BLAST full parsing support. Rollout of new
17 # model which will remove Steve's old psiblast driver
18 # 20030424 - jason
19 # Megablast parsing fix as reported by Neil Saunders
20 # 20030427 - jason
21 # Support bl2seq parsing
22 # 20031124 - jason
23 # Parse more blast statistics, lambda, entropy, etc
24 # from WU-BLAST in frame-specific manner
25 # 20060216 - cjf - fixed blast parsing for BLAST v2.2.13 output
26 # 20071104 - dmessina - added support for WUBLAST -echofilter
27 # 20071121 - cjf - fixed several bugs (bugs 2391, 2399, 2409)
29 =head1 NAME
31 Bio::SearchIO::blast - Event generator for event based parsing of
32 blast reports
34 =head1 SYNOPSIS
36 # Do not use this object directly - it is used as part of the
37 # Bio::SearchIO system.
39 use Bio::SearchIO;
40 my $searchio = Bio::SearchIO->new(-format => 'blast',
41 -file => 't/data/ecolitst.bls');
42 while( my $result = $searchio->next_result ) {
43 while( my $hit = $result->next_hit ) {
44 while( my $hsp = $hit->next_hsp ) {
45 # ...
46 }
47 }
48 }
50 =head1 DESCRIPTION
52 This object encapsulates the methods needed to generate events
53 suitable for building Bio::Search objects from a BLAST report file.
54 Read the L<Bio::SearchIO> documentation for more information about how to use this.
56 This driver can parse:
58 =over 4
60 =item *
62 NCBI-produced plain text BLAST reports from blastall, including
63 PSIBLAST, PSITBLASTN, RPSBLAST, and bl2seq reports. NCBI XML
64 BLAST output is parsed with the blastxml SearchIO driver instead.
66 =item *
68 All WU-BLAST reports
70 =item *
72 Jim Kent's BLAST-like output from his programs (BLASTZ, BLAT)
74 =item *
76 BLAST-like output from Paracel BTK
78 =back
80 =head2 bl2seq parsing
82 bl2seq does not report the algorithm used, so the parser cannot
83 differentiate between BLASTX and TBLASTN and assumes BLASTX by default.
84 You can supply the program type with -report_type in the SearchIO
85 constructor, e.g.
87 my $parser = Bio::SearchIO->new(-format => 'blast',
88 -file => 'bl2seq.tblastn.report',
89 -report_type => 'tblastn');
91 This only affects where the frame and strand information is put.
92 Unless -report_type says otherwise, blastx and tblastn reports produced
93 by bl2seq always carry it on $hsp-E<gt>query rather than on the
94 $hsp-E<gt>hit part of the feature pair.
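The placement rule described above can be sketched as a small standalone helper. This is an illustration only: frame_side is a hypothetical name that is not part of Bio::SearchIO, and the sketch merely mirrors the report-type checks this driver appears to apply when it meets a Frame line (BLASTX assumed when no type is given).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): given a report type, say which
# side of the feature pair receives the frame information.
sub frame_side {
    my ($report_type) = @_;

    # bl2seq gives no algorithm name, so fall back to BLASTX as the driver does
    my $type = uc( defined $report_type ? $report_type : 'BLASTX' );

    return 'both'  if $type eq 'TBLASTX';
    return 'hit'   if $type eq 'TBLASTN' || $type eq 'PSITBLASTN';
    return 'query' if $type eq 'BLASTX';
    return 'none';    # other report types are not covered by this sketch
}

for my $type (qw(blastx tblastn tblastx)) {
    printf "%-8s frame goes on: %s\n", $type, frame_side($type);
}
```

In other words, passing -report_type =E<gt> 'tblastn' is what moves the frame from the query side to the hit side.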
96 =head1 FEEDBACK
98 =head2 Mailing Lists
100 User feedback is an integral part of the evolution of this and other
101 Bioperl modules. Send your comments and suggestions preferably to
102 the Bioperl mailing list. Your participation is much appreciated.
104 bioperl-l@bioperl.org - General discussion
105 http://bioperl.org/wiki/Mailing_lists - About the mailing lists
107 =head2 Support
109 Please direct usage questions or support issues to the mailing list:
111 L<bioperl-l@bioperl.org>
113 rather than to the module maintainer directly. Many experienced and
114 responsive experts will be able to look at the problem and quickly
115 address it. Please include a thorough description of the problem
116 with code and data examples if at all possible.
118 =head2 Reporting Bugs
120 Report bugs to the Bioperl bug tracking system to help us keep track
121 of the bugs and their resolution. Bug reports can be submitted via the
122 web:
124 http://bugzilla.open-bio.org/
126 =head1 AUTHOR - Jason Stajich
128 Email Jason Stajich jason-at-bioperl.org
130 =head1 CONTRIBUTORS
132 Steve Chervitz sac-at-bioperl.org
134 =head1 APPENDIX
136 The rest of the documentation details each of the object methods.
137 Internal methods are usually preceded with a _
139 =cut
141 # Let the code begin...
143 package Bio::SearchIO::blast;
145 use Bio::SearchIO::IteratedSearchResultEventBuilder;
146 use strict;
147 use vars qw(%MAPPING %MODEMAP
148 $DEFAULT_BLAST_WRITER_CLASS
149 $MAX_HSP_OVERLAP
150 $DEFAULT_SIGNIF
151 $DEFAULT_SCORE
152 $DEFAULTREPORTTYPE
156 use base qw(Bio::SearchIO);
157 use Data::Dumper;
159 BEGIN {
161 # mapping of NCBI Blast terms to Bioperl hash keys
162 %MODEMAP = (
163 'BlastOutput' => 'result',
164 'Iteration' => 'iteration',
165 'Hit' => 'hit',
166 'Hsp' => 'hsp'
169 # This should really be done more intelligently, like with
170 # XSLT
172 %MAPPING = (
173 'Hsp_bit-score' => 'HSP-bits',
174 'Hsp_score' => 'HSP-score',
175 'Hsp_evalue' => 'HSP-evalue',
176 'Hsp_n' => 'HSP-n',
177 'Hsp_pvalue' => 'HSP-pvalue',
178 'Hsp_query-from' => 'HSP-query_start',
179 'Hsp_query-to' => 'HSP-query_end',
180 'Hsp_hit-from' => 'HSP-hit_start',
181 'Hsp_hit-to' => 'HSP-hit_end',
182 'Hsp_positive' => 'HSP-conserved',
183 'Hsp_identity' => 'HSP-identical',
184 'Hsp_gaps' => 'HSP-hsp_gaps',
185 'Hsp_hitgaps' => 'HSP-hit_gaps',
186 'Hsp_querygaps' => 'HSP-query_gaps',
187 'Hsp_qseq' => 'HSP-query_seq',
188 'Hsp_hseq' => 'HSP-hit_seq',
189 'Hsp_midline' => 'HSP-homology_seq',
190 'Hsp_align-len' => 'HSP-hsp_length',
191 'Hsp_query-frame' => 'HSP-query_frame',
192 'Hsp_hit-frame' => 'HSP-hit_frame',
193 'Hsp_links' => 'HSP-links',
194 'Hsp_group' => 'HSP-hsp_group',
195 'Hsp_features' => 'HSP-hit_features',
197 'Hit_id' => 'HIT-name',
198 'Hit_len' => 'HIT-length',
199 'Hit_accession' => 'HIT-accession',
200 'Hit_def' => 'HIT-description',
201 'Hit_signif' => 'HIT-significance',
202 # For NCBI blast, the description line contains bits.
203 # For WU-blast, the description line contains score.
204 'Hit_score' => 'HIT-score',
205 'Hit_bits' => 'HIT-bits',
207 'Iteration_iter-num' => 'ITERATION-number',
208 'Iteration_converged' => 'ITERATION-converged',
210 'BlastOutput_program' => 'RESULT-algorithm_name',
211 'BlastOutput_version' => 'RESULT-algorithm_version',
212 'BlastOutput_query-def' => 'RESULT-query_name',
213 'BlastOutput_query-len' => 'RESULT-query_length',
214 'BlastOutput_query-acc' => 'RESULT-query_accession',
215 'BlastOutput_query-gi' => 'RESULT-query_gi',
216 'BlastOutput_querydesc' => 'RESULT-query_description',
217 'BlastOutput_db' => 'RESULT-database_name',
218 'BlastOutput_db-len' => 'RESULT-database_entries',
219 'BlastOutput_db-let' => 'RESULT-database_letters',
220 'BlastOutput_inclusion-threshold' => 'RESULT-inclusion_threshold',
222 'Parameters_matrix' => { 'RESULT-parameters' => 'matrix' },
223 'Parameters_expect' => { 'RESULT-parameters' => 'expect' },
224 'Parameters_include' => { 'RESULT-parameters' => 'include' },
225 'Parameters_sc-match' => { 'RESULT-parameters' => 'match' },
226 'Parameters_sc-mismatch' => { 'RESULT-parameters' => 'mismatch' },
227 'Parameters_gap-open' => { 'RESULT-parameters' => 'gapopen' },
228 'Parameters_gap-extend' => { 'RESULT-parameters' => 'gapext' },
229 'Parameters_filter' => { 'RESULT-parameters' => 'filter' },
230 'Parameters_allowgaps' => { 'RESULT-parameters' => 'allowgaps' },
231 'Parameters_full_dbpath' => { 'RESULT-parameters' => 'full_dbpath' },
232 'Statistics_db-len' => { 'RESULT-statistics' => 'dbentries' },
233 'Statistics_db-let' => { 'RESULT-statistics' => 'dbletters' },
234 'Statistics_hsp-len' =>
235 { 'RESULT-statistics' => 'effective_hsplength' },
236 'Statistics_query-len' => { 'RESULT-statistics' => 'querylength' },
237 'Statistics_eff-space' => { 'RESULT-statistics' => 'effectivespace' },
238 'Statistics_eff-spaceused' =>
239 { 'RESULT-statistics' => 'effectivespaceused' },
240 'Statistics_eff-dblen' =>
241 { 'RESULT-statistics' => 'effectivedblength' },
242 'Statistics_kappa' => { 'RESULT-statistics' => 'kappa' },
243 'Statistics_lambda' => { 'RESULT-statistics' => 'lambda' },
244 'Statistics_entropy' => { 'RESULT-statistics' => 'entropy' },
245 'Statistics_gapped_kappa' => { 'RESULT-statistics' => 'kappa_gapped' },
246 'Statistics_gapped_lambda' =>
247 { 'RESULT-statistics' => 'lambda_gapped' },
248 'Statistics_gapped_entropy' =>
249 { 'RESULT-statistics' => 'entropy_gapped' },
251 'Statistics_framewindow' =>
252 { 'RESULT-statistics' => 'frameshiftwindow' },
253 'Statistics_decay' => { 'RESULT-statistics' => 'decayconst' },
255 'Statistics_hit_to_db' => { 'RESULT-statistics' => 'Hits_to_DB' },
256 'Statistics_num_suc_extensions' =>
257 { 'RESULT-statistics' => 'num_successful_extensions' },
259 # WU-BLAST stats
260 'Statistics_DFA_states' => { 'RESULT-statistics' => 'num_dfa_states' },
261 'Statistics_DFA_size' => { 'RESULT-statistics' => 'dfa_size' },
262 'Statistics_noprocessors' =>
263 { 'RESULT-statistics' => 'no_of_processors' },
264 'Statistics_neighbortime' =>
265 { 'RESULT-statistics' => 'neighborhood_generate_time' },
266 'Statistics_starttime' => { 'RESULT-statistics' => 'start_time' },
267 'Statistics_endtime' => { 'RESULT-statistics' => 'end_time' },
270 # add WU-BLAST Frame-Based Statistics
271 for my $frame ( 0 .. 3 ) {
272 for my $strand ( '+', '-' ) {
273 for my $ind (
274 qw(length efflength E S W T X X_gapped E2
275 E2_gapped S2)
278 $MAPPING{"Statistics_frame$strand$frame\_$ind"} =
279 { 'RESULT-statistics' => "Frame$strand$frame\_$ind" };
281 for my $val (qw(lambda kappa entropy )) {
282 for my $type (qw(used computed gapped)) {
283 my $key = "Statistics_frame$strand$frame\_$val\_$type";
284 my $val =
285 { 'RESULT-statistics' =>
286 "Frame$strand$frame\_$val\_$type" };
287 $MAPPING{$key} = $val;
293 # add Statistics
294 for my $stats (
295 qw(T A X1 X2 X3 S1 S2 X1_bits X2_bits X3_bits
296 S1_bits S2_bits num_extensions
297 num_successful_extensions
298 seqs_better_than_cutoff
299 posted_date
300 search_cputime total_cputime
301 search_actualtime total_actualtime
302 no_of_processors ctxfactor)
305 my $key = "Statistics_$stats";
306 my $val = { 'RESULT-statistics' => $stats };
307 $MAPPING{$key} = $val;
310 # add WU-BLAST Parameters
311 for my $param (
312 qw(span span1 span2 links warnings notes hspsepsmax
313 hspsepqmax topcomboN topcomboE postsw cpus wordmask
314 filter sort_by_pvalue sort_by_count sort_by_highscore
315 sort_by_totalscore sort_by_subjectlength noseqs gi qtype
316 qres V B Z Y M N)
319 my $key = "Parameters_$param";
320 my $val = { 'RESULT-parameters' => $param };
321 $MAPPING{$key} = $val;
324 $DEFAULT_BLAST_WRITER_CLASS = 'Bio::Search::Writer::HitTableWriter';
325 $MAX_HSP_OVERLAP = 2; # Used when tiling multiple HSPs.
326 $DEFAULTREPORTTYPE = 'BLASTP'; # for bl2seq
329 =head2 new
331 Title : new
332 Usage : my $obj = Bio::SearchIO::blast->new(%args);
333 Function: Builds a new Bio::SearchIO::blast object
334 Returns : Bio::SearchIO::blast
335 Args : Key-value pairs:
336 -fh/-file => filehandle/filename to BLAST file
337 -format => 'blast'
338 -report_type => 'blastx', 'tblastn', etc -- only for bl2seq
339 reports when you want to distinguish between
340 tblastn and blastx reports (this only controls
341 where the frame information is put - on the query
342 or subject object).
343 -inclusion_threshold => e-value threshold for inclusion in the
344 PSI-BLAST score matrix model (blastpgp)
345 -signif => float or scientific notation number to be used
346 as a P- or Expect value cutoff
347 -score => integer or scientific notation number to be used
348 as a blast score value cutoff
349 -bits => integer or scientific notation number to be used
350 as a bit score value cutoff
351 -hit_filter => reference to a function to be used for
352 filtering hits based on arbitrary criteria.
353 All hits of each BLAST report must satisfy
354 this criterion to be retained.
355 If a hit fails this test, it is ignored.
356 This function should take a
357 Bio::Search::Hit::BlastHit object as its first
358 argument and return true
359 if the hit should be retained.
360 Sample filter function:
361 -hit_filter => sub { my $hit = shift;
362 $hit->gaps == 0; },
363 (Note: -filt_func is synonymous with -hit_filter)
364 -overlap => integer. The amount of overlap to permit between
365 adjacent HSPs when tiling HSPs. A reasonable value is 2.
366 Default = $Bio::SearchIO::blast::MAX_HSP_OVERLAP.
368 The following criteria are not yet supported:
369 (these are probably best applied within this module rather than in the
370 event handler since they would permit the parser to take some shortcuts.)
372 -check_all_hits => boolean. Check all hits for significance against
373 significance criteria. Default = false.
374 If false, stops processing hits after the first
375 non-significant hit or the first hit that fails
376 the hit_filter call. This speeds parsing,
377 taking advantage of the fact that the hits
378 are processed in the order they appear in the report.
379 -min_query_len => integer to be used as a minimum for query sequence length.
380 Reports with query sequences below this length will
381 not be processed. Default = no minimum length.
382 -best => boolean. Only process the best hit of each report;
383 default = false.
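The -hit_filter contract described above (one hit object in, a true value out to retain the hit) can be tried without a BLAST report. The following is a minimal sketch under stated assumptions: MockHit is a hypothetical stand-in for Bio::Search::Hit::BlastHit that imitates only name() and gaps(), and grep stands in for whatever the parser does internally when it applies the filter.

```perl
use strict;
use warnings;

# MockHit is a hypothetical stand-in for Bio::Search::Hit::BlastHit so the
# sketch runs without BioPerl; only name() and gaps() are imitated.
package MockHit;
sub new  { my ( $class, %args ) = @_; return bless {%args}, $class }
sub name { return $_[0]->{name} }
sub gaps { return $_[0]->{gaps} }

package main;

# The filter receives the hit as its first argument and returns true to keep it.
my $hit_filter = sub {
    my $hit = shift;
    return $hit->gaps == 0;
};

my @hits = (
    MockHit->new( name => 'hitA', gaps => 0 ),
    MockHit->new( name => 'hitB', gaps => 3 ),
);

# grep imitates the parser applying the filter to each hit in turn.
my @kept = map { $_->name } grep { $hit_filter->($_) } @hits;
print "retained: @kept\n";    # only hitA has no gaps
```

With BioPerl installed, the same code reference would be passed as -hit_filter (or -filt_func) to the constructor, alongside -format and -file.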
385 =cut
387 sub _initialize {
388 my ( $self, @args ) = @_;
389 $self->SUPER::_initialize(@args);
391 # Blast reports require a specialized version of the SREB due to the
392 # possibility of iterations (PSI-BLAST). Forwarding all arguments to it. An
393 # issue here is that we want to set new default object factories if none are
394 # supplied.
396 my $handler = Bio::SearchIO::IteratedSearchResultEventBuilder->new(@args);
397 $self->attach_EventHandler($handler);
399 # 2006-04-26 move this to the attach_handler function in this module so we
400 # can really reset the handler
401 # Optimization: caching
402 # the EventHandler since it is used a lot during the parse.
404 # $self->{'_handler_cache'} = $handler;
406 my ( $min_qlen, $check_all, $overlap, $best, $rpttype ) = $self->_rearrange(
408 qw(MIN_LENGTH CHECK_ALL_HITS
409 OVERLAP BEST
410 REPORT_TYPE)
412 @args
415 defined $min_qlen && $self->min_query_length($min_qlen);
416 defined $best && $self->best_hit_only($best);
417 defined $check_all && $self->check_all_hits($check_all);
418 defined $rpttype && ( $self->{'_reporttype'} = $rpttype );
421 sub attach_EventHandler {
422 my ($self,$handler) = @_;
424 $self->SUPER::attach_EventHandler($handler);
426 # Optimization: caching the EventHandler since it is used a lot
427 # during the parse.
429 $self->{'_handler_cache'} = $handler;
430 return;
433 =head2 next_result
435 Title : next_result
436 Usage : my $result = $searchio->next_result;
437 Function: Returns the next Result from a search
438 Returns : Bio::Search::Result::ResultI object
439 Args : none
441 =cut
443 sub next_result {
444 my ($self) = @_;
445 my $v = $self->verbose;
446 my $data = '';
447 my $flavor = '';
448 $self->{'_seentop'} = 0; # start next report at top
450 my ( $reporttype, $seenquery, $reportline );
451 my ( $seeniteration, $found_again );
452 my $incl_threshold = $self->inclusion_threshold;
453 my $bl2seq_fix;
454 $self->start_document(); # let the fun begin...
455 my (@hit_signifs);
456 my $gapped_stats = 0; # for switching between gapped/ungapped
457 # lambda, K, H
458 local $_ = "\n"; #consistency
459 PARSER:
460 while ( defined( $_ = $self->_readline ) ) {
461 next if (/^\s+$/); # skip empty lines
462 next if (/CPU time:/);
463 next if (/^>\s*$/);
464 if (
465 /^((?:\S+?)?BLAST[NPX]?)\s+(.+)$/i # NCBI BLAST, PSIBLAST
466 # RPSBLAST, MEGABLAST
467 || /^(P?GENEWISE|HFRAME|SWN|TSWN)\s+(.+)/i #Paracel BTK
470 ($reporttype, my $reportversion) = ($1, $2);
471 # need to keep track of whether this is WU-BLAST
472 if ($reportversion && $reportversion =~ m{WashU$}) {
473 $self->{'_wublast'}++;
475 $self->debug("blast.pm: Start of new report: $reporttype, $reportversion\n");
476 if ( $self->{'_seentop'} ) {
477 # This handles multi-result input streams
478 $self->_pushback($_);
479 last PARSER;
481 $self->_start_blastoutput;
482 if ($reporttype =~ /RPS-BLAST/) {
483 $reporttype .= '(BLASTP)'; # default RPS-BLAST type
485 $reportline = $_; # to fix the fact that RPS-BLAST output is wrong
486 $self->element(
488 'Name' => 'BlastOutput_program',
489 'Data' => $reporttype
493 $self->element(
495 'Name' => 'BlastOutput_version',
496 'Data' => $reportversion
499 $self->element(
501 'Name' => 'BlastOutput_inclusion-threshold',
502 'Data' => $incl_threshold
506 # added Windows workaround for bug 1985
507 elsif (/^(Searching|Results from round)/) {
508 next unless $1 =~ /Results from round/;
509 $self->debug("blast.pm: Possible psi blast iterations found...\n");
511 $self->in_element('hsp')
512 && $self->end_element( { 'Name' => 'Hsp' } );
513 $self->in_element('hit')
514 && $self->end_element( { 'Name' => 'Hit' } );
515 if ( defined $seeniteration ) {
516 $self->within_element('iteration')
517 && $self->end_element( { 'Name' => 'Iteration' } );
518 $self->start_element( { 'Name' => 'Iteration' } );
520 else {
521 $self->start_element( { 'Name' => 'Iteration' } );
523 $seeniteration = 1;
525 elsif (/^Query=\s*(.*)$/) {
526 my $q = $1;
527 $self->debug("blast.pm: Query= found...$_\n");
528 my $size = 0;
529 if ( defined $seenquery ) {
530 $self->_pushback($reportline) if $reportline;
531 $self->_pushback($_);
532 last PARSER;
534 else {
535 if ( !defined $reporttype ) {
536 $self->_start_blastoutput;
537 if ( defined $seeniteration ) {
538 $self->in_element('iteration')
539 && $self->end_element( { 'Name' => 'Iteration' } );
540 $self->start_element( { 'Name' => 'Iteration' } );
542 else {
543 $self->start_element( { 'Name' => 'Iteration' } );
545 $seeniteration = 1;
548 $seenquery = $q;
549 $_ = $self->_readline;
550 while ( defined($_) ) {
551 if (/^Database:/) {
552 $self->_pushback($_);
553 last;
555 # below line fixes length issue with BLAST v2.2.13; still works
556 # with BLAST v2.2.12
557 if ( /\((\-?[\d,]+)\s+letters.*\)/ || /^Length=(\-?[\d,]+)/ ) {
558 $size = $1;
559 $size =~ s/,//g;
560 last;
562 else {
563 # bug 2391
564 $q .= ($q =~ /\w$/ && $_ =~ /^\w/) ? " $_" : $_;
565 $q =~ s/\s+/ /g; # this catches the newline as well
566 $q =~ s/^ | $//g;
569 $_ = $self->_readline;
571 chomp($q);
572 my ( $nm, $desc ) = split( /\s+/, $q, 2 );
573 $self->element(
575 'Name' => 'BlastOutput_query-def',
576 'Data' => $nm
578 ) if $nm;
579 $self->element(
581 'Name' => 'BlastOutput_query-len',
582 'Data' => $size
585 defined $desc && $desc =~ s/\s+$//;
586 $self->element(
588 'Name' => 'BlastOutput_querydesc',
589 'Data' => $desc
592 my ( $gi, $acc, $version ) = $self->_get_seq_identifiers($nm);
593 $version = defined($version) && length($version) ? ".$version" : "";
594 $self->element(
596 'Name' => 'BlastOutput_query-acc',
597 'Data' => "$acc$version"
599 ) if $acc;
601 # added check for WU-BLAST -echofilter option (bug 2388)
602 elsif (/^>Unfiltered[+-]1$/) {
603 # skip all of the lines of unfiltered sequence
604 while( defined($_) && $_ !~ /^Database:/ ) { # also stop at end of input
605 $self->debug("Bypassing features line: $_");
606 $_ = $self->_readline;
608 $self->_pushback($_);
610 elsif (/Sequences producing significant alignments:/) {
611 $self->debug("blast.pm: Processing NCBI-BLAST descriptions\n");
612 $flavor = 'ncbi';
614 # PSI-BLAST parsing needs to be fixed to specifically look
615 # for old vs new per iteration, as sorting based on duplication
616 # leads to bugs, see bug 1986
618 # The next line is not necessarily whitespace in psiblast reports.
619 # Also note that we must look for the end of this section by testing
620 # for a line with a leading >. Blank lines occur with this section
621 # for psiblast.
622 if ( !$self->in_element('iteration') ) {
623 $self->start_element( { 'Name' => 'Iteration' } );
625 # these elements are dropped with some multiquery reports; add
626 # back here
627 $self->element(
629 'Name' => 'BlastOutput_db-len',
630 'Data' => $self->{'_blsdb_length'}
632 ) if $self->{'_blsdb_length'};
633 $self->element(
635 'Name' => 'BlastOutput_db-let',
636 'Data' => $self->{'_blsdb_letters'}
638 ) if $self->{'_blsdb_letters'};
639 $self->element(
641 'Name' => 'BlastOutput_db',
642 'Data' => $self->{'_blsdb'}
644 ) if $self->{'_blsdb'};
646 # changed 8/28/2008 to exit hit table if blank line is found after an
647 # appropriate line
648 my $h_regex;
649 my $seen_block;
650 DESCLINE:
651 while ( defined( my $descline = $self->_readline() ) ) {
652 if ($descline =~ m{^\s*$}) {
653 last DESCLINE if $seen_block;
654 next DESCLINE;
656 # any text match is part of block...
657 $seen_block++;
658 # GCG multiline oddness...
659 if ($descline =~ /^(\S+)\s+Begin:\s\d+\s+End:\s+\d+/xms) {
660 my ($id, $nextline) = ($1, $self->_readline);
661 $nextline =~ s{^!}{};
662 $descline = "$id $nextline";
664 # NCBI style hit table (no N)
665 if ($descline =~ /(?<!cor) # negative lookbehind
666 (\d*\.?(?:[\+\-eE]+)?\d+) # number (float or scientific notation)
667 \s+ # space
668 (\d*\.?(?:[\+\-eE]+)?\d+) # number (float or scientific notation)
669 \s*$/xms) {
671 my ( $score, $evalue ) = ($1, $2);
673 # Some data clean-up so e-value will appear numeric to perl
674 $evalue =~ s/^e/1e/i;
676 # This to handle no-HSP case
677 my @line = split ' ',$descline;
679 # we want to throw away the score, evalue
680 pop @line, pop @line;
682 # and N if it is present (of course they are not
683 # really in that order, but it doesn't matter)
684 if ($3) { pop @line }
686 # add the last 2 entries s.t. we can reconstruct
687 # a minimal Hit object at the end of the day
688 push @hit_signifs, [ $evalue, $score, shift @line, join( ' ', @line ) ];
689 } elsif ($descline =~ /^CONVERGED/i) {
690 $self->element(
692 'Name' => 'Iteration_converged',
693 'Data' => 1
696 } else {
697 $self->_pushback($descline); # Catch leading > (end of section)
698 last DESCLINE;
702 elsif (/Sequences producing High-scoring Segment Pairs:/) {
704 # This block is for WU-BLAST, so we don't have to check for psi-blast stuff
705 # skip the next line
706 $self->debug("blast.pm: Processing WU-BLAST descriptions\n");
707 $_ = $self->_readline();
708 $flavor = 'wu';
710 if ( !$self->in_element('iteration') ) {
711 $self->start_element( { 'Name' => 'Iteration' } );
714 while ( defined( $_ = $self->_readline() )
715 && !/^\s+$/ )
717 my @line = split;
718 pop @line; # throw away the last number, which is the 'N' column
720 # add the last 2 entries to array s.t. we can reconstruct
721 # a minimal Hit object at the end of the day
722 push @hit_signifs,
723 [ pop @line, pop @line, shift @line, join( ' ', @line ) ];
727 elsif (/^Database:\s*(.+)$/) {
729 $self->debug("blast.pm: Database: $1\n");
730 my $db = $1;
731 while ( defined( $_ = $self->_readline ) ) {
732 if (
733 /^\s+(\-?[\d\,]+|\S+)\s+sequences\;
734 \s+(\-?[\d,]+|\S+)\s+ # Deal with NCBI 2.2.8 OSX problems
735 total\s+letters/ox
738 my ( $s, $l ) = ( $1, $2 );
739 $s =~ s/,//g;
740 $l =~ s/,//g;
741 $self->element(
743 'Name' => 'BlastOutput_db-len',
744 'Data' => $s
747 $self->element(
749 'Name' => 'BlastOutput_db-let',
750 'Data' => $l
753 # cache for next round in cases with multiple queries
754 $self->{'_blsdb'} = $db;
755 $self->{'_blsdb_length'} = $s;
756 $self->{'_blsdb_letters'} = $l;
757 last;
759 else {
760 chomp;
761 $db .= $_;
764 $self->element(
766 'Name' => 'BlastOutput_db',
767 'Data' => $db
771 # bypasses this NCBI blast 2.2.13 extra output for now...
772 # Features in/flanking this part of subject sequence:
773 elsif (/^\sFeatures\s\w+\sthis\spart/xmso) {
774 my $featline;
775 $_ = $self->_readline;
776 while($_ !~ /^\s*$/) {
777 chomp;
778 $featline .= $_;
779 $_ = $self->_readline;
781 $self->_pushback($_);
782 $featline =~ s{(?:^\s+|\s+^)}{}g;
783 $featline =~ s{\n}{;}g;
784 $self->{'_last_hspdata'}->{'Hsp_features'} = $featline;
787 # move inside of a hit
788 elsif (/^>\s*(\S+)\s*(.*)?/) {
789 chomp;
791 $self->debug("blast.pm: Hit: $1\n");
792 $self->in_element('hsp')
793 && $self->end_element( { 'Name' => 'Hsp' } );
794 $self->in_element('hit')
795 && $self->end_element( { 'Name' => 'Hit' } );
797 # special case when bl2seq reports don't have a leading
798 # Query=
799 if ( !$self->within_element('result') ) {
800 $self->_start_blastoutput;
801 $self->start_element( { 'Name' => 'Iteration' } );
803 elsif ( !$self->within_element('iteration') ) {
804 $self->start_element( { 'Name' => 'Iteration' } );
806 $self->start_element( { 'Name' => 'Hit' } );
807 my $id = $1;
808 my $restofline = $2;
810 $self->debug("Starting a hit: $1 $2\n");
811 $self->element(
813 'Name' => 'Hit_id',
814 'Data' => $id
817 my ($gi, $acc, $version ) = $self->_get_seq_identifiers($id);
818 $self->element(
820 'Name' => 'Hit_accession',
821 'Data' => $acc
824 # add hit significance (from the hit table)
825 # this is where Bug 1986 went awry
827 # Changed for Bug2409; hit->significance and hit->score/bits derived
828 # from HSPs, not hit table unless necessary
830 HITTABLE:
831 while (my $v = shift @hit_signifs) {
832 my $tableid = $v->[2];
833 if ($tableid !~ m{\Q$id\E}) {
834 $self->debug("Hit table ID $tableid doesn't match current hit id $id, checking next hit table entry...\n");
835 next HITTABLE;
836 } else {
837 last HITTABLE;
840 while ( defined( $_ = $self->_readline() ) ) {
841 next if (/^\s+$/);
842 chomp;
843 if (/Length\s*=\s*([\d,]+)/) {
844 my $l = $1;
845 $l =~ s/\,//g;
846 $self->element(
848 'Name' => 'Hit_len',
849 'Data' => $l
852 last;
854 else {
855 s/^\s(?!\s)/\x01/; #new line to concatenate desc lines with <soh>
856 $restofline .= $_;
859 $restofline =~ s/\s+/ /g;
860 $self->element(
862 'Name' => 'Hit_def',
863 'Data' => $restofline
867 elsif (/\s+(Plus|Minus) Strand HSPs:/i) {
868 next;
870 elsif (
871 ( $self->in_element('hit') || $self->in_element('hsp') )
872 && # paracel genewise BTK
873 m/Score\s*=\s*(\S+)\s*bits\s* # Bit score
874 (?:\((\d+)\))?, # Raw score
875 \s+Log\-Length\sScore\s*=\s*(\d+) # Log-Length score
879 $self->in_element('hsp')
880 && $self->end_element( { 'Name' => 'Hsp' } );
881 $self->start_element( { 'Name' => 'Hsp' } );
883 $self->debug( "Got paracel genewise HSP score=$1\n");
885 # Some data clean-up so e-value will appear numeric to perl
886 my ( $bits, $score, $evalue ) = ( $1, $2, $3 );
887 $evalue =~ s/^e/1e/i;
888 $self->element(
890 'Name' => 'Hsp_score',
891 'Data' => $score
894 $self->element(
896 'Name' => 'Hsp_bit-score',
897 'Data' => $bits
900 $self->element(
902 'Name' => 'Hsp_evalue',
903 'Data' => $evalue
907 elsif (
908 ( $self->in_element('hit') || $self->in_element('hsp') )
909 && # paracel hframe BTK
910 m/Score\s*=\s*([^,\s]+), # Raw score
911 \s*Expect\s*=\s*([^,\s]+), # E-value
912 \s*P(?:\(\S+\))?\s*=\s*([^,\s]+) # P-value
916 $self->in_element('hsp')
917 && $self->end_element( { 'Name' => 'Hsp' } );
918 $self->start_element( { 'Name' => 'Hsp' } );
920 $self->debug( "Got paracel hframe HSP score=$1\n");
922 # Some data clean-up so e-value will appear numeric to perl
923 my ( $score, $evalue, $pvalue ) = ( $1, $2, $3 );
924 $evalue = "1$evalue" if $evalue =~ /^e/;
925 $pvalue = "1$pvalue" if $pvalue =~ /^e/;
927 $self->element(
929 'Name' => 'Hsp_score',
930 'Data' => $score
933 $self->element(
935 'Name' => 'Hsp_evalue',
936 'Data' => $evalue
939 $self->element(
941 'Name' => 'Hsp_pvalue',
942 'Data' => $pvalue
946 elsif (
947 ( $self->in_element('hit') || $self->in_element('hsp') )
948 && # wublast
949 m/Score\s*=\s*(\S+)\s* # Raw score
950 \(([\d\.]+)\s*bits\), # Bit score
951 \s*Expect\s*=\s*([^,\s]+), # E-value
952 \s*(?:Sum)?\s* # SUM
953 P(?:\(\d+\))?\s*=\s*([^,\s]+) # P-value
954 (?:\s*,\s+Group\s*\=\s*(\d+))? # HSP Group
957 { # wu-blast HSP parse
958 $self->in_element('hsp')
959 && $self->end_element( { 'Name' => 'Hsp' } );
960 $self->start_element( { 'Name' => 'Hsp' } );
962 # Some data clean-up so e-value will appear numeric to perl
963 my ( $score, $bits, $evalue, $pvalue, $group ) =
964 ( $1, $2, $3, $4, $5 );
965 $evalue =~ s/^e/1e/i;
966 $pvalue =~ s/^e/1e/i;
968 $self->element(
970 'Name' => 'Hsp_score',
971 'Data' => $score
974 $self->element(
976 'Name' => 'Hsp_bit-score',
977 'Data' => $bits
980 $self->element(
982 'Name' => 'Hsp_evalue',
983 'Data' => $evalue
986 $self->element(
988 'Name' => 'Hsp_pvalue',
989 'Data' => $pvalue
993 if ( defined $group ) {
994 $self->element(
996 'Name' => 'Hsp_group',
997 'Data' => $group
1003 elsif (
1004 ( $self->in_element('hit') || $self->in_element('hsp') )
1005 && # ncbi blast, works with 2.2.17
1006 m/Score\s*=\s*(\S+)\s*bits\s* # Bit score
1007 (?:\((\d+)\))?, # Missing for BLAT pseudo-BLAST fmt
1008 \s*Expect(?:\((\d+\+?)\))?\s*=\s*([^,\s]+) # E-value
1011 { # parse NCBI blast HSP
1012 $self->in_element('hsp')
1013 && $self->end_element( { 'Name' => 'Hsp' } );
1015 # Some data clean-up so e-value will appear numeric to perl
1016 my ( $bits, $score, $n, $evalue ) = ( $1, $2, $3, $4 );
1017 $evalue =~ s/^e/1e/i;
1018 $self->start_element( { 'Name' => 'Hsp' } );
1019 $self->element(
1021 'Name' => 'Hsp_score',
1022 'Data' => $score
1025 $self->element(
1027 'Name' => 'Hsp_bit-score',
1028 'Data' => $bits
1031 $self->element(
1033 'Name' => 'Hsp_evalue',
1034 'Data' => $evalue
1037 $self->element(
1039 'Name' => 'Hsp_n',
1040 'Data' => $n
1042 ) if defined $n;
1043 $score = '' unless defined $score; # deal with BLAT which
1044 # has no score only bits
1045 $self->debug("Got NCBI HSP score=$score, evalue $evalue\n");
1047 elsif (
1048 $self->in_element('hsp')
1049 && m/Identities\s*=\s*(\d+)\s*\/\s*(\d+)\s*[\d\%\(\)]+\s*
1050 (?:,\s*Positives\s*=\s*(\d+)\/(\d+)\s*[\d\%\(\)]+\s*)? # pos only valid for Protein alignments
1051 (?:\,\s*Gaps\s*=\s*(\d+)\/(\d+))? # Gaps
1052 /oxi
1055 $self->element(
1057 'Name' => 'Hsp_identity',
1058 'Data' => $1
1061 $self->element(
1063 'Name' => 'Hsp_align-len',
1064 'Data' => $2
1067 if ( defined $3 ) {
1068 $self->element(
1070 'Name' => 'Hsp_positive',
1071 'Data' => $3
1075 else {
1076 $self->element(
1078 'Name' => 'Hsp_positive',
1079 'Data' => $1
1083 if ( defined $6 ) {
1084 $self->element(
1086 'Name' => 'Hsp_gaps',
1087 'Data' => $5
1092 $self->{'_Query'} = { 'begin' => 0, 'end' => 0 };
1093 $self->{'_Sbjct'} = { 'begin' => 0, 'end' => 0 };
1095 if (/(Frame\s*=\s*.+)$/) {
1097 # handle wu-blast Frame listing on same line
1098 $self->_pushback($1);
1101 elsif ( $self->in_element('hsp')
1102 && /Strand\s*=\s*(Plus|Minus)\s*\/\s*(Plus|Minus)/i )
1105 # consume this event ( we infer strand from start/end)
1106 unless ($reporttype) {
1107 $self->{'_reporttype'} = $reporttype = 'BLASTN';
1108 $bl2seq_fix = 1; # special case to resubmit the algorithm
1109 # reporttype
1111 next;
1113 elsif ( $self->in_element('hsp')
1114 && /Links\s*=\s*(\S+)/ox )
1116 $self->element(
1118 'Name' => 'Hsp_links',
1119 'Data' => $1
1123 elsif ( $self->in_element('hsp')
1124 && /Frame\s*=\s*([\+\-][1-3])\s*(\/\s*([\+\-][1-3]))?/ )
1127 # this is for bl2seq only
1128 unless ( defined $reporttype ) {
1129 $bl2seq_fix = 1;
1130 if ( $1 && $2 ) { $reporttype = 'TBLASTX' }
1131 else {
1132 $reporttype = 'BLASTX';
1134 # we can't distinguish between BLASTX and TBLASTN straight from the report }
1136 $self->{'_reporttype'} = $reporttype;
1139 my ( $queryframe, $hitframe );
1140 if ( $reporttype eq 'TBLASTX' ) {
1141 ( $queryframe, $hitframe ) = ( $1, $2 );
1142 $hitframe =~ s/\/\s*//g;
1144 elsif ( $reporttype eq 'TBLASTN' || $reporttype eq 'PSITBLASTN') {
1145 ( $hitframe, $queryframe ) = ( $1, 0 );
1147 elsif ( $reporttype eq 'BLASTX' || $reporttype eq 'RPS-BLAST(BLASTP)') {
1148 ( $queryframe, $hitframe ) = ( $1, 0 );
1149 # though NCBI doesn't report it, this is a special BLASTX-like
1150 # RPS-BLAST; should be handled differently
1151 if ($reporttype eq 'RPS-BLAST(BLASTP)') {
1152 $self->element(
1154 'Name' => 'BlastOutput_program',
1155 'Data' => 'RPS-BLAST(BLASTX)'
1160 $self->element(
1162 'Name' => 'Hsp_query-frame',
1163 'Data' => $queryframe
1167 $self->element(
1169 'Name' => 'Hsp_hit-frame',
1170 'Data' => $hitframe
1174 elsif (/^Parameters:/
1175 || /^\s+Database:\s+?/
1176 || /^\s+Subset/
1177 || /^\s*Lambda/
1178 || /^\s*Histogram/
1179 || ( $self->in_element('hsp') && /WARNING|NOTE/ ) )
1182 # Note: Lambda check was necessary to parse
1183 # t/data/ecoli_domains.rpsblast AND to parse bl2seq
1184 $self->debug("blast.pm: found parameters section \n");
1186 $self->in_element('hsp')
1187 && $self->end_element( { 'Name' => 'Hsp' } );
1188 $self->in_element('hit')
1189 && $self->end_element( { 'Name' => 'Hit' } );
1191 # This is for the case when we specify -b 0 (or B=0 for WU-BLAST)
1192 # and still want to construct minimal Hit objects
1193 $self->_cleanup_hits(\@hit_signifs) if scalar(@hit_signifs);
1194 $self->within_element('iteration')
1195 && $self->end_element( { 'Name' => 'Iteration' } );
1197 next if /^\s+Subset/;
1198 my $blast = (/^(\s+Database\:)|(\s*Lambda)/) ? 'ncbi' : 'wublast';
1199 if (/^\s*Histogram/) {
1200 $blast = 'btk';
1203 my $last = '';
1205 # default is that gaps are allowed
1206 $self->element(
1208 'Name' => 'Parameters_allowgaps',
1209 'Data' => 'yes'
1212 while ( defined( $_ = $self->_readline ) ) {
1213 if (
1214 /^((?:\S+)?BLAST[NPX]?)\s+(.+)$/i # NCBI BLAST, PSIBLAST
1215 # RPSBLAST, MEGABLAST
1216 || /^(P?GENEWISE|HFRAME|SWN|TSWN)\s+(.+)/i #Paracel BTK
1219 $self->_pushback($_);
1221 # let's handle this in the loop
1222 last;
1224 elsif (/^Query=/) {
1225 $self->_pushback($reportline) if $reportline;
1226 $self->_pushback($_);
1227 last PARSER;
1230 # here is where difference between wublast and ncbiblast
1231 # is better handled by different logic
1232 if ( /Number of Sequences:\s+([\d\,]+)/i
1233 || /of sequences in database:\s+(\-?[\d,]+)/i )
1235 my $c = $1;
1236 $c =~ s/\,//g;
1237 $self->element(
1239 'Name' => 'Statistics_db-len',
1240 'Data' => $c
1244 elsif (/letters in database:\s+(\-?[\d,]+)/i) {
1245 my $s = $1;
1246 $s =~ s/,//g;
1247 $self->element(
1249 'Name' => 'Statistics_db-let',
1250 'Data' => $s
1254 elsif ( $blast eq 'btk' ) {
1255 next;
1257 elsif ( $blast eq 'wublast' ) {
1259 # warn($_);
1260 if (/E=(\S+)/) {
1261 $self->element(
1263 'Name' => 'Parameters_expect',
1264 'Data' => $1
1268 elsif (/nogaps/) {
1269 $self->element(
1271 'Name' => 'Parameters_allowgaps',
1272 'Data' => 'no'
1276 elsif (/ctxfactor=(\S+)/) {
1277 $self->element(
1279 'Name' => 'Statistics_ctxfactor',
1280 'Data' => $1
1284 elsif (
1285 /(postsw|links|span[12]?|warnings|notes|gi|noseqs|qres|qype)/
1288 $self->element(
1290 'Name' => "Parameters_$1",
1291 'Data' => 'yes'
1295 elsif (/(\S+)=(\S+)/) {
1296 $self->element(
1298 'Name' => "Parameters_$1",
1299 'Data' => $2
1303 elsif ( $last =~ /(Frame|Strand)\s+MatID\s+Matrix name/i ) {
1304 my $firstgapinfo = 1;
1305 my $frame = undef;
1306 while ( defined($_) && !/^\s+$/ ) {
1307 s/^\s+//;
1308 s/\s+$//;
1309 if ( $firstgapinfo
1310 && s/Q=(\d+),R=(\d+)\s+//x )
1312 $firstgapinfo = 0;
1314 $self->element(
1316 'Name' => 'Parameters_gap-open',
1317 'Data' => $1
1320 $self->element(
1322 'Name' => 'Parameters_gap-extend',
1323 'Data' => $2
1326 my @fields = split;
1328 for my $type (
1329 qw(lambda_gapped
1330 kappa_gapped
1331 entropy_gapped)
1334 next if $type eq 'n/a';
1335 if ( !@fields ) {
1336 warn "fields is empty for $type\n";
1337 next;
1339 $self->element(
1341 'Name' =>
1342 "Statistics_frame$frame\_$type",
1343 'Data' => shift @fields
1348 else {
1349 my ( $frameo, $matid, $matrix, @fields ) =
1350 split;
1351 if ( !defined $frame ) {
1353 # keep some sort of default feature I guess
1354 # even though this is sort of wrong
1355 $self->element(
1357 'Name' => 'Parameters_matrix',
1358 'Data' => $matrix
1361 $self->element(
1363 'Name' => 'Statistics_lambda',
1364 'Data' => $fields[0]
1367 $self->element(
1369 'Name' => 'Statistics_kappa',
1370 'Data' => $fields[1]
1373 $self->element(
1375 'Name' => 'Statistics_entropy',
1376 'Data' => $fields[2]
1380 $frame = $frameo;
1381 my $ii = 0;
1382 for my $type (
1383 qw(lambda_used
1384 kappa_used
1385 entropy_used
1386 lambda_computed
1387 kappa_computed
1388 entropy_computed)
1391 my $f = $fields[$ii];
1392 next unless defined $f; # deal with n/a
1393 if ( $f eq 'same' ) {
1394 $f = $fields[ $ii - 3 ];
1396 $ii++;
1397 $self->element(
1399 'Name' =>
1400 "Statistics_frame$frame\_$type",
1401 'Data' => $f
1408 # get the next line
1409 $_ = $self->_readline;
1411 $last = $_;
1413 elsif ( $last =~ /(Frame|Strand)\s+MatID\s+Length/i ) {
1414 my $frame = undef;
1415 while ( defined($_) && !/^\s+/ ) {
1416 s/^\s+//;
1417 s/\s+$//;
1418 my @fields = split;
1419 if ( @fields <= 3 ) {
1420 for my $type (qw(X_gapped E2_gapped S2)) {
1421 last unless @fields;
1422 $self->element(
1424 'Name' =>
1425 "Statistics_frame$frame\_$type",
1426 'Data' => shift @fields
1431 else {
1433 for my $type (
1434 qw(length
1435 efflength
1436 E S W T X E2 S2)
1439 $self->element(
1441 'Name' =>
1442 "Statistics_frame$frame\_$type",
1443 'Data' => shift @fields
1448 $_ = $self->_readline;
1450 $last = $_;
1452 elsif (/(\S+\s+\S+)\s+DFA:\s+(\S+)\s+\((.+)\)/) {
1453 if ( $1 eq 'states in' ) {
1454 $self->element(
1456 'Name' => 'Statistics_DFA_states',
1457 'Data' => "$2 $3"
1461 elsif ( $1 eq 'size of' ) {
1462 $self->element(
1464 'Name' => 'Statistics_DFA_size',
1465 'Data' => "$2 $3"
1470 elsif (
1471 m/^\s+Time to generate neighborhood:\s+
1472 (\S+\s+\S+\s+\S+)/x
1475 $self->element(
1477 'Name' => 'Statistics_neighbortime',
1478 'Data' => $1
1482 elsif (/processors\s+used:\s+(\d+)/) {
1483 $self->element(
1485 'Name' => 'Statistics_noprocessors',
1486 'Data' => $1
1490 elsif (
1491 m/^\s+(\S+)\s+cpu\s+time:\s+# cputype
1492 (\S+\s+\S+\s+\S+) # cputime
1493 \s+Elapsed:\s+(\S+)/x
1496 my $cputype = lc($1);
1497 $self->element(
1499 'Name' => "Statistics_$cputype\_cputime",
1500 'Data' => $2
1503 $self->element(
1505 'Name' => "Statistics_$cputype\_actualtime",
1506 'Data' => $3
1510 elsif (/^\s+Start:/) {
1511 my ( $junk, $start, $stime, $end, $etime ) =
1512 split( /\s+(Start|End)\:\s+/, $_ );
1513 chomp($stime);
1514 $self->element(
1516 'Name' => 'Statistics_starttime',
1517 'Data' => $stime
1520 chomp($etime);
1521 $self->element(
1523 'Name' => 'Statistics_endtime',
1524 'Data' => $etime
1528 elsif (/^\s+Database:\s+(.+)$/) {
1529 $self->element(
1531 'Name' => 'Parameters_full_dbpath',
1532 'Data' => $1
1537 elsif (/^\s+Posted:\s+(.+)/) {
1538 my $d = $1;
1539 chomp($d);
1540 $self->element(
1542 'Name' => 'Statistics_posted_date',
1543 'Data' => $d
1548 elsif ( $blast eq 'ncbi' ) {
1550 if (m/^Matrix:\s+(.+)\s*$/oxi) {
1551 $self->element(
1553 'Name' => 'Parameters_matrix',
1554 'Data' => $1
1558 elsif (/^Gapped/) {
1559 $gapped_stats = 1;
1561 elsif (/^Lambda/) {
1562 $_ = $self->_readline;
1563 s/^\s+//;
1564 my ( $lambda, $kappa, $entropy ) = split;
1565 if ($gapped_stats) {
1566 $self->element(
1568 'Name' => "Statistics_gapped_lambda",
1569 'Data' => $lambda
1572 $self->element(
1574 'Name' => "Statistics_gapped_kappa",
1575 'Data' => $kappa
1578 $self->element(
1580 'Name' => "Statistics_gapped_entropy",
1581 'Data' => $entropy
1585 else {
1586 $self->element(
1588 'Name' => "Statistics_lambda",
1589 'Data' => $lambda
1592 $self->element(
1594 'Name' => "Statistics_kappa",
1595 'Data' => $kappa
1598 $self->element(
1600 'Name' => "Statistics_entropy",
1601 'Data' => $entropy
1606 elsif (m/effective\s+search\s+space\s+used:\s+(\d+)/ox) {
1607 $self->element(
1609 'Name' => 'Statistics_eff-spaceused',
1610 'Data' => $1
1614 elsif (m/effective\s+search\s+space:\s+(\d+)/ox) {
1615 $self->element(
1617 'Name' => 'Statistics_eff-space',
1618 'Data' => $1
1622 elsif (
1623 m/Gap\s+Penalties:\s+Existence:\s+(\d+)\,
1624 \s+Extension:\s+(\d+)/ox
1627 $self->element(
1629 'Name' => 'Parameters_gap-open',
1630 'Data' => $1
1633 $self->element(
1635 'Name' => 'Parameters_gap-extend',
1636 'Data' => $2
1640 elsif (/effective\s+HSP\s+length:\s+(\d+)/) {
1641 $self->element(
1643 'Name' => 'Statistics_hsp-len',
1644 'Data' => $1
1648 elsif (/effective\s+length\s+of\s+query:\s+([\d\,]+)/) {
1649 my $c = $1;
1650 $c =~ s/\,//g;
1651 $self->element(
1653 'Name' => 'Statistics_query-len',
1654 'Data' => $c
1658 elsif (/effective\s+length\s+of\s+database:\s+([\d\,]+)/) {
1659 my $c = $1;
1660 $c =~ s/\,//g;
1661 $self->element(
1663 'Name' => 'Statistics_eff-dblen',
1664 'Data' => $c
1668 elsif (
1669 /^(T|A|X1|X2|X3|S1|S2):\s+(\d+(\.\d+)?)\s+(?:\(\s*(\d+\.\d+) bits\))?/
1672 my $v = $2;
1673 chomp($v);
1674 $self->element(
1676 'Name' => "Statistics_$1",
1677 'Data' => $v
1680 if ( defined $4 ) {
1681 $self->element(
1683 'Name' => "Statistics_$1_bits",
1684 'Data' => $4
1689 elsif (
1690 m/frameshift\s+window\,
1691 \s+decay\s+const:\s+(\d+)\,\s+([\.\d]+)/x
1694 $self->element(
1696 'Name' => 'Statistics_framewindow',
1697 'Data' => $1
1700 $self->element(
1702 'Name' => 'Statistics_decay',
1703 'Data' => $2
1707 elsif (m/^Number\s+of\s+Hits\s+to\s+DB:\s+(\S+)/ox) {
1708 $self->element(
1710 'Name' => 'Statistics_hit_to_db',
1711 'Data' => $1
1715 elsif (m/^Number\s+of\s+extensions:\s+(\S+)/ox) {
1716 $self->element(
1718 'Name' => 'Statistics_num_extensions',
1719 'Data' => $1
1723 elsif (
1724 m/^Number\s+of\s+successful\s+extensions:\s+
1725 (\S+)/ox
1728 $self->element(
1730 'Name' => 'Statistics_num_suc_extensions',
1731 'Data' => $1
1735 elsif (
1736 m/^Number\s+of\s+sequences\s+better\s+than\s+
1737 (\S+):\s+(\d+)/ox
1740 $self->element(
1742 'Name' => 'Parameters_expect',
1743 'Data' => $1
1746 $self->element(
1748 'Name' => 'Statistics_seqs_better_than_cutoff',
1749 'Data' => $2
1753 elsif (/^\s+Posted\s+date:\s+(.+)/) {
1754 my $d = $1;
1755 chomp($d);
1756 $self->element(
1758 'Name' => 'Statistics_posted_date',
1759 'Data' => $d
1763 elsif ( !/^\s+$/ ) {
1764 #$self->debug( "unmatched stat $_");
1767 $last = $_;
1769 } elsif ( $self->in_element('hsp') ) {
1770 $self->debug("blast.pm: Processing HSP\n");
1771 # let's read 3 lines at a time;
1772 # bl2seq hackiness... Not sure I like
1773 $self->{'_reporttype'} ||= $DEFAULTREPORTTYPE;
1774 my %data = (
1775 'Query' => '',
1776 'Mid' => '',
1777 'Hit' => ''
1779 my $len;
1780 for ( my $i = 0 ; defined($_) && $i < 3 ; $i++ ) {
1781 # $self->debug("$i: $_") if $v;
1782 if ( ( $i == 0 && /^\s+$/) ||
1783 /^\s*(?:Lambda|Minus|Plus|Score)/i )
1785 $self->_pushback($_) if defined $_;
1786 $self->end_element( { 'Name' => 'Hsp' } );
1787 last;
1789 chomp;
1790 if (/^((Query|Sbjct):?\s+(\-?\d+)\s*)(\S+)\s+(\-?\d+)/) {
1791 my ( $full, $type, $start, $str, $end ) =
1792 ( $1, $2, $3, $4, $5 );
1794 if ( $str eq '-' ) {
1795 $i = 3 if $type eq 'Sbjct';
1797 else {
1798 $data{$type} = $str;
1800 $len = length($full);
1801 $self->{"\_$type"}->{'begin'} = $start
1802 unless $self->{"_$type"}->{'begin'};
1803 $self->{"\_$type"}->{'end'} = $end;
1804 } else {
1805 $self->throw("no data for midline $_")
1806 unless ( defined $_ && defined $len );
1807 $data{'Mid'} = substr( $_, $len );
1809 $_ = $self->_readline();
1811 $self->characters(
1813 'Name' => 'Hsp_qseq',
1814 'Data' => $data{'Query'}
1817 $self->characters(
1819 'Name' => 'Hsp_hseq',
1820 'Data' => $data{'Sbjct'}
1823 $self->characters(
1825 'Name' => 'Hsp_midline',
1826 'Data' => $data{'Mid'}
1830 else {
1831 #$self->debug("blast.pm: unrecognized line $_");
1835 $self->debug("blast.pm: End of BlastOutput\n");
1836 if ( $self->{'_seentop'} ) {
1837 $self->within_element('hsp')
1838 && $self->end_element( { 'Name' => 'Hsp' } );
1839 $self->within_element('hit')
1840 && $self->end_element( { 'Name' => 'Hit' } );
1841 # cleanup extra hits
1842 $self->_cleanup_hits(\@hit_signifs) if scalar(@hit_signifs);
1843 $self->within_element('iteration')
1844 && $self->end_element( { 'Name' => 'Iteration' } );
1845 if ($bl2seq_fix) {
1846 $self->element(
1848 'Name' => 'BlastOutput_program',
1849 'Data' => $reporttype
1853 $self->end_element( { 'Name' => 'BlastOutput' } );
1855 return $self->end_document();

# Private method for internal use only.
sub _start_blastoutput {
    my $self = shift;
    $self->start_element( { 'Name' => 'BlastOutput' } );
    $self->{'_seentop'} = 1;
    $self->{'_result_count'}++;
    $self->{'_handler_rc'} = undef;
}

=head2 _will_handle

 Title   : _will_handle
 Usage   : Private method. For internal use only.
              if( $self->_will_handle($type) ) { ... }
 Function: Provides an optimized way to check whether or not an element of a
           given type is to be handled.
 Returns : Reference to EventHandler object if the element type is to be handled.
           undef if the element type is not to be handled.
 Args    : string containing type of element.

Optimizations:

=over 2

=item 1

Using the cached pointer to the EventHandler to minimize repeated
lookups.

=item 2

Caching the will_handle status for each type that is encountered so
that it only need be checked by calling
handler-E<gt>will_handle($type) once.

=back

This does not lead to a major savings by itself (only 5-10%). In
combination with other optimizations, or for large parse jobs, the
savings could be significant.

To test against the unoptimized version, remove the parentheses from
around the third term in the ternary " ? : " operator and add two
calls to $self-E<gt>_eventHandler().
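
The caching idea can be tried in isolation. The following is only a
minimal sketch, not code from this module: CountingHandler,
cached_will_handle and the %cache hash are hypothetical names made up for
the illustration. It uses the same "cached value, else ask once and
remember" ternary and counts how often the handler is really consulted.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an event handler; it counts how often
# will_handle() is actually consulted.
package CountingHandler;
sub new { return bless { calls => 0 }, shift }

sub will_handle {
    my ( $self, $type ) = @_;
    $self->{calls}++;
    return $type eq 'hit' ? 1 : 0;
}

package main;

my $handler = CountingHandler->new;
my %cache;

# Same shape as the ternary in _will_handle: return the cached answer if
# one exists, otherwise ask the handler once and remember the result.
sub cached_will_handle {
    my ($type) = @_;
    my $ok =
      defined $cache{$type}
      ? $cache{$type}
      : ( $cache{$type} = $handler->will_handle($type) );
    return $ok ? $handler : undef;
}

cached_will_handle('hit') for 1 .. 1000;
cached_will_handle('hsp') for 1 .. 1000;
print "handler consulted $handler->{calls} times\n";
```

Because a negative answer (0) is still a defined value, it is cached as
well, so each type costs one real will_handle() call no matter how many
elements of that type are seen.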

=cut

sub _will_handle {
    my ( $self, $type ) = @_;
    my $handler = $self->{'_handler_cache'};
    my $will_handle =
      defined( $self->{'_will_handle_cache'}->{$type} )
      ? $self->{'_will_handle_cache'}->{$type}
      : ( $self->{'_will_handle_cache'}->{$type} =
          $handler->will_handle($type) );

    return $will_handle ? $handler : undef;
}

=head2 start_element

 Title   : start_element
 Usage   : $eventgenerator->start_element
 Function: Handles a start element event
 Returns : none
 Args    : hashref with at least 2 keys 'Data' and 'Name'

=cut

sub start_element {
    my ( $self, $data ) = @_;

    # we currently don't care about attributes
    my $nm   = $data->{'Name'};
    my $type = $MODEMAP{$nm};
    if ($type) {
        my $handler = $self->_will_handle($type);
        if ($handler) {
            my $func = sprintf( "start_%s", lc $type );
            $self->{'_handler_rc'} = $handler->$func( $data->{'Attributes'} );
        }
        #else {
            #$self->debug( # changed 4/29/2006 to play nice with other event handlers
            #    "Bio::SearchIO::InternalParserError ".
            #    "\nCan't handle elements of type \'$type.\'"
            #);
        #}
        unshift @{ $self->{'_elements'} }, $type;
        if ( $type eq 'result' ) {
            $self->{'_values'} = {};
            $self->{'_result'} = undef;
        } else {
            # cleanup some things
            if ( defined $self->{'_values'} ) {
                foreach my $k (
                    grep { /^\U$type\-/ }
                    keys %{ $self->{'_values'} }
                  )
                {
                    delete $self->{'_values'}->{$k};
                }
            }
        }
    }
}

=head2 end_element

 Title   : end_element
 Usage   : $eventgenerator->end_element
 Function: Handles an end element event
 Returns : hashref with an element's worth of data
 Args    : hashref with at least 2 keys 'Data' and 'Name'


=cut

sub end_element {
    my ( $self, $data ) = @_;

    my $nm = $data->{'Name'};
    my $type;
    my $rc;
    if ( $nm eq 'BlastOutput_program' ) {
        if ( $self->{'_last_data'} =~ /(t?blast[npx])/i ) {
            $self->{'_reporttype'} = uc $1;
        }
        $self->{'_reporttype'} ||= $DEFAULTREPORTTYPE;
    }

    # Hsps are sort of weird, in that they end when another
    # object begins so have to detect this in end_element for now
    if ( $nm eq 'Hsp' ) {
        foreach (qw(Hsp_qseq Hsp_midline Hsp_hseq Hsp_features)) {
            $self->element(
                {
                    'Name' => $_,
                    'Data' => $self->{'_last_hspdata'}->{$_}
                }
            ) if defined $self->{'_last_hspdata'}->{$_};
        }
        $self->{'_last_hspdata'} = {};
        $self->element(
            {
                'Name' => 'Hsp_query-from',
                'Data' => $self->{'_Query'}->{'begin'}
            }
        );
        $self->element(
            {
                'Name' => 'Hsp_query-to',
                'Data' => $self->{'_Query'}->{'end'}
            }
        );

        $self->element(
            {
                'Name' => 'Hsp_hit-from',
                'Data' => $self->{'_Sbjct'}->{'begin'}
            }
        );
        $self->element(
            {
                'Name' => 'Hsp_hit-to',
                'Data' => $self->{'_Sbjct'}->{'end'}
            }
        );

        # } elsif( $nm eq 'Iteration' ) {
        # Nothing special needs to be done here.
    }
    if ( $type = $MODEMAP{$nm} ) {
        my $handler = $self->_will_handle($type);
        if ($handler) {
            my $func = sprintf( "end_%s", lc $type );
            $rc = $handler->$func( $self->{'_reporttype'}, $self->{'_values'} );
        }
        shift @{ $self->{'_elements'} };

    }
    elsif ( $MAPPING{$nm} ) {
        if ( ref( $MAPPING{$nm} ) =~ /hash/i ) {

            # this is where we shove in the data from the
            # hashref info about params or statistics
            my $key = ( keys %{ $MAPPING{$nm} } )[0];
            $self->{'_values'}->{$key}->{ $MAPPING{$nm}->{$key} } =
              $self->{'_last_data'};
        }
        else {
            $self->{'_values'}->{ $MAPPING{$nm} } = $self->{'_last_data'};
        }
    }
    else {
        #$self->debug("blast.pm: unknown nm $nm, ignoring\n");
    }
    $self->{'_last_data'} = '';    # remove read data if we are at
                                   # end of an element
    $self->{'_result'} = $rc if ( defined $type && $type eq 'result' );
    return $rc;
}

=head2 element

 Title   : element
 Usage   : $eventgenerator->element({'Name' => $name, 'Data' => $str});
 Function: Convenience method that calls characters and end_element
           (the start_element call is currently commented out)
 Returns : none
 Args    : Hash ref with the keys 'Name' and 'Data'


=cut

sub element {
    my ( $self, $data ) = @_;
    # Note that start element isn't needed for character data
    # Not too SAX-y, though
    #$self->start_element($data);
    $self->characters($data);
    $self->end_element($data);
}

=head2 characters

 Title   : characters
 Usage   : $eventgenerator->characters($str)
 Function: Sends a character event
 Returns : none
 Args    : string


=cut

sub characters {
    my ( $self, $data ) = @_;
    if ( $self->in_element('hsp')
        && $data->{'Name'} =~ /^Hsp\_(qseq|hseq|midline)$/ )
    {
        $self->{'_last_hspdata'}->{ $data->{'Name'} } .= $data->{'Data'}
          if defined $data->{'Data'};
    }
    return unless ( defined $data->{'Data'} && $data->{'Data'} !~ /^\s+$/ );
    $self->{'_last_data'} = $data->{'Data'};
}

=head2 within_element

 Title   : within_element
 Usage   : if( $eventgenerator->within_element($element) ) {}
 Function: Test if we are within a particular element.
           This differs from 'in_element' in that it checks every
           currently open element, so it is true for the whole block
           enclosed by that element.
 Returns : boolean
 Args    : string element name

See Also: L<in_element>

=cut

sub within_element {
    my ( $self, $name ) = @_;
    return 0
      if ( !defined $name && !defined $self->{'_elements'}
        || scalar @{ $self->{'_elements'} } == 0 );
    foreach ( @{ $self->{'_elements'} } ) {
        if ( $_ eq $name ) {
            return 1;
        }
    }
    return 0;
}

=head2 in_element

 Title   : in_element
 Usage   : if( $eventgenerator->in_element($element) ) {}
 Function: Test if we are in a particular element.
           This differs from 'within_element' in that it checks only
           the innermost (most recently opened) element.
 Returns : boolean
 Args    : string element name

See Also: L<within_element>
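
The difference between the two tests is easiest to see on a small stack.
This is a minimal sketch under stated assumptions: the stack contents and
the helper names in_elem and within_elem are made up for the illustration
and are not part of this module. The innermost open element comes first,
as it does after the unshift in start_element.

```perl
use strict;
use warnings;

# Innermost open element comes first, mirroring the unshift in
# start_element(). The stack contents here are invented for the example.
my @elements = ( 'hsp', 'hit', 'iteration', 'result' );

# True only when $name is the innermost open element.
sub in_elem {
    my ($name) = @_;
    return 0 unless defined $elements[0];
    return $elements[0] eq $name ? 1 : 0;
}

# True when $name is open at any depth of the stack.
sub within_elem {
    my ($name) = @_;
    return ( grep { $_ eq $name } @elements ) ? 1 : 0;
}

printf "in hit: %d, within hit: %d\n", in_elem('hit'), within_elem('hit');
printf "in hsp: %d, within hsp: %d\n", in_elem('hsp'), within_elem('hsp');
```

While an HSP is being parsed, the enclosing hit is still open, so the
"within" test is true for 'hit' but the "in" test is not.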

=cut

sub in_element {
    my ( $self, $name ) = @_;
    return 0 if !defined $self->{'_elements'}->[0];
    return ( $self->{'_elements'}->[0] eq $name );
}

=head2 start_document

 Title   : start_document
 Usage   : $eventgenerator->start_document
 Function: Handle a start document event
 Returns : none
 Args    : none


=cut

sub start_document {
    my ($self) = @_;
    $self->{'_lasttype'} = '';
    $self->{'_values'}   = {};
    $self->{'_result'}   = undef;
    $self->{'_elements'} = [];
}

=head2 end_document

 Title   : end_document
 Usage   : $eventgenerator->end_document
 Function: Handles an end document event
 Returns : Bio::Search::Result::ResultI object
 Args    : none


=cut

sub end_document {
    my ( $self, @args ) = @_;

    #$self->debug("blast.pm: end_document\n");
    return $self->{'_result'};
}

sub write_result {
    my ( $self, $blast, @args ) = @_;

    if ( not defined( $self->writer ) ) {
        $self->warn("Writer not defined. Using a $DEFAULT_BLAST_WRITER_CLASS");
        $self->writer( $DEFAULT_BLAST_WRITER_CLASS->new() );
    }
    $self->SUPER::write_result( $blast, @args );
}

sub result_count {
    my $self = shift;
    return $self->{'_result_count'};
}

sub report_count { shift->result_count }

=head2 inclusion_threshold

 Title   : inclusion_threshold
 Usage   : my $incl_thresh = $isreb->inclusion_threshold;
         : $isreb->inclusion_threshold(1e-5);
 Function: Get/Set the e-value threshold for inclusion in the PSI-BLAST
           score matrix model (blastpgp) that was used for generating the reports
           being parsed.
 Returns : number (real)
           Default value: $Bio::SearchIO::IteratedSearchResultEventBuilder::DEFAULT_INCLUSION_THRESHOLD
 Args    : number (real)  (e.g., 0.0001 or 1e-4 )

=cut

# Delegates to the event handler.
sub inclusion_threshold {
    shift->_eventHandler->inclusion_threshold(@_);
}

=head2 max_significance

 Usage     : $obj->max_significance();
 Purpose   : Set/Get the P or Expect value used as significance screening cutoff.
             This is the value of the -signif parameter supplied to new().
             Hits with P or E-value above this are skipped.
 Returns   : Scientific notation number with this format: 1.0e-05.
 Argument  : Scientific notation number or float (when setting)
 Comments  : Screening of significant hits uses the data provided on the
           : description line. For NCBI BLAST1 and WU-BLAST, this data
           : is a P-value. For NCBI BLAST2 it is an Expect value.

=cut

sub max_significance { shift->{'_handler_cache'}->max_significance(@_) }

=head2 signif

Synonym for L<max_significance()|max_significance>

=cut

sub signif { shift->max_significance(@_) }

=head2 min_score

 Usage     : $obj->min_score();
 Purpose   : Set/Get the Blast score used as screening cutoff.
             This is the value of the -score parameter supplied to new().
             Hits with scores below this are skipped.
 Returns   : Integer or scientific notation number.
 Argument  : Integer or scientific notation number (when setting)
 Comments  : Screening of significant hits uses the data provided on the
           : description line.

=cut

# Delegate to the handler's min_score(), not max_significance().
sub min_score { shift->{'_handler_cache'}->min_score(@_) }

=head2 min_query_length

 Usage     : $obj->min_query_length();
 Purpose   : Set/Get the query sequence length used as screening criteria.
             This is the value of the -min_query_len parameter supplied to new().
             Hits with sequence length below this are skipped.
 Returns   : Integer
 Argument  : Integer > 0 (when setting)

=cut

sub min_query_length {
    my $self = shift;
    if (@_) {
        my $min_qlen = shift;
        if ( $min_qlen =~ /\D/ or $min_qlen <= 0 ) {
            $self->throw(
                -class => 'Bio::Root::BadParameter',
                -text  => "Invalid minimum query length value: $min_qlen\n"
                  . "Value must be an integer > 0. Value not set.",
                -value => $min_qlen
            );
        }
        $self->{'_confirm_qlength'}  = 1;
        $self->{'_min_query_length'} = $min_qlen;
    }

    return $self->{'_min_query_length'};
}

=head2 best_hit_only

 Title     : best_hit_only
 Usage     : print "only getting best hit.\n" if $obj->best_hit_only;
 Purpose   : Set/Get the indicator for whether or not to process only
           : the best BlastHit.
 Returns   : Boolean (1 | 0)
 Argument  : Boolean (1 | 0) (when setting)

=cut

sub best_hit_only {
    my $self = shift;
    if (@_) { $self->{'_best'} = shift; }
    $self->{'_best'};
}

=head2 check_all_hits

 Title     : check_all_hits
 Usage     : print "checking all hits.\n" if $obj->check_all_hits;
 Purpose   : Set/Get the indicator for whether or not to process all hits.
           : If false, the parser will stop processing hits after the
           : first non-significant hit or the first hit that fails
           : any hit filter.
 Returns   : Boolean (1 | 0)
 Argument  : Boolean (1 | 0) (when setting)

=cut

sub check_all_hits {
    my $self = shift;
    if (@_) { $self->{'_check_all'} = shift; }
    $self->{'_check_all'};
}

# commented out, using common base class util method
#=head2 _get_accession_version
#
# Title   : _get_accession_version
# Usage   : my ($acc,$ver) = &_get_accession_version($id)
# Function: Private function to get an accession,version pair
#           for an ID (if it is in NCBI format)
# Returns : 2-tuple of accession, version
# Args    : ID string to process
#
#
#=cut
#
#sub _get_accession_version {
#    my $id = shift;
#
#    # handle case when this is accidentally called as a class method
#    if ( ref($id) && $id->isa('Bio::SearchIO') ) {
#        $id = shift;
#    }
#    return unless defined $id;
#    my ( $acc, $version );
#    if ( $id =~ /(gb|emb|dbj|sp|pdb|bbs|ref|lcl)\|(.*)\|(.*)/ ) {
#        ( $acc, $version ) = split /\./, $2;
#    }
#    elsif ( $id =~ /(pir|prf|pat|gnl)\|(.*)\|(.*)/ ) {
#        ( $acc, $version ) = split /\./, $3;
#    }
#    else {
#
#        #punt, not matching the db's at ftp://ftp.ncbi.nih.gov/blast/db/README
#        #Database Name                     Identifier Syntax
#        #============================      ========================
#        #GenBank                           gb|accession|locus
#        #EMBL Data Library                 emb|accession|locus
#        #DDBJ, DNA Database of Japan       dbj|accession|locus
#        #NBRF PIR                          pir||entry
#        #Protein Research Foundation       prf||name
#        #SWISS-PROT                        sp|accession|entry name
#        #Brookhaven Protein Data Bank      pdb|entry|chain
#        #Patents                           pat|country|number
#        #GenInfo Backbone Id               bbs|number
#        #General database identifier       gnl|database|identifier
#        #NCBI Reference Sequence           ref|accession|locus
#        #Local Sequence identifier         lcl|identifier
#        $acc = $id;
#    }
#    return ( $acc, $version );
#}

# general private method used to make minimal hits from leftover
# data in the hit table

sub _cleanup_hits {
    my ($self, $hits) = @_;
    while ( my $v = shift @{ $hits }) {
        next unless defined $v;
        $self->start_element( { 'Name' => 'Hit' } );
        my $id   = $v->[2];
        my $desc = $v->[3];
        $self->element(
            {
                'Name' => 'Hit_id',
                'Data' => $id
            }
        );
        my ($gi, $acc, $version ) = $self->_get_seq_identifiers($id);
        $self->element(
            {
                'Name' => 'Hit_accession',
                'Data' => $acc
            }
        );
        if ( defined $v ) {
            $self->element(
                {
                    'Name' => 'Hit_signif',
                    'Data' => $v->[0]
                }
            );
            if (exists $self->{'_wublast'}) {
                $self->element(
                    {
                        'Name' => 'Hit_score',
                        'Data' => $v->[1]
                    }
                );
            } else {
                $self->element(
                    {
                        'Name' => 'Hit_bits',
                        'Data' => $v->[1]
                    }
                );
            }
        }

        $self->element(
            {
                'Name' => 'Hit_def',
                'Data' => $desc
            }
        );
        $self->end_element( { 'Name' => 'Hit' } );
    }
}

1;

__END__

Developer Notes
---------------

The following information is added in hopes of increasing the
maintainability of this code. It runs the risk of becoming obsolete as
the code gets updated. As always, double check against the actual
source. If you find any discrepancies, please correct them.
[ This documentation added on 3 Jun 2003. ]

The logic is the brainchild of Jason Stajich, documented by Steve
Chervitz. Jason: please check it over and modify as you see fit.

Question:
   Elmo wants to know: How does this module unmarshall data from the input stream?
   (i.e., how does information from a raw input file get added to
    the correct Bioperl object?)

Answer:

   This answer is specific to SearchIO::blast, but may apply to other
   SearchIO.pm subclasses as well. The following description gives the
   basic idea. The actual processing is a little more complex for
   certain types of data (HSP, Report Parameters).

   You can think of blast::next_result() as faking a SAX XML parser,
   making a non-XML document behave like it is XML. The overhead to do this
   is quite substantial (~650 lines of code instead of ~80 in
   blastxml.pm).

  0. First, add a key => value pair for the datum of interest to %MAPPING
      Example:
               'Foo_bar'   => 'Foo-bar',

  1. next_result() collects the datum of interest from the input stream,
     and calls element().
      Example:
            $self->element({ 'Name' => 'Foo_bar',
                             'Data' => $foobar});

  2. The element() method is a convenience method that calls characters()
     and end_element(). (The start_element() call inside element() is
     currently commented out, since it is not needed for character data.)

  3. start_element() checks to see if the event handler can handle a start_xxx(),
     where xxx is the type that %MODEMAP assigns to the 'Name' parameter passed
     into start_element(), and calls start_xxx() if so. Otherwise,
     start_element() does not do anything.

     Data that will have such an event handler are defined in %MODEMAP.
     Typically, there are only handler methods for the main parts of
     the search result (e.g., Result, Iteration, Hit, HSP),
     which have corresponding Bioperl modules. So in this example,
     there was an earlier call such as $self->start_element({'Name'=>'Foo'})
     and the Foo_bar datum is meant to ultimately go into a Foo object.

     The start_foo() method in the handler will typically do any
     data initialization necessary to prepare for creating a new Foo object.
     Example: SearchResultEventBuilder::start_result()

  4. characters() takes the value of the 'Data' key from the hashref argument in
     the element() call and saves it in a local data member:
     Example:
             $self->{'_last_data'} = $data->{'Data'};

  5. end_element() is like start_element() in that it does the check for whether
     the event handler can handle end_xxx() and if so, calls it, passing in
     the data collected from all of the characters() calls that occurred
     since the start_xxx() call.

     If there isn't any special handler for the data type specified by 'Name',
     end_element() will place the data saved by characters() into another
     local data member that saves it in a hash with a key defined by %MAPPING.
     Example:
             $nm = $data->{'Name'};
             $self->{'_values'}->{$MAPPING{$nm}} = $self->{'_last_data'};

     In this case, $MAPPING{$nm} is 'Foo-bar'.

     end_element() finishes by resetting the local data member used by
     characters(). (i.e., $self->{'_last_data'} = '';)

  6. When the next_result() method encounters the end of the Foo element in the
     input stream, it will invoke $self->end_element({'Name'=>'Foo'}).
     end_element() then sends all of the data in the $self->{'_values'} hash
     to the handler's end_foo() method.
     Note that $self->{'_values'} is cleaned out during start_element(),
     keeping it at a reasonable size.

     In the event handler, the end_foo() method takes the hash from end_element()
     and creates a new hash containing the same data, but having keys lacking
     the 'Foo' prefix (e.g., 'Foo-bar' becomes '-bar'). The handler's end_foo()
     method then creates the Foo object, passing in this new hash as an argument.
     Example: SearchResultEventBuilder::end_result()

  7. Objects created from the data in the search result are managed by
     the event handler which adds them to a ResultI object (using API methods
     for that object). The ResultI object gets passed back to
     SearchIO::end_element() when it calls end_result().

     The ResultI object is then saved in an internal data member of the
     SearchIO object, which returns it at the end of next_result()
     by calling end_document().

     (Technical Note: All objects created by end_xxx() methods in the event
      handler are returned to SearchIO::end_element(), but the SearchIO object
      only cares about the ResultI objects.)

(Sesame Street aficionados note: This answer was NOT given by Mr. Noodle ;-P)
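
The flow described in steps 0 to 6 can be sketched as a small standalone
script. This is an illustration only, under stated assumptions: ToyParser,
ToyHandler and the 'Foo' entries in the two tables are hypothetical names
invented here, and the real methods do considerably more (handler caching,
the element stack, HSP special cases).

```perl
use strict;
use warnings;

# Hypothetical tables in the style of %MAPPING and %MODEMAP.
my %MAPPING = ( 'Foo_bar' => 'Foo-bar', 'Foo_baz' => 'Foo-baz' );
my %MODEMAP = ( 'Foo'     => 'foo' );

# Hypothetical handler: end_foo() strips the 'Foo' prefix from the keys,
# as described in step 6.
package ToyHandler;
sub new       { return bless {}, shift }
sub start_foo { return }

sub end_foo {
    my ( $self, $values ) = @_;
    my %args;
    for my $key ( keys %$values ) {
        ( my $short = $key ) =~ s/^Foo//;
        $args{$short} = $values->{$key};
    }
    return \%args;
}

# Hypothetical event generator with the same method names as the notes.
package ToyParser;

sub new {
    my ( $class, $handler ) = @_;
    return bless { handler => $handler, values => {}, last_data => '' }, $class;
}

sub start_element {    # step 3
    my ( $self, $data ) = @_;
    my $type = $MODEMAP{ $data->{Name} } or return;
    my $func = "start_$type";
    $self->{handler}->$func();
    $self->{values} = {};
    return;
}

sub characters {       # step 4
    my ( $self, $data ) = @_;
    $self->{last_data} = $data->{Data};
    return;
}

sub end_element {      # steps 5 and 6
    my ( $self, $data ) = @_;
    my $nm = $data->{Name};
    my $rc;
    if ( my $type = $MODEMAP{$nm} ) {
        my $func = "end_$type";
        $rc = $self->{handler}->$func( $self->{values} );
    }
    elsif ( $MAPPING{$nm} ) {
        $self->{values}{ $MAPPING{$nm} } = $self->{last_data};
    }
    $self->{last_data} = '';
    return $rc;
}

sub element {          # step 2
    my ( $self, $data ) = @_;
    $self->characters($data);
    return $self->end_element($data);
}

package main;

my $parser = ToyParser->new( ToyHandler->new );
$parser->start_element( { Name => 'Foo' } );
$parser->element( { Name => 'Foo_bar', Data => 42 } );       # step 1
$parser->element( { Name => 'Foo_baz', Data => 'xyz' } );
my $foo_args = $parser->end_element( { Name => 'Foo' } );
print join( ', ', map { "$_ => $foo_args->{$_}" } sort keys %$foo_args ), "\n";
```

The hash returned by the final end_element() call has the keys '-bar' and
'-baz', which is the form a constructor for the hypothetical Foo object
would receive.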