bug 2549; fixed small bug in Bio::Taxon which doesn't catch -common_name
[bioperl-live.git] / Bio / SearchIO / blast.pm
blobbdfba9642de604d71528f8b42c507d5b6d7fa198
1 # $Id$
3 # BioPerl module for Bio::SearchIO::blast
5 # Cared for by Jason Stajich <jason@bioperl.org>
7 # Copyright Jason Stajich
9 # You may distribute this module under the same terms as perl itself
11 # POD documentation - main docs before the code
13 # 20030409 - sac
14 # PSI-BLAST full parsing support. Rollout of new
15 # model which will remove Steve's old psiblast driver
16 # 20030424 - jason
17 # Megablast parsing fix as reported by Neil Saunders
18 # 20030427 - jason
19 # Support bl2seq parsing
20 # 20031124 - jason
21 # Parse more blast statistics, lambda, entropy, etc
22 # from WU-BLAST in frame-specific manner
23 # 20060216 - cjf - fixed blast parsing for BLAST v2.2.13 output
24 # 20071104 - dmessina - added support for WUBLAST -echofilter
25 # 20071121 - cjf - fixed several bugs (bugs 2391, 2399, 2409)
27 =head1 NAME
29 Bio::SearchIO::blast - Event generator for event based parsing of
30 blast reports
32 =head1 SYNOPSIS
34 # Do not use this object directly - it is used as part of the
35 # Bio::SearchIO system.
37 use Bio::SearchIO;
38 my $searchio = Bio::SearchIO->new(-format => 'blast',
39 -file => 't/data/ecolitst.bls');
40 while( my $result = $searchio->next_result ) {
41 while( my $hit = $result->next_hit ) {
42 while( my $hsp = $hit->next_hsp ) {
43 # ...
48 =head1 DESCRIPTION
50 This object encapsulated the necessary methods for generating events
51 suitable for building Bio::Search objects from a BLAST report file.
52 Read the L<Bio::SearchIO> for more information about how to use this.
54 This driver can parse:
56 =over 4
58 =item *
60 NCBI produced plain text BLAST reports from blastall, this also
61 includes PSIBLAST, PSITBLASTN, RPSBLAST, and bl2seq reports. NCBI XML
62 BLAST output is parsed with the blastxml SearchIO driver
64 =item *
66 WU-BLAST all reports
68 =item *
70 Jim Kent's BLAST-like output from his programs (BLASTZ, BLAT)
72 =item *
74 BLAST-like output from Paracel BTK output
76 =back
78 =head2 bl2seq parsing
80 Since I cannot differentiate between BLASTX and TBLASTN since bl2seq
81 doesn't report the algorithm used - I assume it is BLASTX by default -
82 you can supply the program type with -report_type in the SearchIO
83 constructor i.e.
85 my $parser = Bio::SearchIO->new(-format => 'blast',
86 -file => 'bl2seq.tblastn.report',
87 -report_type => 'tblastn');
89 This only really affects where the frame and strand information are
90 put - they will always be on the $hsp-E<gt>query instead of on the
91 $hsp-E<gt>hit part of the feature pair for blastx and tblastn bl2seq
92 produced reports. Hope that's clear...
94 =head1 FEEDBACK
96 =head2 Mailing Lists
98 User feedback is an integral part of the evolution of this and other
99 Bioperl modules. Send your comments and suggestions preferably to
100 the Bioperl mailing list. Your participation is much appreciated.
102 bioperl-l@bioperl.org - General discussion
103 http://bioperl.org/wiki/Mailing_lists - About the mailing lists
105 =head2 Reporting Bugs
107 Report bugs to the Bioperl bug tracking system to help us keep track
108 of the bugs and their resolution. Bug reports can be submitted via the
109 web:
111 http://bugzilla.open-bio.org/
113 =head1 AUTHOR - Jason Stajich
115 Email Jason Stajich jason-at-bioperl.org
117 =head1 CONTRIBUTORS
119 Steve Chervitz sac-at-bioperl.org
121 =head1 APPENDIX
123 The rest of the documentation details each of the object methods.
124 Internal methods are usually preceded with a _
126 =cut
128 # Let the code begin...'
130 package Bio::SearchIO::blast;
132 use Bio::SearchIO::IteratedSearchResultEventBuilder;
133 use strict;
134 use vars qw(%MAPPING %MODEMAP
135 $DEFAULT_BLAST_WRITER_CLASS
136 $MAX_HSP_OVERLAP
137 $DEFAULT_SIGNIF
138 $DEFAULT_SCORE
139 $DEFAULTREPORTTYPE
143 use base qw(Bio::SearchIO);
144 use Data::Dumper;
146 BEGIN {
148 # mapping of NCBI Blast terms to Bioperl hash keys
149 %MODEMAP = (
150 'BlastOutput' => 'result',
151 'Iteration' => 'iteration',
152 'Hit' => 'hit',
153 'Hsp' => 'hsp'
156 # This should really be done more intelligently, like with
157 # XSLT
159 %MAPPING = (
160 'Hsp_bit-score' => 'HSP-bits',
161 'Hsp_score' => 'HSP-score',
162 'Hsp_evalue' => 'HSP-evalue',
163 'Hsp_n', => 'HSP-n',
164 'Hsp_pvalue' => 'HSP-pvalue',
165 'Hsp_query-from' => 'HSP-query_start',
166 'Hsp_query-to' => 'HSP-query_end',
167 'Hsp_hit-from' => 'HSP-hit_start',
168 'Hsp_hit-to' => 'HSP-hit_end',
169 'Hsp_positive' => 'HSP-conserved',
170 'Hsp_identity' => 'HSP-identical',
171 'Hsp_gaps' => 'HSP-hsp_gaps',
172 'Hsp_hitgaps' => 'HSP-hit_gaps',
173 'Hsp_querygaps' => 'HSP-query_gaps',
174 'Hsp_qseq' => 'HSP-query_seq',
175 'Hsp_hseq' => 'HSP-hit_seq',
176 'Hsp_midline' => 'HSP-homology_seq',
177 'Hsp_align-len' => 'HSP-hsp_length',
178 'Hsp_query-frame' => 'HSP-query_frame',
179 'Hsp_hit-frame' => 'HSP-hit_frame',
180 'Hsp_links' => 'HSP-links',
181 'Hsp_group' => 'HSP-hsp_group',
183 'Hit_id' => 'HIT-name',
184 'Hit_len' => 'HIT-length',
185 'Hit_accession' => 'HIT-accession',
186 'Hit_def' => 'HIT-description',
187 'Hit_signif' => 'HIT-significance',
188 # For NCBI blast, the description line contains bits.
189 # For WU-blast, the description line contains score.
190 'Hit_score' => 'HIT-score',
191 'Hit_bits' => 'HIT-bits',
193 'Iteration_iter-num' => 'ITERATION-number',
194 'Iteration_converged' => 'ITERATION-converged',
196 'BlastOutput_program' => 'RESULT-algorithm_name',
197 'BlastOutput_version' => 'RESULT-algorithm_version',
198 'BlastOutput_query-def' => 'RESULT-query_name',
199 'BlastOutput_query-len' => 'RESULT-query_length',
200 'BlastOutput_query-acc' => 'RESULT-query_accession',
201 'BlastOutput_query-gi' => 'RESULT-query_gi',
202 'BlastOutput_querydesc' => 'RESULT-query_description',
203 'BlastOutput_db' => 'RESULT-database_name',
204 'BlastOutput_db-len' => 'RESULT-database_entries',
205 'BlastOutput_db-let' => 'RESULT-database_letters',
206 'BlastOutput_inclusion-threshold' => 'RESULT-inclusion_threshold',
208 'Parameters_matrix' => { 'RESULT-parameters' => 'matrix' },
209 'Parameters_expect' => { 'RESULT-parameters' => 'expect' },
210 'Parameters_include' => { 'RESULT-parameters' => 'include' },
211 'Parameters_sc-match' => { 'RESULT-parameters' => 'match' },
212 'Parameters_sc-mismatch' => { 'RESULT-parameters' => 'mismatch' },
213 'Parameters_gap-open' => { 'RESULT-parameters' => 'gapopen' },
214 'Parameters_gap-extend' => { 'RESULT-parameters' => 'gapext' },
215 'Parameters_filter' => { 'RESULT-parameters' => 'filter' },
216 'Parameters_allowgaps' => { 'RESULT-parameters' => 'allowgaps' },
217 'Parameters_full_dbpath' => { 'RESULT-parameters' => 'full_dbpath' },
218 'Statistics_db-len' => { 'RESULT-statistics' => 'dbentries' },
219 'Statistics_db-let' => { 'RESULT-statistics' => 'dbletters' },
220 'Statistics_hsp-len' =>
221 { 'RESULT-statistics' => 'effective_hsplength' },
222 'Statistics_query-len' => { 'RESULT-statistics' => 'querylength' },
223 'Statistics_eff-space' => { 'RESULT-statistics' => 'effectivespace' },
224 'Statistics_eff-spaceused' =>
225 { 'RESULT-statistics' => 'effectivespaceused' },
226 'Statistics_eff-dblen' =>
227 { 'RESULT-statistics' => 'effectivedblength' },
228 'Statistics_kappa' => { 'RESULT-statistics' => 'kappa' },
229 'Statistics_lambda' => { 'RESULT-statistics' => 'lambda' },
230 'Statistics_entropy' => { 'RESULT-statistics' => 'entropy' },
231 'Statistics_gapped_kappa' => { 'RESULT-statistics' => 'kappa_gapped' },
232 'Statistics_gapped_lambda' =>
233 { 'RESULT-statistics' => 'lambda_gapped' },
234 'Statistics_gapped_entropy' =>
235 { 'RESULT-statistics' => 'entropy_gapped' },
237 'Statistics_framewindow' =>
238 { 'RESULT-statistics' => 'frameshiftwindow' },
239 'Statistics_decay' => { 'RESULT-statistics' => 'decayconst' },
241 'Statistics_hit_to_db' => { 'RESULT-statistics' => 'Hits_to_DB' },
242 'Statistics_num_suc_extensions' =>
243 { 'RESULT-statistics' => 'num_successful_extensions' },
245 # WU-BLAST stats
246 'Statistics_DFA_states' => { 'RESULT-statistics' => 'num_dfa_states' },
247 'Statistics_DFA_size' => { 'RESULT-statistics' => 'dfa_size' },
248 'Statistics_noprocessors' =>
249 { 'RESULT-statistics' => 'no_of_processors' },
250 'Statistics_neighbortime' =>
251 { 'RESULT-statistics' => 'neighborhood_generate_time' },
252 'Statistics_starttime' => { 'RESULT-statistics' => 'start_time' },
253 'Statistics_endtime' => { 'RESULT-statistics' => 'end_time' },
256 # add WU-BLAST Frame-Based Statistics
257 for my $frame ( 0 .. 3 ) {
258 for my $strand ( '+', '-' ) {
259 for my $ind (
260 qw(length efflength E S W T X X_gapped E2
261 E2_gapped S2)
264 $MAPPING{"Statistics_frame$strand$frame\_$ind"} =
265 { 'RESULT-statistics' => "Frame$strand$frame\_$ind" };
267 for my $val (qw(lambda kappa entropy )) {
268 for my $type (qw(used computed gapped)) {
269 my $key = "Statistics_frame$strand$frame\_$val\_$type";
270 my $val =
271 { 'RESULT-statistics' =>
272 "Frame$strand$frame\_$val\_$type" };
273 $MAPPING{$key} = $val;
279 # add Statistics
280 for my $stats (
281 qw(T A X1 X2 X3 S1 S2 X1_bits X2_bits X3_bits
282 S1_bits S2_bits num_extensions
283 num_successful_extensions
284 seqs_better_than_cutoff
285 posted_date
286 search_cputime total_cputime
287 search_actualtime total_actualtime
288 no_of_processors ctxfactor)
291 my $key = "Statistics_$stats";
292 my $val = { 'RESULT-statistics' => $stats };
293 $MAPPING{$key} = $val;
296 # add WU-BLAST Parameters
297 for my $param (
298 qw(span span1 span2 links warnings notes hspsepsmax
299 hspsepqmax topcomboN topcomboE postsw cpus wordmask
300 filter sort_by_pvalue sort_by_count sort_by_highscore
301 sort_by_totalscore sort_by_subjectlength noseqs gi qtype
302 qres V B Z Y M N)
305 my $key = "Parameters_$param";
306 my $val = { 'RESULT-parameters' => $param };
307 $MAPPING{$key} = $val;
310 $DEFAULT_BLAST_WRITER_CLASS = 'Bio::Search::Writer::HitTableWriter';
311 $MAX_HSP_OVERLAP = 2; # Used when tiling multiple HSPs.
312 $DEFAULTREPORTTYPE = 'BLASTP'; # for bl2seq
315 =head2 new
317 Title : new
318 Usage : my $obj = Bio::SearchIO::blast->new(%args);
319 Function: Builds a new Bio::SearchIO::blast object
320 Returns : Bio::SearchIO::blast
321 Args : Key-value pairs:
322 -fh/-file => filehandle/filename to BLAST file
323 -format => 'blast'
324 -report_type => 'blastx', 'tblastn', etc -- only for bl2seq
325 reports when you want to distinguish between
326 tblastn and blastx reports (this only controls
327 where the frame information is put - on the query
328 or subject object.
329 -inclusion_threshold => e-value threshold for inclusion in the
330 PSI-BLAST score matrix model (blastpgp)
331 -signif => float or scientific notation number to be used
332 as a P- or Expect value cutoff
333 -score => integer or scientific notation number to be used
334 as a blast score value cutoff
335 -bits => integer or scientific notation number to be used
336 as a bit score value cutoff
337 -hit_filter => reference to a function to be used for
338 filtering hits based on arbitrary criteria.
339 All hits of each BLAST report must satisfy
340 this criteria to be retained.
341 If a hit fails this test, it is ignored.
342 This function should take a
343 Bio::Search::Hit::BlastHit.pm object as its first
344 argument and return true
345 if the hit should be retained.
346 Sample filter function:
347 -hit_filter => sub { $hit = shift;
348 $hit->gaps == 0; },
349 (Note: -filt_func is synonymous with -hit_filter)
350 -overlap => integer. The amount of overlap to permit between
351 adjacent HSPs when tiling HSPs. A reasonable value is 2.
352 Default = $Bio::SearchIO::blast::MAX_HSP_OVERLAP.
354 The following criteria are not yet supported:
355 (these are probably best applied within this module rather than in the
356 event handler since they would permit the parser to take some shortcuts.)
358 -check_all_hits => boolean. Check all hits for significance against
359 significance criteria. Default = false.
360 If false, stops processing hits after the first
361 non-significant hit or the first hit that fails
362 the hit_filter call. This speeds parsing,
363 taking advantage of the fact that the hits
364 are processed in the order they appear in the report.
365 -min_query_len => integer to be used as a minimum for query sequence length.
366 Reports with query sequences below this length will
367 not be processed. Default = no minimum length.
368 -best => boolean. Only process the best hit of each report;
369 default = false.
371 =cut
373 sub _initialize {
374 my ( $self, @args ) = @_;
375 $self->SUPER::_initialize(@args);
377 # Blast reports require a specialized version of the SREB due to the
378 # possibility of iterations (PSI-BLAST). Forwarding all arguments to it. An
379 # issue here is that we want to set new default object factories if none are
380 # supplied.
382 my $handler = Bio::SearchIO::IteratedSearchResultEventBuilder->new(@args);
383 $self->attach_EventHandler($handler);
385 # 2006-04-26 move this to the attach_handler function in this module so we
386 # can really reset the handler
387 # Optimization: caching
388 # the EventHandler since it is used a lot during the parse.
390 # $self->{'_handler_cache'} = $handler;
392 my ( $min_qlen, $check_all, $overlap, $best, $rpttype ) = $self->_rearrange(
394 qw(MIN_LENGTH CHECK_ALL_HITS
395 OVERLAP BEST
396 REPORT_TYPE)
398 @args
401 defined $min_qlen && $self->min_query_length($min_qlen);
402 defined $best && $self->best_hit_only($best);
403 defined $check_all && $self->check_all_hits($check_all);
404 defined $rpttype && ( $self->{'_reporttype'} = $rpttype );
407 sub attach_EventHandler {
408 my ($self,$handler) = @_;
410 $self->SUPER::attach_EventHandler($handler);
412 # Optimization: caching the EventHandler since it is used a lot
413 # during the parse.
415 $self->{'_handler_cache'} = $handler;
416 return;
419 =head2 next_result
421 Title : next_result
422 Usage : my $hit = $searchio->next_result;
423 Function: Returns the next Result from a search
424 Returns : Bio::Search::Result::ResultI object
425 Args : none
427 =cut
429 sub next_result {
430 my ($self) = @_;
431 my $v = $self->verbose;
432 my $data = '';
433 my $flavor = '';
434 $self->{'_seentop'} = 0; # start next report at top
435 $self->{'_seentop'} = 0;
436 my ( $reporttype, $seenquery, $reportline );
437 my ( $seeniteration, $found_again );
438 my $incl_threshold = $self->inclusion_threshold;
439 my $bl2seq_fix;
440 $self->start_document(); # let the fun begin...
441 my (@hit_signifs);
442 my $gapped_stats = 0; # for switching between gapped/ungapped
443 # lambda, K, H
444 local $_ = "\n"; #consistency
445 PARSER:
446 while ( defined( $_ = $self->_readline ) ) {
447 next if (/^\s+$/); # skip empty lines
448 next if (/CPU time:/);
449 next if (/^>\s*$/);
450 if (
451 /^((?:\S+?)?BLAST[NPX])\s+(.+)$/i # NCBI BLAST, PSIBLAST
452 # RPSBLAST, MEGABLAST
453 || /^(P?GENEWISE|HFRAME|SWN|TSWN)\s+(.+)/i #Paracel BTK
456 ($reporttype, my $reportversion) = ($1, $2);
457 # need to keep track of whether this is WU-BLAST
458 if ($reportversion && $reportversion =~ m{WashU$}) {
459 $self->{'_wublast'}++;
461 $self->debug("blast.pm: Start of new report: $reporttype, $reportversion\n");
462 if ( $self->{'_seentop'} ) {
463 # This handles multi-result input streams
464 $self->_pushback($_);
465 last PARSER;
467 $self->_start_blastoutput;
468 if ($reporttype =~ /RPS-BLAST/) {
469 $reporttype .= '(BLASTP)'; # default RPS-BLAST type
471 $reportline = $_; # to fix the fact that RPS-BLAST output is wrong
472 $self->element(
474 'Name' => 'BlastOutput_program',
475 'Data' => $reporttype
479 $self->element(
481 'Name' => 'BlastOutput_version',
482 'Data' => $reportversion
485 $self->element(
487 'Name' => 'BlastOutput_inclusion-threshold',
488 'Data' => $incl_threshold
492 # added Windows workaround for bug 1985
493 elsif (/^(Searching|Results from round)/) {
494 next unless $1 =~ /Results from round/;
495 $self->debug("blast.pm: Possible psi blast iterations found...\n");
497 $self->in_element('hsp')
498 && $self->end_element( { 'Name' => 'Hsp' } );
499 $self->in_element('hit')
500 && $self->end_element( { 'Name' => 'Hit' } );
501 if ( defined $seeniteration ) {
502 $self->within_element('iteration')
503 && $self->end_element( { 'Name' => 'Iteration' } );
504 $self->start_element( { 'Name' => 'Iteration' } );
506 else {
507 $self->start_element( { 'Name' => 'Iteration' } );
509 $seeniteration = 1;
511 elsif (/^Query=\s*(.*)$/) {
512 my $q = $1;
513 $self->debug("blast.pm: Query= found...$_\n");
514 my $size = 0;
515 if ( defined $seenquery ) {
516 $self->_pushback($reportline) if $reportline;
517 $self->_pushback($_);
518 last PARSER;
520 else {
521 if ( !defined $reporttype ) {
522 $self->_start_blastoutput;
523 if ( defined $seeniteration ) {
524 $self->in_element('iteration')
525 && $self->end_element( { 'Name' => 'Iteration' } );
526 $self->start_element( { 'Name' => 'Iteration' } );
528 else {
529 $self->start_element( { 'Name' => 'Iteration' } );
531 $seeniteration = 1;
534 $seenquery = $q;
535 $_ = $self->_readline;
536 while ( defined($_) ) {
537 if (/^Database:/) {
538 $self->_pushback($_);
539 last;
541 # below line fixes length issue with BLAST v2.2.13; still works
542 # with BLAST v2.2.12
543 if ( /\((\-?[\d,]+)\s+letters.*\)/ || /^Length=(\-?[\d,]+)/ ) {
544 $size = $1;
545 $size =~ s/,//g;
546 last;
548 else {
549 # bug 2391
550 $q .= ($q =~ /\w$/ && $_ =~ /^\w/) ? " $_" : $_;
551 $q =~ s/\s+/ /g; # this catches the newline as well
552 $q =~ s/^ | $//g;
555 $_ = $self->_readline;
557 chomp($q);
558 my ( $nm, $desc ) = split( /\s+/, $q, 2 );
559 $self->element(
561 'Name' => 'BlastOutput_query-def',
562 'Data' => $nm
565 $self->element(
567 'Name' => 'BlastOutput_query-len',
568 'Data' => $size
571 defined $desc && $desc =~ s/\s+$//;
572 $self->element(
574 'Name' => 'BlastOutput_querydesc',
575 'Data' => $desc
578 my ( $gi, $acc, $version ) = $self->_get_seq_identifiers($nm);
579 $version = defined($version) && length($version) ? ".$version" : "";
580 $acc = '' unless defined($acc);
581 $self->element(
583 'Name' => 'BlastOutput_query-acc',
584 'Data' => "$acc$version"
588 # added check for WU-BLAST -echofilter option (bug 2388)
589 elsif (/^>Unfiltered[+-]1$/) {
590 # skip all of the lines of unfiltered sequence
591 while($_ !~ /^Database:/) {
592 $self->debug("Bypassing features line: $_");
593 $_ = $self->_readline;
595 $self->_pushback($_);
597 elsif (/Sequences producing significant alignments:/) {
598 $self->debug("blast.pm: Processing NCBI-BLAST descriptions\n");
599 $flavor = 'ncbi';
601 # PSI-BLAST parsing needs to be fixed to specifically look
602 # for old vs new per iteration, as sorting based on duplication
603 # leads to bugs, see bug 1986
605 # The next line is not necessarily whitespace in psiblast reports.
606 # Also note that we must look for the end of this section by testing
607 # for a line with a leading >. Blank lines occur with this section
608 # for psiblast.
609 if ( !$self->in_element('iteration') ) {
610 $self->start_element( { 'Name' => 'Iteration' } );
612 # these elements are dropped with some multiquery reports; add
613 # back here
614 $self->element(
616 'Name' => 'BlastOutput_db-len',
617 'Data' => $self->{'_blsdb_length'}
619 ) if $self->{'_blsdb_length'};
620 $self->element(
622 'Name' => 'BlastOutput_db-let',
623 'Data' => $self->{'_blsdb_letters'}
625 ) if $self->{'_blsdb_letters'};
626 $self->element(
628 'Name' => 'BlastOutput_db',
629 'Data' => $self->{'_blsdb'}
631 ) if $self->{'_blsdb_letters'};
633 # changed 7/15/2007 to only check certain lines, end while if something doesn't match
634 my $h_regex;
635 DESCLINE:
636 while ( defined( my $descline = $self->_readline() ) ) {
637 next if $descline =~ m{^\s*$};
638 # GCG multiline oddness...
639 if ($descline =~ /^(\S+)\s+Begin:\s\d+\s+End:\s+\d+/xms) {
640 my ($id, $nextline) = ($1, $self->_readline);
641 $nextline =~ s{^!}{};
642 $descline = "$id $nextline";
644 # NCBI style hit table (no N)
645 if ($descline =~ /(?<!cor) # negative lookahead
646 (\d+\.?(?:[\+\-eE]+)?\d+) # number (float or scientific notation)
647 \s+ # space
648 ([\de]+\.?\d*(?:[\+\-eE]+)?\d*) # number (float or scientific notation)
649 \s*$/xms) {
651 my ( $score, $evalue ) = ($1, $2);
653 # Some data clean-up so e-value will appear numeric to perl
654 $evalue =~ s/^e/1e/i;
656 # This to handle no-HSP case
657 my @line = split ' ',$descline;
659 # we want to throw away the score, evalue
660 pop @line, pop @line;
662 # and N if it is present (of course they are not
663 # really in that order, but it doesn't matter
664 if ($3) { pop @line }
666 # add the last 2 entries s.t. we can reconstruct
667 # a minimal Hit object at the end of the day
668 push @hit_signifs, [ $evalue, $score, shift @line, join( ' ', @line ) ];
669 } elsif ($descline =~ /^CONVERGED/i) {
670 $self->element(
672 'Name' => 'Iteration_converged',
673 'Data' => 1
676 } else {
677 $self->_pushback($descline); # Catch leading > (end of section)
678 last DESCLINE;
682 elsif (/Sequences producing High-scoring Segment Pairs:/) {
684 # This block is for WU-BLAST, so we don't have to check for psi-blast stuff
685 # skip the next line
686 $self->debug("blast.pm: Processing WU-BLAST descriptions\n");
687 $_ = $self->_readline();
688 $flavor = 'wu';
690 if ( !$self->in_element('iteration') ) {
691 $self->start_element( { 'Name' => 'Iteration' } );
694 while ( defined( $_ = $self->_readline() )
695 && !/^\s+$/ )
697 my @line = split;
698 pop @line; # throw away first number which is for 'N'col
700 # add the last 2 entries to array s.t. we can reconstruct
701 # a minimal Hit object at the end of the day
702 push @hit_signifs,
703 [ pop @line, pop @line, shift @line, join( ' ', @line ) ];
707 elsif (/^Database:\s*(.+)$/) {
709 $self->debug("blast.pm: Database: $1\n");
710 my $db = $1;
711 while ( defined( $_ = $self->_readline ) ) {
712 if (
713 /^\s+(\-?[\d\,]+|\S+)\s+sequences\;
714 \s+(\-?[\d,]+|\S+)\s+ # Deal with NCBI 2.2.8 OSX problems
715 total\s+letters/ox
718 my ( $s, $l ) = ( $1, $2 );
719 $s =~ s/,//g;
720 $l =~ s/,//g;
721 $self->element(
723 'Name' => 'BlastOutput_db-len',
724 'Data' => $s
727 $self->element(
729 'Name' => 'BlastOutput_db-let',
730 'Data' => $l
733 # cache for next round in cases with multiple queries
734 $self->{'_blsdb'} = $db;
735 $self->{'_blsdb_length'} = $s;
736 $self->{'_blsdb_letters'} = $l;
737 last;
739 else {
740 chomp;
741 $db .= $_;
744 $self->element(
746 'Name' => 'BlastOutput_db',
747 'Data' => $db
751 # bypasses this NCBI blast 2.2.13 extra output for now...
752 # Features in/flanking this part of subject sequence:
753 elsif (/^\sFeatures\s\w+\sthis\spart\sof\ssubject\ssequence:/xmso) {
754 # junk following lines up to start of HSP
755 while($_ !~ /^\sScore\s=/) {
756 $self->debug("Bypassing features line: $_");
757 $_ = $self->_readline;
759 $self->_pushback($_);
762 # move inside of a hit
763 elsif (/^>\s*(\S+)\s*(.*)?/) {
764 chomp;
766 $self->debug("blast.pm: Hit: $1\n");
767 $self->in_element('hsp')
768 && $self->end_element( { 'Name' => 'Hsp' } );
769 $self->in_element('hit')
770 && $self->end_element( { 'Name' => 'Hit' } );
772 # special case when bl2seq reports don't have a leading
773 # Query=
774 if ( !$self->within_element('result') ) {
775 $self->_start_blastoutput;
776 $self->start_element( { 'Name' => 'Iteration' } );
778 elsif ( !$self->within_element('iteration') ) {
779 $self->start_element( { 'Name' => 'Iteration' } );
781 $self->start_element( { 'Name' => 'Hit' } );
782 my $id = $1;
783 my $restofline = $2;
785 $self->debug("Starting a hit: $1 $2\n");
786 $self->element(
788 'Name' => 'Hit_id',
789 'Data' => $id
792 my ($gi, $acc, $version ) = $self->_get_seq_identifiers($id);
793 $self->element(
795 'Name' => 'Hit_accession',
796 'Data' => $acc
799 # add hit significance (from the hit table)
800 # this is where Bug 1986 went awry
802 # Changed for Bug2409; hit->significance and hit->score/bits derived
803 # from HSPs, not hit table unless necessary
805 HITTABLE:
806 while (my $v = shift @hit_signifs) {
807 my $tableid = $v->[2];
808 if ($tableid !~ m{\Q$id\E}) {
809 $self->debug("Hit table ID $tableid doesn't match current hit id $id, checking next hit table entry...\n");
810 next HITTABLE;
811 } else {
812 last HITTABLE;
815 while ( defined( $_ = $self->_readline() ) ) {
816 next if (/^\s+$/);
817 chomp;
818 if (/Length\s*=\s*([\d,]+)/) {
819 my $l = $1;
820 $l =~ s/\,//g;
821 $self->element(
823 'Name' => 'Hit_len',
824 'Data' => $l
827 last;
829 else {
830 s/^\s(?!\s)/\x01/; #new line to concatenate desc lines with <soh>
831 $restofline .= $_;
834 $restofline =~ s/\s+/ /g;
835 $self->element(
837 'Name' => 'Hit_def',
838 'Data' => $restofline
842 elsif (/\s+(Plus|Minus) Strand HSPs:/i) {
843 next;
845 elsif (
846 ( $self->in_element('hit') || $self->in_element('hsp') )
847 && # paracel genewise BTK
848 m/Score\s*=\s*(\S+)\s*bits\s* # Bit score
849 (?:\((\d+)\))?, # Raw score
850 \s+Log\-Length\sScore\s*=\s*(\d+) # Log-Length score
854 $self->in_element('hsp')
855 && $self->end_element( { 'Name' => 'Hsp' } );
856 $self->start_element( { 'Name' => 'Hsp' } );
858 $self->debug( "Got paracel genewise HSP score=$1\n");
860 # Some data clean-up so e-value will appear numeric to perl
861 my ( $bits, $score, $evalue ) = ( $1, $2, $3 );
862 $evalue =~ s/^e/1e/i;
863 $self->element(
865 'Name' => 'Hsp_score',
866 'Data' => $score
869 $self->element(
871 'Name' => 'Hsp_bit-score',
872 'Data' => $bits
875 $self->element(
877 'Name' => 'Hsp_evalue',
878 'Data' => $evalue
882 elsif (
883 ( $self->in_element('hit') || $self->in_element('hsp') )
884 && # paracel hframe BTK
885 m/Score\s*=\s*([^,\s]+), # Raw score
886 \s*Expect\s*=\s*([^,\s]+), # E-value
887 \s*P(?:\(\S+\))?\s*=\s*([^,\s]+) # P-value
891 $self->in_element('hsp')
892 && $self->end_element( { 'Name' => 'Hsp' } );
893 $self->start_element( { 'Name' => 'Hsp' } );
895 $self->debug( "Got paracel hframe HSP score=$1\n");
897 # Some data clean-up so e-value will appear numeric to perl
898 my ( $score, $evalue, $pvalue ) = ( $1, $2, $3 );
899 $evalue = "1$evalue" if $evalue =~ /^e/;
900 $pvalue = "1$pvalue" if $pvalue =~ /^e/;
902 $self->element(
904 'Name' => 'Hsp_score',
905 'Data' => $score
908 $self->element(
910 'Name' => 'Hsp_evalue',
911 'Data' => $evalue
914 $self->element(
916 'Name' => 'Hsp_pvalue',
917 'Data' => $pvalue
921 elsif (
922 ( $self->in_element('hit') || $self->in_element('hsp') )
923 && # wublast
924 m/Score\s*=\s*(\S+)\s* # Bit score
925 \(([\d\.]+)\s*bits\), # Raw score
926 \s*Expect\s*=\s*([^,\s]+), # E-value
927 \s*(?:Sum)?\s* # SUM
928 P(?:\(\d+\))?\s*=\s*([^,\s]+) # P-value
929 (?:\s*,\s+Group\s*\=\s*(\d+))? # HSP Group
932 { # wu-blast HSP parse
933 $self->in_element('hsp')
934 && $self->end_element( { 'Name' => 'Hsp' } );
935 $self->start_element( { 'Name' => 'Hsp' } );
937 # Some data clean-up so e-value will appear numeric to perl
938 my ( $score, $bits, $evalue, $pvalue, $group ) =
939 ( $1, $2, $3, $4, $5 );
940 $evalue =~ s/^e/1e/i;
941 $pvalue =~ s/^e/1e/i;
943 $self->element(
945 'Name' => 'Hsp_score',
946 'Data' => $score
949 $self->element(
951 'Name' => 'Hsp_bit-score',
952 'Data' => $bits
955 $self->element(
957 'Name' => 'Hsp_evalue',
958 'Data' => $evalue
961 $self->element(
963 'Name' => 'Hsp_pvalue',
964 'Data' => $pvalue
968 if ( defined $group ) {
969 $self->element(
971 'Name' => 'Hsp_group',
972 'Data' => $group
978 elsif (
979 ( $self->in_element('hit') || $self->in_element('hsp') )
980 && # ncbi blast, works with 2.2.17
981 m/Score\s*=\s*(\S+)\s*bits\s* # Bit score
982 (?:\((\d+)\))?, # Missing for BLAT pseudo-BLAST fmt
983 \s*Expect(?:\((\d+\+?)\))?\s*=\s*([^,\s]+) # E-value
986 { # parse NCBI blast HSP
987 $self->in_element('hsp')
988 && $self->end_element( { 'Name' => 'Hsp' } );
990 # Some data clean-up so e-value will appear numeric to perl
991 my ( $bits, $score, $n, $evalue ) = ( $1, $2, $3, $4 );
992 $evalue =~ s/^e/1e/i;
993 $self->start_element( { 'Name' => 'Hsp' } );
994 $self->element(
996 'Name' => 'Hsp_score',
997 'Data' => $score
1000 $self->element(
1002 'Name' => 'Hsp_bit-score',
1003 'Data' => $bits
1006 $self->element(
1008 'Name' => 'Hsp_evalue',
1009 'Data' => $evalue
1012 $self->element(
1014 'Name' => 'Hsp_n',
1015 'Data' => $n
1017 ) if defined $n;
1018 $score = '' unless defined $score; # deal with BLAT which
1019 # has no score only bits
1020 $self->debug("Got NCBI HSP score=$score, evalue $evalue\n");
1022 elsif (
1023 $self->in_element('hsp')
1024 && m/Identities\s*=\s*(\d+)\s*\/\s*(\d+)\s*[\d\%\(\)]+\s*
1025 (?:,\s*Positives\s*=\s*(\d+)\/(\d+)\s*[\d\%\(\)]+\s*)? # pos only valid for Protein alignments
1026 (?:\,\s*Gaps\s*=\s*(\d+)\/(\d+))? # Gaps
1027 /oxi
1030 $self->element(
1032 'Name' => 'Hsp_identity',
1033 'Data' => $1
1036 $self->element(
1038 'Name' => 'Hsp_align-len',
1039 'Data' => $2
1042 if ( defined $3 ) {
1043 $self->element(
1045 'Name' => 'Hsp_positive',
1046 'Data' => $3
1050 else {
1051 $self->element(
1053 'Name' => 'Hsp_positive',
1054 'Data' => $1
1058 if ( defined $6 ) {
1059 $self->element(
1061 'Name' => 'Hsp_gaps',
1062 'Data' => $5
1067 $self->{'_Query'} = { 'begin' => 0, 'end' => 0 };
1068 $self->{'_Sbjct'} = { 'begin' => 0, 'end' => 0 };
1070 if (/(Frame\s*=\s*.+)$/) {
1072 # handle wu-blast Frame listing on same line
1073 $self->_pushback($1);
1076 elsif ( $self->in_element('hsp')
1077 && /Strand\s*=\s*(Plus|Minus)\s*\/\s*(Plus|Minus)/i )
1080 # consume this event ( we infer strand from start/end)
1081 unless ($reporttype) {
1082 $self->{'_reporttype'} = $reporttype = 'BLASTN';
1083 $bl2seq_fix = 1; # special case to resubmit the algorithm
1084 # reporttype
1086 next;
1088 elsif ( $self->in_element('hsp')
1089 && /Links\s*=\s*(\S+)/ox )
1091 $self->element(
1093 'Name' => 'Hsp_links',
1094 'Data' => $1
1098 elsif ( $self->in_element('hsp')
1099 && /Frame\s*=\s*([\+\-][1-3])\s*(\/\s*([\+\-][1-3]))?/ )
1102 # this is for bl2seq only
1103 unless ( defined $reporttype ) {
1104 $bl2seq_fix = 1;
1105 if ( $1 && $2 ) { $reporttype = 'TBLASTX' }
1106 else {
1107 $reporttype = 'BLASTX';
1109 # we can't distinguish between BLASTX and TBLASTN straight from the report }
1111 $self->{'_reporttype'} = $reporttype;
1114 my ( $queryframe, $hitframe );
1115 if ( $reporttype eq 'TBLASTX' ) {
1116 ( $queryframe, $hitframe ) = ( $1, $2 );
1117 $hitframe =~ s/\/\s*//g;
1119 elsif ( $reporttype eq 'TBLASTN' || $reporttype eq 'PSITBLASTN') {
1120 ( $hitframe, $queryframe ) = ( $1, 0 );
1122 elsif ( $reporttype eq 'BLASTX' || $reporttype eq 'RPS-BLAST(BLASTP)') {
1123 ( $queryframe, $hitframe ) = ( $1, 0 );
1124 # though NCBI doesn't report it, this is a special BLASTX-like
1125 # RPS-BLAST; should be handled differently
1126 if ($reporttype eq 'RPS-BLAST(BLASTP)') {
1127 $self->element(
1129 'Name' => 'BlastOutput_program',
1130 'Data' => 'RPS-BLAST(BLASTX)'
1135 $self->element(
1137 'Name' => 'Hsp_query-frame',
1138 'Data' => $queryframe
1142 $self->element(
1144 'Name' => 'Hsp_hit-frame',
1145 'Data' => $hitframe
1149 elsif (/^Parameters:/
1150 || /^\s+Database:\s+?/
1151 || /^\s+Subset/
1152 || /^\s*Lambda/
1153 || /^\s*Histogram/
1154 || ( $self->in_element('hsp') && /WARNING|NOTE/ ) )
1157 # Note: Lambda check was necessary to parse
1158 # t/data/ecoli_domains.rpsblast AND to parse bl2seq
1159 $self->debug("blast.pm: found parameters section \n");
1161 $self->in_element('hsp')
1162 && $self->end_element( { 'Name' => 'Hsp' } );
1163 $self->in_element('hit')
1164 && $self->end_element( { 'Name' => 'Hit' } );
1166 # This is for the case when we specify -b 0 (or B=0 for WU-BLAST)
1167 # and still want to construct minimal Hit objects
1168 $self->_cleanup_hits(\@hit_signifs) if scalar(@hit_signifs);
1169 $self->within_element('iteration')
1170 && $self->end_element( { 'Name' => 'Iteration' } );
1172 next if /^\s+Subset/;
1173 my $blast = (/^(\s+Database\:)|(\s*Lambda)/) ? 'ncbi' : 'wublast';
1174 if (/^\s*Histogram/) {
1175 $blast = 'btk';
1178 my $last = '';
1180 # default is that gaps are allowed
1181 $self->element(
1183 'Name' => 'Parameters_allowgaps',
1184 'Data' => 'yes'
1187 while ( defined( $_ = $self->_readline ) ) {
1188 if (
1189 /^((?:\S+)?BLAST[NPX])\s+(.+)$/i # NCBI BLAST, PSIBLAST
1190 # RPSBLAST, MEGABLAST
1191 || /^(P?GENEWISE|HFRAME|SWN|TSWN)\s+(.+)/i #Paracel BTK
1194 $self->_pushback($_);
1196 # let's handle this in the loop
1197 last;
1199 elsif (/^Query=/) {
1200 $self->_pushback($reportline) if $reportline;
1201 $self->_pushback($_);
1202 last PARSER;
1205 # here is where difference between wublast and ncbiblast
1206 # is better handled by different logic
1207 if ( /Number of Sequences:\s+([\d\,]+)/i
1208 || /of sequences in database:\s+(\-?[\d,]+)/i )
1210 my $c = $1;
1211 $c =~ s/\,//g;
1212 $self->element(
1214 'Name' => 'Statistics_db-len',
1215 'Data' => $c
1219 elsif (/letters in database:\s+(\-?[\d,]+)/i) {
1220 my $s = $1;
1221 $s =~ s/,//g;
1222 $self->element(
1224 'Name' => 'Statistics_db-let',
1225 'Data' => $s
1229 elsif ( $blast eq 'btk' ) {
1230 next;
1232 elsif ( $blast eq 'wublast' ) {
1234 # warn($_);
1235 if (/E=(\S+)/) {
1236 $self->element(
1238 'Name' => 'Parameters_expect',
1239 'Data' => $1
1243 elsif (/nogaps/) {
1244 $self->element(
1246 'Name' => 'Parameters_allowgaps',
1247 'Data' => 'no'
1251 elsif (/ctxfactor=(\S+)/) {
1252 $self->element(
1254 'Name' => 'Statistics_ctxfactor',
1255 'Data' => $1
1259 elsif (
1260 /(postsw|links|span[12]?|warnings|notes|gi|noseqs|qres|qype)/
1263 $self->element(
1265 'Name' => "Parameters_$1",
1266 'Data' => 'yes'
1270 elsif (/(\S+)=(\S+)/) {
1271 $self->element(
1273 'Name' => "Parameters_$1",
1274 'Data' => $2
1278 elsif ( $last =~ /(Frame|Strand)\s+MatID\s+Matrix name/i ) {
1279 my $firstgapinfo = 1;
1280 my $frame = undef;
1281 while ( defined($_) && !/^\s+$/ ) {
1282 s/^\s+//;
1283 s/\s+$//;
1284 if ( $firstgapinfo
1285 && s/Q=(\d+),R=(\d+)\s+//x )
1287 $firstgapinfo = 0;
1289 $self->element(
1291 'Name' => 'Parameters_gap-open',
1292 'Data' => $1
1295 $self->element(
1297 'Name' => 'Parameters_gap-extend',
1298 'Data' => $2
1301 my @fields = split;
1303 for my $type (
1304 qw(lambda_gapped
1305 kappa_gapped
1306 entropy_gapped)
1309 next if $type eq 'n/a';
1310 if ( !@fields ) {
1311 warn "fields is empty for $type\n";
1312 next;
1314 $self->element(
1316 'Name' =>
1317 "Statistics_frame$frame\_$type",
1318 'Data' => shift @fields
1323 else {
1324 my ( $frameo, $matid, $matrix, @fields ) =
1325 split;
1326 if ( !defined $frame ) {
1328 # keep some sort of default feature I guess
1329 # even though this is sort of wrong
1330 $self->element(
1332 'Name' => 'Parameters_matrix',
1333 'Data' => $matrix
1336 $self->element(
1338 'Name' => 'Statistics_lambda',
1339 'Data' => $fields[0]
1342 $self->element(
1344 'Name' => 'Statistics_kappa',
1345 'Data' => $fields[1]
1348 $self->element(
1350 'Name' => 'Statistics_entropy',
1351 'Data' => $fields[2]
1355 $frame = $frameo;
1356 my $ii = 0;
1357 for my $type (
1358 qw(lambda_used
1359 kappa_used
1360 entropy_used
1361 lambda_computed
1362 kappa_computed
1363 entropy_computed)
1366 my $f = $fields[$ii];
1367 next unless defined $f; # deal with n/a
1368 if ( $f eq 'same' ) {
1369 $f = $fields[ $ii - 3 ];
1371 $ii++;
1372 $self->element(
1374 'Name' =>
1375 "Statistics_frame$frame\_$type",
1376 'Data' => $f
1383 # get the next line
1384 $_ = $self->_readline;
1386 $last = $_;
1388 elsif ( $last =~ /(Frame|Strand)\s+MatID\s+Length/i ) {
1389 my $frame = undef;
1390 while ( defined($_) && !/^\s+/ ) {
1391 s/^\s+//;
1392 s/\s+$//;
1393 my @fields = split;
1394 if ( @fields <= 3 ) {
1395 for my $type (qw(X_gapped E2_gapped S2)) {
1396 last unless @fields;
1397 $self->element(
1399 'Name' =>
1400 "Statistics_frame$frame\_$type",
1401 'Data' => shift @fields
1406 else {
1408 for my $type (
1409 qw(length
1410 efflength
1411 E S W T X E2 S2)
1414 $self->element(
1416 'Name' =>
1417 "Statistics_frame$frame\_$type",
1418 'Data' => shift @fields
1423 $_ = $self->_readline;
1425 $last = $_;
1427 elsif (/(\S+\s+\S+)\s+DFA:\s+(\S+)\s+\((.+)\)/) {
1428 if ( $1 eq 'states in' ) {
1429 $self->element(
1431 'Name' => 'Statistics_DFA_states',
1432 'Data' => "$2 $3"
1436 elsif ( $1 eq 'size of' ) {
1437 $self->element(
1439 'Name' => 'Statistics_DFA_size',
1440 'Data' => "$2 $3"
1445 elsif (
1446 m/^\s+Time to generate neighborhood:\s+
1447 (\S+\s+\S+\s+\S+)/x
1450 $self->element(
1452 'Name' => 'Statistics_neighbortime',
1453 'Data' => $1
1457 elsif (/processors\s+used:\s+(\d+)/) {
1458 $self->element(
1460 'Name' => 'Statistics_noprocessors',
1461 'Data' => $1
1465 elsif (
1466 m/^\s+(\S+)\s+cpu\s+time:\s+# cputype
1467 (\S+\s+\S+\s+\S+) # cputime
1468 \s+Elapsed:\s+(\S+)/x
1471 my $cputype = lc($1);
1472 $self->element(
1474 'Name' => "Statistics_$cputype\_cputime",
1475 'Data' => $2
1478 $self->element(
1480 'Name' => "Statistics_$cputype\_actualtime",
1481 'Data' => $3
1485 elsif (/^\s+Start:/) {
1486 my ( $junk, $start, $stime, $end, $etime ) =
1487 split( /\s+(Start|End)\:\s+/, $_ );
1488 chomp($stime);
1489 $self->element(
1491 'Name' => 'Statistics_starttime',
1492 'Data' => $stime
1495 chomp($etime);
1496 $self->element(
1498 'Name' => 'Statistics_endtime',
1499 'Data' => $etime
1503 elsif (/^\s+Database:\s+(.+)$/) {
1504 $self->element(
1506 'Name' => 'Parameters_full_dbpath',
1507 'Data' => $1
1512 elsif (/^\s+Posted:\s+(.+)/) {
1513 my $d = $1;
1514 chomp($d);
1515 $self->element(
1517 'Name' => 'Statistics_posted_date',
1518 'Data' => $d
1523 elsif ( $blast eq 'ncbi' ) {
1525 if (m/^Matrix:\s+(.+)\s*$/oxi) {
1526 $self->element(
1528 'Name' => 'Parameters_matrix',
1529 'Data' => $1
1533 elsif (/^Gapped/) {
1534 $gapped_stats = 1;
1536 elsif (/^Lambda/) {
1537 $_ = $self->_readline;
1538 s/^\s+//;
1539 my ( $lambda, $kappa, $entropy ) = split;
1540 if ($gapped_stats) {
1541 $self->element(
1543 'Name' => "Statistics_gapped_lambda",
1544 'Data' => $lambda
1547 $self->element(
1549 'Name' => "Statistics_gapped_kappa",
1550 'Data' => $kappa
1553 $self->element(
1555 'Name' => "Statistics_gapped_entropy",
1556 'Data' => $entropy
1560 else {
1561 $self->element(
1563 'Name' => "Statistics_lambda",
1564 'Data' => $lambda
1567 $self->element(
1569 'Name' => "Statistics_kappa",
1570 'Data' => $kappa
1573 $self->element(
1575 'Name' => "Statistics_entropy",
1576 'Data' => $entropy
1581 elsif (m/effective\s+search\s+space\s+used:\s+(\d+)/ox) {
1582 $self->element(
1584 'Name' => 'Statistics_eff-spaceused',
1585 'Data' => $1
1589 elsif (m/effective\s+search\s+space:\s+(\d+)/ox) {
1590 $self->element(
1592 'Name' => 'Statistics_eff-space',
1593 'Data' => $1
1597 elsif (
1598 m/Gap\s+Penalties:\s+Existence:\s+(\d+)\,
1599 \s+Extension:\s+(\d+)/ox
1602 $self->element(
1604 'Name' => 'Parameters_gap-open',
1605 'Data' => $1
1608 $self->element(
1610 'Name' => 'Parameters_gap-extend',
1611 'Data' => $2
1615 elsif (/effective\s+HSP\s+length:\s+(\d+)/) {
1616 $self->element(
1618 'Name' => 'Statistics_hsp-len',
1619 'Data' => $1
1623 elsif (/effective\s+length\s+of\s+query:\s+([\d\,]+)/) {
1624 my $c = $1;
1625 $c =~ s/\,//g;
1626 $self->element(
1628 'Name' => 'Statistics_query-len',
1629 'Data' => $c
1633 elsif (/effective\s+length\s+of\s+database:\s+([\d\,]+)/) {
1634 my $c = $1;
1635 $c =~ s/\,//g;
1636 $self->element(
1638 'Name' => 'Statistics_eff-dblen',
1639 'Data' => $c
1643 elsif (
1644 /^(T|A|X1|X2|X3|S1|S2):\s+(\d+(\.\d+)?)\s+(?:\(\s*(\d+\.\d+) bits\))?/
1647 my $v = $2;
1648 chomp($v);
1649 $self->element(
1651 'Name' => "Statistics_$1",
1652 'Data' => $v
1655 if ( defined $4 ) {
1656 $self->element(
1658 'Name' => "Statistics_$1_bits",
1659 'Data' => $4
1664 elsif (
1665 m/frameshift\s+window\,
1666 \s+decay\s+const:\s+(\d+)\,\s+([\.\d]+)/x
1669 $self->element(
1671 'Name' => 'Statistics_framewindow',
1672 'Data' => $1
1675 $self->element(
1677 'Name' => 'Statistics_decay',
1678 'Data' => $2
1682 elsif (m/^Number\s+of\s+Hits\s+to\s+DB:\s+(\S+)/ox) {
1683 $self->element(
1685 'Name' => 'Statistics_hit_to_db',
1686 'Data' => $1
1690 elsif (m/^Number\s+of\s+extensions:\s+(\S+)/ox) {
1691 $self->element(
1693 'Name' => 'Statistics_num_extensions',
1694 'Data' => $1
1698 elsif (
1699 m/^Number\s+of\s+successful\s+extensions:\s+
1700 (\S+)/ox
1703 $self->element(
1705 'Name' => 'Statistics_num_suc_extensions',
1706 'Data' => $1
1710 elsif (
1711 m/^Number\s+of\s+sequences\s+better\s+than\s+
1712 (\S+):\s+(\d+)/ox
1715 $self->element(
1717 'Name' => 'Parameters_expect',
1718 'Data' => $1
1721 $self->element(
1723 'Name' => 'Statistics_seqs_better_than_cutoff',
1724 'Data' => $2
1728 elsif (/^\s+Posted\s+date:\s+(.+)/) {
1729 my $d = $1;
1730 chomp($d);
1731 $self->element(
1733 'Name' => 'Statistics_posted_date',
1734 'Data' => $d
1738 elsif ( !/^\s+$/ ) {
1739 #$self->debug( "unmatched stat $_");
1742 $last = $_;
1744 } elsif ( $self->in_element('hsp') ) {
1745 $self->debug("blast.pm: Processing HSP\n");
1746 # let's read 3 lines at a time;
1747 # bl2seq hackiness... Not sure I like
1748 $self->{'_reporttype'} ||= $DEFAULTREPORTTYPE;
1749 my %data = (
1750 'Query' => '',
1751 'Mid' => '',
1752 'Hit' => ''
1754 my $len;
1755 for ( my $i = 0 ; defined($_) && $i < 3 ; $i++ ) {
1756 # $self->debug("$i: $_") if $v;
1757 if ( ( $i == 0 && /^\s+$/) ||
1758 /^\s*(?:Lambda|Minus|Plus|Score)/i )
1760 $self->_pushback($_) if defined $_;
1761 $self->end_element( { 'Name' => 'Hsp' } );
1762 last;
1764 chomp;
1765 if (/^((Query|Sbjct):?\s+(\-?\d+)\s*)(\S+)\s+(\-?\d+)/) {
1766 my ( $full, $type, $start, $str, $end ) =
1767 ( $1, $2, $3, $4, $5 );
1769 if ( $str eq '-' ) {
1770 $i = 3 if $type eq 'Sbjct';
1772 else {
1773 $data{$type} = $str;
1775 $len = length($full);
1776 $self->{"\_$type"}->{'begin'} = $start
1777 unless $self->{"_$type"}->{'begin'};
1778 $self->{"\_$type"}->{'end'} = $end;
1779 } else {
1780 $self->throw("no data for midline $_")
1781 unless ( defined $_ && defined $len );
1782 $data{'Mid'} = substr( $_, $len );
1784 $_ = $self->_readline();
1786 $self->characters(
1788 'Name' => 'Hsp_qseq',
1789 'Data' => $data{'Query'}
1792 $self->characters(
1794 'Name' => 'Hsp_hseq',
1795 'Data' => $data{'Sbjct'}
1798 $self->characters(
1800 'Name' => 'Hsp_midline',
1801 'Data' => $data{'Mid'}
1805 else {
1806 #$self->debug("blast.pm: unrecognized line $_");
1810 $self->debug("blast.pm: End of BlastOutput\n");
1811 if ( $self->{'_seentop'} ) {
1812 $self->within_element('hsp')
1813 && $self->end_element( { 'Name' => 'Hsp' } );
1814 $self->within_element('hit')
1815 && $self->end_element( { 'Name' => 'Hit' } );
1816 # cleanup extra hits
1817 $self->_cleanup_hits(\@hit_signifs) if scalar(@hit_signifs);
1818 $self->within_element('iteration')
1819 && $self->end_element( { 'Name' => 'Iteration' } );
1820 if ($bl2seq_fix) {
1821 $self->element(
1823 'Name' => 'BlastOutput_program',
1824 'Data' => $reporttype
1828 $self->end_element( { 'Name' => 'BlastOutput' } );
1830 return $self->end_document();
1833 # Private method for internal use only.
1834 sub _start_blastoutput {
1835 my $self = shift;
1836 $self->start_element( { 'Name' => 'BlastOutput' } );
1837 $self->{'_seentop'} = 1;
1838 $self->{'_result_count'}++;
1839 $self->{'_handler_rc'} = undef;
1842 =head2 _will_handle
1844 Title : _will_handle
1845 Usage : Private method. For internal use only.
1846 if( $self->_will_handle($type) ) { ... }
1847 Function: Provides an optimized way to check whether or not an element of a
1848 given type is to be handled.
1849 Returns : Reference to EventHandler object if the element type is to be handled.
1850 undef if the element type is not to be handled.
1851 Args : string containing type of element.
1853 Optimizations:
1855 =over 2
1857 =item 1
1859 Using the cached pointer to the EventHandler to minimize repeated
1860 lookups.
1862 =item 2
1864 Caching the will_handle status for each type that is encountered so
1865 that it only need be checked by calling
1866 handler-E<gt>will_handle($type) once.
1868 =back
1870 This does not lead to a major savings by itself (only 5-10%). In
1871 combination with other optimizations, or for large parse jobs, the
1872 savings good be significant.
1874 To test against the unoptimized version, remove the parentheses from
1875 around the third term in the ternary " ? : " operator and add two
1876 calls to $self-E<gt>_eventHandler().
1878 =cut
1880 sub _will_handle {
1881 my ( $self, $type ) = @_;
1882 my $handler = $self->{'_handler_cache'};
1883 my $will_handle =
1884 defined( $self->{'_will_handle_cache'}->{$type} )
1885 ? $self->{'_will_handle_cache'}->{$type}
1886 : ( $self->{'_will_handle_cache'}->{$type} =
1887 $handler->will_handle($type) );
1889 return $will_handle ? $handler : undef;
1892 =head2 start_element
1894 Title : start_element
1895 Usage : $eventgenerator->start_element
1896 Function: Handles a start element event
1897 Returns : none
1898 Args : hashref with at least 2 keys 'Data' and 'Name'
1900 =cut
1902 sub start_element {
1903 my ( $self, $data ) = @_;
1905 # we currently don't care about attributes
1906 my $nm = $data->{'Name'};
1907 my $type = $MODEMAP{$nm};
1908 if ($type) {
1909 my $handler = $self->_will_handle($type);
1910 if ($handler) {
1911 my $func = sprintf( "start_%s", lc $type );
1912 $self->{'_handler_rc'} = $handler->$func( $data->{'Attributes'} );
1914 #else {
1915 #$self->debug( # changed 4/29/2006 to play nice with other event handlers
1916 # "Bio::SearchIO::InternalParserError ".
1917 # "\nCan't handle elements of type \'$type.\'"
1920 unshift @{ $self->{'_elements'} }, $type;
1921 if ( $type eq 'result' ) {
1922 $self->{'_values'} = {};
1923 $self->{'_result'} = undef;
1924 } else {
1925 # cleanup some things
1926 if ( defined $self->{'_values'} ) {
1927 foreach my $k (
1928 grep { /^\U$type\-/ }
1929 keys %{ $self->{'_values'} }
1932 delete $self->{'_values'}->{$k};
1939 =head2 end_element
1941 Title : end_element
1942 Usage : $eventgenerator->end_element
1943 Function: Handles an end element event
1944 Returns : hashref with an element's worth of data
1945 Args : hashref with at least 2 keys 'Data' and 'Name'
1948 =cut
1950 sub end_element {
1951 my ( $self, $data ) = @_;
1953 my $nm = $data->{'Name'};
1954 my $type = $MODEMAP{$nm};
1955 my $rc;
1956 if ( $nm eq 'BlastOutput_program' ) {
1957 if ( $self->{'_last_data'} =~ /(t?blast[npx])/i ) {
1958 $self->{'_reporttype'} = uc $1;
1960 $self->{'_reporttype'} ||= $DEFAULTREPORTTYPE;
1963 # Hsps are sort of weird, in that they end when another
1964 # object begins so have to detect this in end_element for now
1965 if ( $nm eq 'Hsp' ) {
1966 foreach (qw(Hsp_qseq Hsp_midline Hsp_hseq)) {
1967 $self->element(
1969 'Name' => $_,
1970 'Data' => $self->{'_last_hspdata'}->{$_}
1974 $self->{'_last_hspdata'} = {};
1975 $self->element(
1977 'Name' => 'Hsp_query-from',
1978 'Data' => $self->{'_Query'}->{'begin'}
1981 $self->element(
1983 'Name' => 'Hsp_query-to',
1984 'Data' => $self->{'_Query'}->{'end'}
1988 $self->element(
1990 'Name' => 'Hsp_hit-from',
1991 'Data' => $self->{'_Sbjct'}->{'begin'}
1994 $self->element(
1996 'Name' => 'Hsp_hit-to',
1997 'Data' => $self->{'_Sbjct'}->{'end'}
2001 # } elsif( $nm eq 'Iteration' ) {
2002 # Nothing special needs to be done here.
2004 if ( $type = $MODEMAP{$nm} ) {
2005 my $handler = $self->_will_handle($type);
2006 if ($handler) {
2007 my $func = sprintf( "end_%s", lc $type );
2008 $rc = $handler->$func( $self->{'_reporttype'}, $self->{'_values'} );
2010 shift @{ $self->{'_elements'} };
2013 elsif ( $MAPPING{$nm} ) {
2015 if ( ref( $MAPPING{$nm} ) =~ /hash/i ) {
2017 # this is where we shove in the data from the
2018 # hashref info about params or statistics
2019 my $key = ( keys %{ $MAPPING{$nm} } )[0];
2020 $self->{'_values'}->{$key}->{ $MAPPING{$nm}->{$key} } =
2021 $self->{'_last_data'};
2023 else {
2024 $self->{'_values'}->{ $MAPPING{$nm} } = $self->{'_last_data'};
2027 else {
2028 #$self->debug("blast.pm: unknown nm $nm, ignoring\n");
2030 $self->{'_last_data'} = ''; # remove read data if we are at
2031 # end of an element
2032 $self->{'_result'} = $rc if ( defined $type && $type eq 'result' );
2033 return $rc;
2036 =head2 element
2038 Title : element
2039 Usage : $eventhandler->element({'Name' => $name, 'Data' => $str});
2040 Function: Convenience method that calls start_element, characters, end_element
2041 Returns : none
2042 Args : Hash ref with the keys 'Name' and 'Data'
2045 =cut
2047 sub element {
2048 my ( $self, $data ) = @_;
2049 # Note that start element isn't needed for character data
2050 # Not too SAX-y, though
2051 #$self->start_element($data);
2052 $self->characters($data);
2053 $self->end_element($data);
2056 =head2 characters
2058 Title : characters
2059 Usage : $eventgenerator->characters($str)
2060 Function: Send a character events
2061 Returns : none
2062 Args : string
2065 =cut
2067 sub characters {
2068 my ( $self, $data ) = @_;
2069 if ( $self->in_element('hsp')
2070 && $data->{'Name'} =~ /^Hsp\_(qseq|hseq|midline)$/ )
2072 $self->{'_last_hspdata'}->{ $data->{'Name'} } .= $data->{'Data'}
2073 if defined $data->{'Data'};
2075 return unless ( defined $data->{'Data'} && $data->{'Data'} !~ /^\s+$/ );
2076 $self->{'_last_data'} = $data->{'Data'};
2079 =head2 within_element
2081 Title : within_element
2082 Usage : if( $eventgenerator->within_element($element) ) {}
2083 Function: Test if we are within a particular element
2084 This is different than 'in' because within can be tested
2085 for a whole block.
2086 Returns : boolean
2087 Args : string element name
2089 See Also: L<in_element>
2091 =cut
2093 sub within_element {
2094 my ( $self, $name ) = @_;
2095 return 0
2096 if ( !defined $name && !defined $self->{'_elements'}
2097 || scalar @{ $self->{'_elements'} } == 0 );
2098 foreach ( @{ $self->{'_elements'} } ) {
2099 if ( $_ eq $name ) {
2100 return 1;
2103 return 0;
2106 =head2 in_element
2108 Title : in_element
2109 Usage : if( $eventgenerator->in_element($element) ) {}
2110 Function: Test if we are in a particular element
2111 This is different than 'within_element' because within
2112 can be tested for a whole block.
2113 Returns : boolean
2114 Args : string element name
2116 See Also: L<within_element>
2118 =cut
2120 sub in_element {
2121 my ( $self, $name ) = @_;
2122 return 0 if !defined $self->{'_elements'}->[0];
2123 return ( $self->{'_elements'}->[0] eq $name );
2126 =head2 start_document
2128 Title : start_document
2129 Usage : $eventgenerator->start_document
2130 Function: Handle a start document event
2131 Returns : none
2132 Args : none
2135 =cut
2137 sub start_document {
2138 my ($self) = @_;
2139 $self->{'_lasttype'} = '';
2140 $self->{'_values'} = {};
2141 $self->{'_result'} = undef;
2142 $self->{'_elements'} = [];
2145 =head2 end_document
2147 Title : end_document
2148 Usage : $eventgenerator->end_document
2149 Function: Handles an end document event
2150 Returns : Bio::Search::Result::ResultI object
2151 Args : none
2154 =cut
2156 sub end_document {
2157 my ( $self, @args ) = @_;
2159 #$self->debug("blast.pm: end_document\n");
2160 return $self->{'_result'};
2163 sub write_result {
2164 my ( $self, $blast, @args ) = @_;
2166 if ( not defined( $self->writer ) ) {
2167 $self->warn("Writer not defined. Using a $DEFAULT_BLAST_WRITER_CLASS");
2168 $self->writer( $DEFAULT_BLAST_WRITER_CLASS->new() );
2170 $self->SUPER::write_result( $blast, @args );
2173 sub result_count {
2174 my $self = shift;
2175 return $self->{'_result_count'};
2178 sub report_count { shift->result_count }
2180 =head2 inclusion_threshold
2182 Title : inclusion_threshold
2183 Usage : my $incl_thresh = $isreb->inclusion_threshold;
2184 : $isreb->inclusion_threshold(1e-5);
2185 Function: Get/Set the e-value threshold for inclusion in the PSI-BLAST
2186 score matrix model (blastpgp) that was used for generating the reports
2187 being parsed.
2188 Returns : number (real)
2189 Default value: $Bio::SearchIO::IteratedSearchResultEventBuilder::DEFAULT_INCLUSION_THRESHOLD
2190 Args : number (real) (e.g., 0.0001 or 1e-4 )
2192 =cut
2194 # Delegates to the event handler.
2195 sub inclusion_threshold {
2196 shift->_eventHandler->inclusion_threshold(@_);
2199 =head2 max_significance
2201 Usage : $obj->max_significance();
2202 Purpose : Set/Get the P or Expect value used as significance screening cutoff.
2203 This is the value of the -signif parameter supplied to new().
2204 Hits with P or E-value above this are skipped.
2205 Returns : Scientific notation number with this format: 1.0e-05.
2206 Argument : Scientific notation number or float (when setting)
2207 Comments : Screening of significant hits uses the data provided on the
2208 : description line. For NCBI BLAST1 and WU-BLAST, this data
2209 : is P-value. for NCBI BLAST2 it is an Expect value.
2211 =cut
2213 sub max_significance { shift->{'_handler_cache'}->max_significance(@_) }
2215 =head2 signif
2217 Synonym for L<max_significance()|max_significance>
2219 =cut
2221 sub signif { shift->max_significance(@_) }
2223 =head2 min_score
2225 Usage : $obj->min_score();
2226 Purpose : Set/Get the Blast score used as screening cutoff.
2227 This is the value of the -score parameter supplied to new().
2228 Hits with scores below this are skipped.
2229 Returns : Integer or scientific notation number.
2230 Argument : Integer or scientific notation number (when setting)
2231 Comments : Screening of significant hits uses the data provided on the
2232 : description line.
2234 =cut
2236 sub min_score { shift->{'_handler_cache'}->max_significance(@_) }
2238 =head2 min_query_length
2240 Usage : $obj->min_query_length();
2241 Purpose : Gets the query sequence length used as screening criteria.
2242 This is the value of the -min_query_len parameter supplied to new().
2243 Hits with sequence length below this are skipped.
2244 Returns : Integer
2245 Argument : n/a
2247 =cut
2249 sub min_query_length {
2250 my $self = shift;
2251 if (@_) {
2252 my $min_qlen = shift;
2253 if ( $min_qlen =~ /\D/ or $min_qlen <= 0 ) {
2254 $self->throw(
2255 -class => 'Bio::Root::BadParameter',
2256 -text => "Invalid minimum query length value: $min_qlen\n"
2257 . "Value must be an integer > 0. Value not set.",
2258 -value => $min_qlen
2261 $self->{'_confirm_qlength'} = 1;
2262 $self->{'_min_query_length'} = $min_qlen;
2265 return $self->{'_min_query_length'};
2268 =head2 best_hit_only
2270 Title : best_hit_only
2271 Usage : print "only getting best hit.\n" if $obj->best_hit_only;
2272 Purpose : Set/Get the indicator for whether or not to process only
2273 : the best BlastHit.
2274 Returns : Boolean (1 | 0)
2275 Argument : Boolean (1 | 0) (when setting)
2277 =cut
2279 sub best_hit_only {
2280 my $self = shift;
2281 if (@_) { $self->{'_best'} = shift; }
2282 $self->{'_best'};
2285 =head2 check_all_hits
2287 Title : check_all_hits
2288 Usage : print "checking all hits.\n" if $obj->check_all_hits;
2289 Purpose : Set/Get the indicator for whether or not to process all hits.
2290 : If false, the parser will stop processing hits after the
2291 : the first non-significance hit or the first hit that fails
2292 : any hit filter.
2293 Returns : Boolean (1 | 0)
2294 Argument : Boolean (1 | 0) (when setting)
2296 =cut
2298 sub check_all_hits {
2299 my $self = shift;
2300 if (@_) { $self->{'_check_all'} = shift; }
2301 $self->{'_check_all'};
2304 # commented out, using common base class util method
2305 #=head2 _get_accession_version
2307 # Title : _get_accession_version
2308 # Usage : my ($acc,$ver) = &_get_accession_version($id)
2309 # Function:Private function to get an accession,version pair
2310 # for an ID (if it is in NCBI format)
2311 # Returns : 2-pule of accession, version
2312 # Args : ID string to process
2315 #=cut
2317 #sub _get_accession_version {
2318 # my $id = shift;
2320 # # handle case when this is accidently called as a class method
2321 # if ( ref($id) && $id->isa('Bio::SearchIO') ) {
2322 # $id = shift;
2324 # return unless defined $id;
2325 # my ( $acc, $version );
2326 # if ( $id =~ /(gb|emb|dbj|sp|pdb|bbs|ref|lcl)\|(.*)\|(.*)/ ) {
2327 # ( $acc, $version ) = split /\./, $2;
2329 # elsif ( $id =~ /(pir|prf|pat|gnl)\|(.*)\|(.*)/ ) {
2330 # ( $acc, $version ) = split /\./, $3;
2332 # else {
2334 # #punt, not matching the db's at ftp://ftp.ncbi.nih.gov/blast/db/README
2335 # #Database Name Identifier Syntax
2336 # #============================ ========================
2337 # #GenBank gb|accession|locus
2338 # #EMBL Data Library emb|accession|locus
2339 # #DDBJ, DNA Database of Japan dbj|accession|locus
2340 # #NBRF PIR pir||entry
2341 # #Protein Research Foundation prf||name
2342 # #SWISS-PROT sp|accession|entry name
2343 # #Brookhaven Protein Data Bank pdb|entry|chain
2344 # #Patents pat|country|number
2345 # #GenInfo Backbone Id bbs|number
2346 # #General database identifier gnl|database|identifier
2347 # #NCBI Reference Sequence ref|accession|locus
2348 # #Local Sequence identifier lcl|identifier
2349 # $acc = $id;
2351 # return ( $acc, $version );
2354 # general private method used to make minimal hits from leftover
2355 # data in the hit table
2357 sub _cleanup_hits {
2358 my ($self, $hits) = @_;
2359 while ( my $v = shift @{ $hits }) {
2360 next unless defined $v;
2361 $self->start_element( { 'Name' => 'Hit' } );
2362 my $id = $v->[2];
2363 my $desc = $v->[3];
2364 $self->element(
2366 'Name' => 'Hit_id',
2367 'Data' => $id
2370 my ($gi, $acc, $version ) = $self->_get_seq_identifiers($id);
2371 $self->element(
2373 'Name' => 'Hit_accession',
2374 'Data' => $acc
2377 if ( defined $v ) {
2378 $self->element(
2380 'Name' => 'Hit_signif',
2381 'Data' => $v->[0]
2384 if (exists $self->{'_wublast'}) {
2385 $self->element(
2387 'Name' => 'Hit_score',
2388 'Data' => $v->[1]
2391 } else {
2392 $self->element(
2394 'Name' => 'Hit_bits',
2395 'Data' => $v->[1]
2401 $self->element(
2403 'Name' => 'Hit_def',
2404 'Data' => $desc
2407 $self->end_element( { 'Name' => 'Hit' } );
2414 __END__
2416 Developer Notes
2417 ---------------
2419 The following information is added in hopes of increasing the
2420 maintainability of this code. It runs the risk of becoming obsolete as
2421 the code gets updated. As always, double check against the actual
2422 source. If you find any discrepencies, please correct them.
2423 [ This documentation added on 3 Jun 2003. ]
2425 The logic is the brainchild of Jason Stajich, documented by Steve
2426 Chervitz. Jason: please check it over and modify as you see fit.
2428 Question:
2429 Elmo wants to know: How does this module unmarshall data from the input stream?
2430 (i.e., how does information from a raw input file get added to
2431 the correct Bioperl object?)
2433 Answer:
2435 This answer is specific to SearchIO::blast, but may apply to other
2436 SearchIO.pm subclasses as well. The following description gives the
2437 basic idea. The actual processing is a little more complex for
2438 certain types of data (HSP, Report Parameters).
2440 You can think of blast::next_result() as faking a SAX XML parser,
2441 making a non-XML document behave like its XML. The overhead to do this
2442 is quite substantial (~650 lines of code instead of ~80 in
2443 blastxml.pm).
2445 0. First, add a key => value pair for the datum of interest to %MAPPING
2446 Example:
2447 'Foo_bar' => 'Foo-bar',
2449 1. next_result() collects the datum of interest from the input stream,
2450 and calls element().
2451 Example:
2452 $self->element({ 'Name' => 'Foo_bar',
2453 'Data' => $foobar});
2455 2. The element() method is a convenience method that calls start_element(),
2456 characters(), and end_element().
2458 3. start_element() checks to see if the event handler can handle a start_xxx(),
2459 where xxx = the 'Name' parameter passed into element(), and calls start_xxx()
2460 if so. Otherwise, start_element() does not do anything.
2462 Data that will have such an event handler are defined in %MODEMAP.
2463 Typically, there are only handler methods for the main parts of
2464 the search result (e.g., Result, Iteration, Hit, HSP),
2465 which have corresponding Bioperl modules. So in this example,
2466 there was an earlier call such as $self->element({'Name'=>'Foo'})
2467 and the Foo_bar datum is meant to ultimately go into a Foo object.
2469 The start_foo() method in the handler will typically do any
2470 data initialization necessary to prepare for creating a new Foo object.
2471 Example: SearchResultEventBuilder::start_result()
2473 4. characters() takes the value of the 'Data' key from the hashref argument in
2474 the elements() call and saves it in a local data member:
2475 Example:
2476 $self->{'_last_data'} = $data->{'Data'};
2478 5. end_element() is like start_element() in that it does the check for whether
2479 the event handler can handle end_xxx() and if so, calls it, passing in
2480 the data collected from all of the characters() calls that occurred
2481 since the start_xxx() call.
2483 If there isn't any special handler for the data type specified by 'Name',
2484 end_element() will place the data saved by characters() into another
2485 local data member that saves it in a hash with a key defined by %MAPPING.
2486 Example:
2487 $nm = $data->{'Name'};
2488 $self->{'_values'}->{$MAPPING{$nm}} = $self->{'_last_data'};
2490 In this case, $MAPPING{$nm} is 'Foo-bar'.
2492 end_element() finishes by resetting the local data member used by
2493 characters(). (i.e., $self->{'_last_data'} = '';)
2495 6. When the next_result() method encounters the end of the Foo element in the
2496 input stream. It will invoke $self->end_element({'Name'=>'Foo'}).
2497 end_element() then sends all of the data in the $self->{'_values'} hash.
2498 Note that $self->{'_values'} is cleaned out during start_element(),
2499 keeping it at a resonable size.
2501 In the event handler, the end_foo() method takes the hash from end_element()
2502 and creates a new hash containing the same data, but having keys lacking
2503 the 'Foo' prefix (e.g., 'Foo-bar' becomes '-bar'). The handler's end_foo()
2504 method then creates the Foo object, passing in this new hash as an argument.
2505 Example: SearchResultEventBuilder::end_result()
2507 7. Objects created from the data in the search result are managed by
2508 the event handler which adds them to a ResultI object (using API methods
2509 for that object). The ResultI object gets passed back to
2510 SearchIO::end_element() when it calls end_result().
2512 The ResultI object is then saved in an internal data member of the
2513 SearchIO object, which returns it at the end of next_result()
2514 by calling end_document().
2516 (Technical Note: All objects created by end_xxx() methods in the event
2517 handler are returned to SearchIO::end_element(), but the SearchIO object
2518 only cares about the ResultI objects.)
2520 (Sesame Street aficionados note: This answer was NOT given by Mr. Noodle ;-P)