Remove unused
[bioperl-live.git] / Bio / SearchIO / blast.pm
blobf490118fc53c71cd8671d129a5df9ba1c04a530a
2 # BioPerl module for Bio::SearchIO::blast
4 # Please direct questions and support issues to <bioperl-l@bioperl.org>
6 # Cared for by Jason Stajich <jason@bioperl.org>
8 # Copyright Jason Stajich
10 # You may distribute this module under the same terms as perl itself
12 # POD documentation - main docs before the code
14 # 20030409 - sac
15 # PSI-BLAST full parsing support. Rollout of new
16 # model which will remove Steve's old psiblast driver
17 # 20030424 - jason
18 # Megablast parsing fix as reported by Neil Saunders
19 # 20030427 - jason
20 # Support bl2seq parsing
21 # 20031124 - jason
22 # Parse more blast statistics, lambda, entropy, etc
23 # from WU-BLAST in frame-specific manner
24 # 20060216 - cjf - fixed blast parsing for BLAST v2.2.13 output
25 # 20071104 - dmessina - added support for WUBLAST -echofilter
26 # 20071121 - cjf - fixed several bugs (bugs 2391, 2399, 2409)
28 =head1 NAME
30 Bio::SearchIO::blast - Event generator for event based parsing of
31 blast reports
33 =head1 SYNOPSIS
35 # Do not use this object directly - it is used as part of the
36 # Bio::SearchIO system.
38 use Bio::SearchIO;
39 my $searchio = Bio::SearchIO->new(-format => 'blast',
40 -file => 't/data/ecolitst.bls');
41 while( my $result = $searchio->next_result ) {
42 while( my $hit = $result->next_hit ) {
43 while( my $hsp = $hit->next_hsp ) {
44 # ...
49 =head1 DESCRIPTION
51 This object encapsulated the necessary methods for generating events
52 suitable for building Bio::Search objects from a BLAST report file.
53 Read the L<Bio::SearchIO> for more information about how to use this.
55 This driver can parse:
57 =over 4
59 =item *
61 NCBI produced plain text BLAST reports from blastall, this also
62 includes PSIBLAST, PSITBLASTN, RPSBLAST, and bl2seq reports. NCBI XML
63 BLAST output is parsed with the blastxml SearchIO driver
65 =item *
67 WU-BLAST all reports
69 =item *
71 Jim Kent's BLAST-like output from his programs (BLASTZ, BLAT)
73 =item *
75 BLAST-like output from Paracel BTK output
77 =back
79 =head2 bl2seq parsing
81 Since I cannot differentiate between BLASTX and TBLASTN since bl2seq
82 doesn't report the algorithm used - I assume it is BLASTX by default -
83 you can supply the program type with -report_type in the SearchIO
84 constructor i.e.
86 my $parser = Bio::SearchIO->new(-format => 'blast',
87 -file => 'bl2seq.tblastn.report',
88 -report_type => 'tblastn');
90 This only really affects where the frame and strand information are
91 put - they will always be on the $hsp-E<gt>query instead of on the
92 $hsp-E<gt>hit part of the feature pair for blastx and tblastn bl2seq
93 produced reports. Hope that's clear...
95 =head1 FEEDBACK
97 =head2 Mailing Lists
99 User feedback is an integral part of the evolution of this and other
100 Bioperl modules. Send your comments and suggestions preferably to
101 the Bioperl mailing list. Your participation is much appreciated.
103 bioperl-l@bioperl.org - General discussion
104 http://bioperl.org/wiki/Mailing_lists - About the mailing lists
106 =head2 Support
108 Please direct usage questions or support issues to the mailing list:
110 I<bioperl-l@bioperl.org>
112 rather than to the module maintainer directly. Many experienced and
113 reponsive experts will be able look at the problem and quickly
114 address it. Please include a thorough description of the problem
115 with code and data examples if at all possible.
117 =head2 Reporting Bugs
119 Report bugs to the Bioperl bug tracking system to help us keep track
120 of the bugs and their resolution. Bug reports can be submitted via the
121 web:
123 https://redmine.open-bio.org/projects/bioperl/
125 =head1 AUTHOR - Jason Stajich
127 Email Jason Stajich jason-at-bioperl.org
129 =head1 CONTRIBUTORS
131 Steve Chervitz sac-at-bioperl.org
133 =head1 APPENDIX
135 The rest of the documentation details each of the object methods.
136 Internal methods are usually preceded with a _
138 =cut
140 # Let the code begin...'
142 package Bio::SearchIO::blast;
144 use Bio::SearchIO::IteratedSearchResultEventBuilder;
145 use strict;
146 use vars qw(%MAPPING %MODEMAP
147 $DEFAULT_BLAST_WRITER_CLASS
148 $MAX_HSP_OVERLAP
149 $DEFAULT_SIGNIF
150 $DEFAULT_SCORE
151 $DEFAULTREPORTTYPE
155 use base qw(Bio::SearchIO);
156 use Data::Dumper;
158 BEGIN {
160 # mapping of NCBI Blast terms to Bioperl hash keys
161 %MODEMAP = (
162 'BlastOutput' => 'result',
163 'Iteration' => 'iteration',
164 'Hit' => 'hit',
165 'Hsp' => 'hsp'
168 # This should really be done more intelligently, like with
169 # XSLT
171 %MAPPING = (
172 'Hsp_bit-score' => 'HSP-bits',
173 'Hsp_score' => 'HSP-score',
174 'Hsp_evalue' => 'HSP-evalue',
175 'Hsp_n', => 'HSP-n',
176 'Hsp_pvalue' => 'HSP-pvalue',
177 'Hsp_query-from' => 'HSP-query_start',
178 'Hsp_query-to' => 'HSP-query_end',
179 'Hsp_hit-from' => 'HSP-hit_start',
180 'Hsp_hit-to' => 'HSP-hit_end',
181 'Hsp_positive' => 'HSP-conserved',
182 'Hsp_identity' => 'HSP-identical',
183 'Hsp_gaps' => 'HSP-hsp_gaps',
184 'Hsp_hitgaps' => 'HSP-hit_gaps',
185 'Hsp_querygaps' => 'HSP-query_gaps',
186 'Hsp_qseq' => 'HSP-query_seq',
187 'Hsp_hseq' => 'HSP-hit_seq',
188 'Hsp_midline' => 'HSP-homology_seq',
189 'Hsp_align-len' => 'HSP-hsp_length',
190 'Hsp_query-frame' => 'HSP-query_frame',
191 'Hsp_hit-frame' => 'HSP-hit_frame',
192 'Hsp_links' => 'HSP-links',
193 'Hsp_group' => 'HSP-hsp_group',
194 'Hsp_features' => 'HSP-hit_features',
196 'Hit_id' => 'HIT-name',
197 'Hit_len' => 'HIT-length',
198 'Hit_accession' => 'HIT-accession',
199 'Hit_def' => 'HIT-description',
200 'Hit_signif' => 'HIT-significance',
201 # For NCBI blast, the description line contains bits.
202 # For WU-blast, the description line contains score.
203 'Hit_score' => 'HIT-score',
204 'Hit_bits' => 'HIT-bits',
206 'Iteration_iter-num' => 'ITERATION-number',
207 'Iteration_converged' => 'ITERATION-converged',
209 'BlastOutput_program' => 'RESULT-algorithm_name',
210 'BlastOutput_version' => 'RESULT-algorithm_version',
211 'BlastOutput_algorithm-reference' => 'RESULT-algorithm_reference',
212 'BlastOutput_rid' => 'RESULT-rid',
213 'BlastOutput_query-def' => 'RESULT-query_name',
214 'BlastOutput_query-len' => 'RESULT-query_length',
215 'BlastOutput_query-acc' => 'RESULT-query_accession',
216 'BlastOutput_query-gi' => 'RESULT-query_gi',
217 'BlastOutput_querydesc' => 'RESULT-query_description',
218 'BlastOutput_db' => 'RESULT-database_name',
219 'BlastOutput_db-len' => 'RESULT-database_entries',
220 'BlastOutput_db-let' => 'RESULT-database_letters',
221 'BlastOutput_inclusion-threshold' => 'RESULT-inclusion_threshold',
223 'Parameters_matrix' => { 'RESULT-parameters' => 'matrix' },
224 'Parameters_expect' => { 'RESULT-parameters' => 'expect' },
225 'Parameters_include' => { 'RESULT-parameters' => 'include' },
226 'Parameters_sc-match' => { 'RESULT-parameters' => 'match' },
227 'Parameters_sc-mismatch' => { 'RESULT-parameters' => 'mismatch' },
228 'Parameters_gap-open' => { 'RESULT-parameters' => 'gapopen' },
229 'Parameters_gap-extend' => { 'RESULT-parameters' => 'gapext' },
230 'Parameters_filter' => { 'RESULT-parameters' => 'filter' },
231 'Parameters_allowgaps' => { 'RESULT-parameters' => 'allowgaps' },
232 'Parameters_full_dbpath' => { 'RESULT-parameters' => 'full_dbpath' },
233 'Statistics_db-len' => { 'RESULT-statistics' => 'dbentries' },
234 'Statistics_db-let' => { 'RESULT-statistics' => 'dbletters' },
235 'Statistics_hsp-len' =>
236 { 'RESULT-statistics' => 'effective_hsplength' },
237 'Statistics_query-len' => { 'RESULT-statistics' => 'querylength' },
238 'Statistics_eff-space' => { 'RESULT-statistics' => 'effectivespace' },
239 'Statistics_eff-spaceused' =>
240 { 'RESULT-statistics' => 'effectivespaceused' },
241 'Statistics_eff-dblen' =>
242 { 'RESULT-statistics' => 'effectivedblength' },
243 'Statistics_kappa' => { 'RESULT-statistics' => 'kappa' },
244 'Statistics_lambda' => { 'RESULT-statistics' => 'lambda' },
245 'Statistics_entropy' => { 'RESULT-statistics' => 'entropy' },
246 'Statistics_gapped_kappa' => { 'RESULT-statistics' => 'kappa_gapped' },
247 'Statistics_gapped_lambda' =>
248 { 'RESULT-statistics' => 'lambda_gapped' },
249 'Statistics_gapped_entropy' =>
250 { 'RESULT-statistics' => 'entropy_gapped' },
252 'Statistics_framewindow' =>
253 { 'RESULT-statistics' => 'frameshiftwindow' },
254 'Statistics_decay' => { 'RESULT-statistics' => 'decayconst' },
256 'Statistics_hit_to_db' => { 'RESULT-statistics' => 'Hits_to_DB' },
257 'Statistics_num_suc_extensions' =>
258 { 'RESULT-statistics' => 'num_successful_extensions' },
259 'Statistics_length_adjustment' => { 'RESULT-statistics' => 'length_adjustment' },
260 'Statistics_number_of_hsps_better_than_expect_value_cutoff_without_gapping' =>
261 { 'RESULT-statistics' => 'number_of_hsps_better_than_expect_value_cutoff_without_gapping' },
262 'Statistics_number_of_hsps_gapped' => { 'RESULT-statistics' => 'number_of_hsps_gapped' },
263 'Statistics_number_of_hsps_successfully_gapped' => { 'RESULT-statistics' => 'number_of_hsps_successfully_gapped' },
265 # WU-BLAST stats
266 'Statistics_DFA_states' => { 'RESULT-statistics' => 'num_dfa_states' },
267 'Statistics_DFA_size' => { 'RESULT-statistics' => 'dfa_size' },
268 'Statistics_noprocessors' =>
269 { 'RESULT-statistics' => 'no_of_processors' },
270 'Statistics_neighbortime' =>
271 { 'RESULT-statistics' => 'neighborhood_generate_time' },
272 'Statistics_starttime' => { 'RESULT-statistics' => 'start_time' },
273 'Statistics_endtime' => { 'RESULT-statistics' => 'end_time' },
276 # add WU-BLAST Frame-Based Statistics
277 for my $frame ( 0 .. 3 ) {
278 for my $strand ( '+', '-' ) {
279 for my $ind (
280 qw(length efflength E S W T X X_gapped E2
281 E2_gapped S2)
284 $MAPPING{"Statistics_frame$strand$frame\_$ind"} =
285 { 'RESULT-statistics' => "Frame$strand$frame\_$ind" };
287 for my $val (qw(lambda kappa entropy )) {
288 for my $type (qw(used computed gapped)) {
289 my $key = "Statistics_frame$strand$frame\_$val\_$type";
290 my $val =
291 { 'RESULT-statistics' =>
292 "Frame$strand$frame\_$val\_$type" };
293 $MAPPING{$key} = $val;
299 # add Statistics
300 for my $stats (
301 qw(T A X1 X2 X3 S1 S2 X1_bits X2_bits X3_bits
302 S1_bits S2_bits num_extensions
303 num_successful_extensions
304 seqs_better_than_cutoff
305 posted_date
306 search_cputime total_cputime
307 search_actualtime total_actualtime
308 no_of_processors ctxfactor)
311 my $key = "Statistics_$stats";
312 my $val = { 'RESULT-statistics' => $stats };
313 $MAPPING{$key} = $val;
316 # add WU-BLAST Parameters
317 for my $param (
318 qw(span span1 span2 links warnings notes hspsepsmax
319 hspsepqmax topcomboN topcomboE postsw cpus wordmask
320 filter sort_by_pvalue sort_by_count sort_by_highscore
321 sort_by_totalscore sort_by_subjectlength noseqs gi qtype
322 qres V B Z Y M N)
325 my $key = "Parameters_$param";
326 my $val = { 'RESULT-parameters' => $param };
327 $MAPPING{$key} = $val;
330 $DEFAULT_BLAST_WRITER_CLASS = 'Bio::SearchIO::Writer::HitTableWriter';
331 $MAX_HSP_OVERLAP = 2; # Used when tiling multiple HSPs.
332 $DEFAULTREPORTTYPE = 'BLASTP'; # for bl2seq
335 =head2 new
337 Title : new
338 Usage : my $obj = Bio::SearchIO::blast->new(%args);
339 Function: Builds a new Bio::SearchIO::blast object
340 Returns : Bio::SearchIO::blast
341 Args : Key-value pairs:
342 -fh/-file => filehandle/filename to BLAST file
343 -format => 'blast'
344 -report_type => 'blastx', 'tblastn', etc -- only for bl2seq
345 reports when you want to distinguish between
346 tblastn and blastx reports (this only controls
347 where the frame information is put - on the query
348 or subject object.
349 -inclusion_threshold => e-value threshold for inclusion in the
350 PSI-BLAST score matrix model (blastpgp)
351 -signif => float or scientific notation number to be used
352 as a P- or Expect value cutoff
353 -score => integer or scientific notation number to be used
354 as a blast score value cutoff
355 -bits => integer or scientific notation number to be used
356 as a bit score value cutoff
357 -hit_filter => reference to a function to be used for
358 filtering hits based on arbitrary criteria.
359 All hits of each BLAST report must satisfy
360 this criteria to be retained.
361 If a hit fails this test, it is ignored.
362 This function should take a
363 Bio::Search::Hit::BlastHit.pm object as its first
364 argument and return true
365 if the hit should be retained.
366 Sample filter function:
367 -hit_filter => sub { $hit = shift;
368 $hit->gaps == 0; },
369 (Note: -filt_func is synonymous with -hit_filter)
370 -overlap => integer. The amount of overlap to permit between
371 adjacent HSPs when tiling HSPs. A reasonable value is 2.
372 Default = $Bio::SearchIO::blast::MAX_HSP_OVERLAP.
374 The following criteria are not yet supported:
375 (these are probably best applied within this module rather than in the
376 event handler since they would permit the parser to take some shortcuts.)
378 -check_all_hits => boolean. Check all hits for significance against
379 significance criteria. Default = false.
380 If false, stops processing hits after the first
381 non-significant hit or the first hit that fails
382 the hit_filter call. This speeds parsing,
383 taking advantage of the fact that the hits
384 are processed in the order they appear in the report.
385 -min_query_len => integer to be used as a minimum for query sequence length.
386 Reports with query sequences below this length will
387 not be processed. Default = no minimum length.
388 -best => boolean. Only process the best hit of each report;
389 default = false.
391 =cut
393 sub _initialize {
394 my ( $self, @args ) = @_;
395 $self->SUPER::_initialize(@args);
397 # Blast reports require a specialized version of the SREB due to the
398 # possibility of iterations (PSI-BLAST). Forwarding all arguments to it. An
399 # issue here is that we want to set new default object factories if none are
400 # supplied.
402 my $handler = Bio::SearchIO::IteratedSearchResultEventBuilder->new(@args);
403 $self->attach_EventHandler($handler);
405 # 2006-04-26 move this to the attach_handler function in this module so we
406 # can really reset the handler
407 # Optimization: caching
408 # the EventHandler since it is used a lot during the parse.
410 # $self->{'_handler_cache'} = $handler;
412 my ($rpttype ) = $self->_rearrange(
415 REPORT_TYPE)
417 @args
419 defined $rpttype && ( $self->{'_reporttype'} = $rpttype );
422 sub attach_EventHandler {
423 my ($self,$handler) = @_;
425 $self->SUPER::attach_EventHandler($handler);
427 # Optimization: caching the EventHandler since it is used a lot
428 # during the parse.
430 $self->{'_handler_cache'} = $handler;
431 return;
434 =head2 next_result
436 Title : next_result
437 Usage : my $hit = $searchio->next_result;
438 Function: Returns the next Result from a search
439 Returns : Bio::Search::Result::ResultI object
440 Args : none
442 =cut
444 sub next_result {
445 my ($self) = @_;
446 my $v = $self->verbose;
447 my $data = '';
448 my $flavor = '';
449 $self->{'_seentop'} = 0; # start next report at top
451 my ( $reporttype, $seenquery, $reportline, $reportversion );
452 my ( $seeniteration, $found_again );
453 my $incl_threshold = $self->inclusion_threshold;
454 my $bl2seq_fix;
455 $self->start_document(); # let the fun begin...
456 my (@hit_signifs);
457 my $gapped_stats = 0; # for switching between gapped/ungapped
458 # lambda, K, H
459 local $_ = "\n"; #consistency
460 PARSER:
461 while ( defined( $_ = $self->_readline ) ) {
462 next if (/^\s+$/); # skip empty lines
463 next if (/CPU time:/);
464 next if (/^>\s*$/);
465 next if (/[*]+\s+No hits found\s+[*]+/);
466 if (
467 /^((?:\S+?)?BLAST[NPX]?)\s+(.+)$/i # NCBI BLAST, PSIBLAST
468 # RPSBLAST, MEGABLAST
469 || /^(P?GENEWISE|HFRAME|SWN|TSWN)\s+(.+)/i #Paracel BTK
472 ($reporttype, $reportversion) = ($1, $2);
473 # need to keep track of whether this is WU-BLAST
474 if ($reportversion && $reportversion =~ m{WashU$}) {
475 $self->{'_wublast'}++;
477 $self->debug("blast.pm: Start of new report: $reporttype, $reportversion\n");
478 if ( $self->{'_seentop'} ) {
479 # This handles multi-result input streams
480 $self->_pushback($_);
481 last PARSER;
483 $self->_start_blastoutput;
484 if ($reporttype =~ /RPS-BLAST/) {
485 $reporttype .= '(BLASTP)'; # default RPS-BLAST type
487 $reportline = $_; # to fix the fact that RPS-BLAST output is wrong
488 $self->element(
490 'Name' => 'BlastOutput_program',
491 'Data' => $reporttype
495 $self->element(
497 'Name' => 'BlastOutput_version',
498 'Data' => $reportversion
501 $self->element(
503 'Name' => 'BlastOutput_inclusion-threshold',
504 'Data' => $incl_threshold
508 # parse the BLAST algorithm reference
509 elsif(/^Reference:\s+(.*)$/) {
510 # want to preserve newlines for the BLAST algorithm reference
511 my $algorithm_reference = "$1\n";
512 $_ = $self->_readline;
513 # while the current line, does not match an empty line, a RID:, a
514 # Database:, or a query definition line (Query=) we are still
515 # looking at the algorithm_reference, append it to what we parsed so
516 # far
517 while($_ !~ /^$/ && $_ !~ /^RID:/ && $_ !~ /^Database:/ && $_ !~ /^Query=/) {
518 $algorithm_reference .= "$_";
519 $_ = $self->_readline;
521 # if we exited the while loop, we saw an empty line, a RID:, or a
522 # Database:, so push it back
523 $self->_pushback($_);
524 $self->element(
526 'Name' => 'BlastOutput_algorithm-reference',
527 'Data' => $algorithm_reference
531 # parse BLAST RID (Request ID)
532 elsif(/^RID:\s+(.*)$/) {
533 my $rid = $1;
534 $self->element(
536 'Name' => 'BlastOutput_rid',
537 'Data' => $rid
541 # added Windows workaround for bug 1985
542 elsif (/^(Searching|Results from round)/) {
543 next unless $1 =~ /Results from round/;
544 $self->debug("blast.pm: Possible psi blast iterations found...\n");
546 $self->in_element('hsp')
547 && $self->end_element( { 'Name' => 'Hsp' } );
548 $self->in_element('hit')
549 && $self->end_element( { 'Name' => 'Hit' } );
550 if ( defined $seeniteration ) {
551 $self->within_element('iteration')
552 && $self->end_element( { 'Name' => 'Iteration' } );
553 $self->start_element( { 'Name' => 'Iteration' } );
555 else {
556 $self->start_element( { 'Name' => 'Iteration' } );
558 $seeniteration = 1;
560 elsif (/^Query=\s*(.*)$/) {
561 my $q = $1;
562 $self->debug("blast.pm: Query= found...$_\n");
563 my $size = 0;
564 if ( defined $seenquery ) {
565 $self->_pushback($_);
566 $self->_pushback($reportline) if $reportline;
567 last PARSER;
569 if ( !defined $reporttype ) {
570 $self->_start_blastoutput;
571 if ( defined $seeniteration ) {
572 $self->in_element('iteration')
573 && $self->end_element( { 'Name' => 'Iteration' } );
574 $self->start_element( { 'Name' => 'Iteration' } );
576 else {
577 $self->start_element( { 'Name' => 'Iteration' } );
579 $seeniteration = 1;
581 $seenquery = $q;
582 $_ = $self->_readline;
583 while ( defined($_) ) {
584 if (/^Database:/) {
585 $self->_pushback($_);
586 last;
588 # below line fixes length issue with BLAST v2.2.13; still works
589 # with BLAST v2.2.12
590 if ( /\((\-?[\d,]+)\s+letters.*\)/ || /^Length=(\-?[\d,]+)/ ) {
591 $size = $1;
592 $size =~ s/,//g;
593 last;
595 else {
596 # bug 2391
597 $q .= ($q =~ /\w$/ && $_ =~ /^\w/) ? " $_" : $_;
598 $q =~ s/\s+/ /g; # this catches the newline as well
599 $q =~ s/^ | $//g;
602 $_ = $self->_readline;
604 chomp($q);
605 my ( $nm, $desc ) = split( /\s+/, $q, 2 );
606 $self->element(
608 'Name' => 'BlastOutput_query-def',
609 'Data' => $nm
611 ) if $nm;
612 $self->element(
614 'Name' => 'BlastOutput_query-len',
615 'Data' => $size
618 defined $desc && $desc =~ s/\s+$//;
619 $self->element(
621 'Name' => 'BlastOutput_querydesc',
622 'Data' => $desc
625 my ( $gi, $acc, $version ) = $self->_get_seq_identifiers($nm);
626 $version = defined($version) && length($version) ? ".$version" : "";
627 $self->element(
629 'Name' => 'BlastOutput_query-acc',
630 'Data' => "$acc$version"
632 ) if $acc;
634 # these elements are dropped with some multiquery reports; add
635 # back here
636 $self->element(
638 'Name' => 'BlastOutput_db-len',
639 'Data' => $self->{'_blsdb_length'}
641 ) if $self->{'_blsdb_length'};
642 $self->element(
644 'Name' => 'BlastOutput_db-let',
645 'Data' => $self->{'_blsdb_letters'}
647 ) if $self->{'_blsdb_letters'};
648 $self->element(
650 'Name' => 'BlastOutput_db',
651 'Data' => $self->{'_blsdb'}
653 ) if $self->{'_blsdb_letters'};
655 # added check for WU-BLAST -echofilter option (bug 2388)
656 elsif (/^>Unfiltered[+-]1$/) {
657 # skip all of the lines of unfiltered sequence
658 while($_ !~ /^Database:/) {
659 $self->debug("Bypassing features line: $_");
660 $_ = $self->_readline;
662 $self->_pushback($_);
664 elsif (/Sequences producing significant alignments:/) {
665 $self->debug("blast.pm: Processing NCBI-BLAST descriptions\n");
666 $flavor = 'ncbi';
668 # PSI-BLAST parsing needs to be fixed to specifically look
669 # for old vs new per iteration, as sorting based on duplication
670 # leads to bugs, see bug 1986
672 # The next line is not necessarily whitespace in psiblast reports.
673 # Also note that we must look for the end of this section by testing
674 # for a line with a leading >. Blank lines occur with this section
675 # for psiblast.
676 if ( !$self->in_element('iteration') ) {
677 $self->start_element( { 'Name' => 'Iteration' } );
680 # changed 8/28/2008 to exit hit table if blank line is found after an
681 # appropriate line
682 my $h_regex;
683 my $seen_block;
684 DESCLINE:
685 while ( defined( my $descline = $self->_readline() ) ) {
686 if ($descline =~ m{^\s*$}) {
687 last DESCLINE if $seen_block;
688 next DESCLINE;
690 # any text match is part of block...
691 $seen_block++;
692 # GCG multiline oddness...
693 if ($descline =~ /^(\S+)\s+Begin:\s\d+\s+End:\s+\d+/xms) {
694 my ($id, $nextline) = ($1, $self->_readline);
695 $nextline =~ s{^!}{};
696 $descline = "$id $nextline";
698 # NCBI style hit table (no N)
699 if ($descline =~ /(?<!cor) # negative lookahead
700 (\d*\.?(?:[\+\-eE]+)?\d+) # number (float or scientific notation)
701 \s+ # space
702 (\d*\.?(?:[\+\-eE]+)?\d+) # number (float or scientific notation)
703 \s*$/xms) {
705 my ( $score, $evalue ) = ($1, $2);
707 # Some data clean-up so e-value will appear numeric to perl
708 $evalue =~ s/^e/1e/i;
710 # This to handle no-HSP case
711 my @line = split ' ',$descline;
713 # we want to throw away the score, evalue
714 pop @line, pop @line;
716 # and N if it is present (of course they are not
717 # really in that order, but it doesn't matter
718 if ($3) { pop @line }
720 # add the last 2 entries s.t. we can reconstruct
721 # a minimal Hit object at the end of the day
722 push @hit_signifs, [ $evalue, $score, shift @line, join( ' ', @line ) ];
723 } elsif ($descline =~ /^CONVERGED/i) {
724 $self->element(
726 'Name' => 'Iteration_converged',
727 'Data' => 1
730 } else {
731 $self->_pushback($descline); # Catch leading > (end of section)
732 last DESCLINE;
736 elsif (/Sequences producing High-scoring Segment Pairs:/) {
738 # This block is for WU-BLAST, so we don't have to check for psi-blast stuff
739 # skip the next line
740 $self->debug("blast.pm: Processing WU-BLAST descriptions\n");
741 $_ = $self->_readline();
742 $flavor = 'wu';
744 if ( !$self->in_element('iteration') ) {
745 $self->start_element( { 'Name' => 'Iteration' } );
748 while ( defined( $_ = $self->_readline() )
749 && !/^\s+$/ )
751 my @line = split;
752 pop @line; # throw away first number which is for 'N'col
754 # add the last 2 entries to array s.t. we can reconstruct
755 # a minimal Hit object at the end of the day
756 push @hit_signifs,
757 [ pop @line, pop @line, shift @line, join( ' ', @line ) ];
761 elsif (/^Database:\s*(.+?)\s*$/) {
763 $self->debug("blast.pm: Database: $1\n");
764 my $db = $1;
765 while ( defined( $_ = $self->_readline ) ) {
766 if (
767 /^\s+(\-?[\d\,]+|\S+)\s+sequences\;
768 \s+(\-?[\d,]+|\S+)\s+ # Deal with NCBI 2.2.8 OSX problems
769 total\s+letters/ox
772 my ( $s, $l ) = ( $1, $2 );
773 $s =~ s/,//g;
774 $l =~ s/,//g;
775 $self->element(
777 'Name' => 'BlastOutput_db-len',
778 'Data' => $s
781 $self->element(
783 'Name' => 'BlastOutput_db-let',
784 'Data' => $l
787 # cache for next round in cases with multiple queries
788 $self->{'_blsdb'} = $db;
789 $self->{'_blsdb_length'} = $s;
790 $self->{'_blsdb_letters'} = $l;
791 last;
793 else {
794 chomp;
795 $db .= $_;
798 $self->element(
800 'Name' => 'BlastOutput_db',
801 'Data' => $db
806 # move inside of a hit
807 elsif (/^>\s*(\S+)\s*(.*)?/) {
808 chomp;
810 $self->debug("blast.pm: Hit: $1\n");
811 $self->in_element('hsp')
812 && $self->end_element( { 'Name' => 'Hsp' } );
813 $self->in_element('hit')
814 && $self->end_element( { 'Name' => 'Hit' } );
816 # special case when bl2seq reports don't have a leading
817 # Query=
818 if ( !$self->within_element('result') ) {
819 $self->_start_blastoutput;
820 $self->start_element( { 'Name' => 'Iteration' } );
822 elsif ( !$self->within_element('iteration') ) {
823 $self->start_element( { 'Name' => 'Iteration' } );
825 $self->start_element( { 'Name' => 'Hit' } );
826 my $id = $1;
827 my $restofline = $2;
829 $self->debug("Starting a hit: $1 $2\n");
830 $self->element(
832 'Name' => 'Hit_id',
833 'Data' => $id
836 my ($gi, $acc, $version ) = $self->_get_seq_identifiers($id);
837 $self->element(
839 'Name' => 'Hit_accession',
840 'Data' => $acc
843 # add hit significance (from the hit table)
844 # this is where Bug 1986 went awry
846 # Changed for Bug2409; hit->significance and hit->score/bits derived
847 # from HSPs, not hit table unless necessary
849 HITTABLE:
850 while (my $v = shift @hit_signifs) {
851 my $tableid = $v->[2];
852 if ($tableid !~ m{\Q$id\E}) {
853 $self->debug("Hit table ID $tableid doesn't match current hit id $id, checking next hit table entry...\n");
854 next HITTABLE;
855 } else {
856 last HITTABLE;
859 while ( defined( $_ = $self->_readline() ) ) {
860 next if (/^\s+$/);
861 chomp;
862 if (/Length\s*=\s*([\d,]+)/) {
863 my $l = $1;
864 $l =~ s/\,//g;
865 $self->element(
867 'Name' => 'Hit_len',
868 'Data' => $l
871 last;
873 else {
874 if ($restofline !~ /\s$/) { # bug #3235
875 s/^\s(?!\s)/\x01/; #new line to concatenate desc lines with <soh>
877 $restofline .= $_;
880 $restofline =~ s/\s+/ /g;
881 $self->element(
883 'Name' => 'Hit_def',
884 'Data' => $restofline
888 elsif (/\s+(Plus|Minus) Strand HSPs:/i) {
889 next;
891 elsif (
892 ( $self->in_element('hit') || $self->in_element('hsp') )
893 && # paracel genewise BTK
894 m/Score\s*=\s*(\S+)\s*bits\s* # Bit score
895 (?:\((\d+)\))?, # Raw score
896 \s+Log\-Length\sScore\s*=\s*(\d+) # Log-Length score
900 $self->in_element('hsp')
901 && $self->end_element( { 'Name' => 'Hsp' } );
902 $self->start_element( { 'Name' => 'Hsp' } );
904 $self->debug( "Got paracel genewise HSP score=$1\n");
906 # Some data clean-up so e-value will appear numeric to perl
907 my ( $bits, $score, $evalue ) = ( $1, $2, $3 );
908 $evalue =~ s/^e/1e/i;
909 $self->element(
911 'Name' => 'Hsp_score',
912 'Data' => $score
915 $self->element(
917 'Name' => 'Hsp_bit-score',
918 'Data' => $bits
921 $self->element(
923 'Name' => 'Hsp_evalue',
924 'Data' => $evalue
928 elsif (
929 ( $self->in_element('hit') || $self->in_element('hsp') )
930 && # paracel hframe BTK
931 m/Score\s*=\s*([^,\s]+), # Raw score
932 \s*Expect\s*=\s*([^,\s]+), # E-value
933 \s*P(?:\(\S+\))?\s*=\s*([^,\s]+) # P-value
937 $self->in_element('hsp')
938 && $self->end_element( { 'Name' => 'Hsp' } );
939 $self->start_element( { 'Name' => 'Hsp' } );
941 $self->debug( "Got paracel hframe HSP score=$1\n");
943 # Some data clean-up so e-value will appear numeric to perl
944 my ( $score, $evalue, $pvalue ) = ( $1, $2, $3 );
945 $evalue = "1$evalue" if $evalue =~ /^e/;
946 $pvalue = "1$pvalue" if $pvalue =~ /^e/;
948 $self->element(
950 'Name' => 'Hsp_score',
951 'Data' => $score
954 $self->element(
956 'Name' => 'Hsp_evalue',
957 'Data' => $evalue
960 $self->element(
962 'Name' => 'Hsp_pvalue',
963 'Data' => $pvalue
967 elsif (
968 ( $self->in_element('hit') || $self->in_element('hsp') )
969 && # wublast
970 m/Score\s*=\s*(\S+)\s* # Bit score
971 \(([\d\.]+)\s*bits\), # Raw score
972 \s*Expect\s*=\s*([^,\s]+), # E-value
973 \s*(?:Sum)?\s* # SUM
974 P(?:\(\d+\))?\s*=\s*([^,\s]+) # P-value
975 (?:\s*,\s+Group\s*\=\s*(\d+))? # HSP Group
978 { # wu-blast HSP parse
979 $self->in_element('hsp')
980 && $self->end_element( { 'Name' => 'Hsp' } );
981 $self->start_element( { 'Name' => 'Hsp' } );
983 # Some data clean-up so e-value will appear numeric to perl
984 my ( $score, $bits, $evalue, $pvalue, $group ) =
985 ( $1, $2, $3, $4, $5 );
986 $evalue =~ s/^e/1e/i;
987 $pvalue =~ s/^e/1e/i;
989 $self->element(
991 'Name' => 'Hsp_score',
992 'Data' => $score
995 $self->element(
997 'Name' => 'Hsp_bit-score',
998 'Data' => $bits
1001 $self->element(
1003 'Name' => 'Hsp_evalue',
1004 'Data' => $evalue
1007 $self->element(
1009 'Name' => 'Hsp_pvalue',
1010 'Data' => $pvalue
1014 if ( defined $group ) {
1015 $self->element(
1017 'Name' => 'Hsp_group',
1018 'Data' => $group
1024 elsif (
1025 ( $self->in_element('hit') || $self->in_element('hsp') )
1026 && # ncbi blast, works with 2.2.17
1027 m/^\sFeatures\s\w+\sthis\spart/xmso
1028 # If the line begins with "Features in/flanking this part of subject sequence:"
1031 $self->in_element('hsp')
1032 && $self->end_element( { 'Name' => 'Hsp' } );
1033 my $featline;
1034 $_ = $self->_readline;
1035 while($_ !~ /^\s*$/) {
1036 chomp;
1037 $featline .= $_;
1038 $_ = $self->_readline;
1040 $self->_pushback($_);
1041 $featline =~ s{(?:^\s+|\s+^)}{}g;
1042 $featline =~ s{\n}{;}g;
1043 $self->start_element( { 'Name' => 'Hsp' } );
1044 $self->element(
1046 'Name' => 'Hsp_features',
1047 'Data' => $featline
1050 $self->{'_seen_hsp_features'} = 1;
1052 elsif (
1053 ( $self->in_element('hit') || $self->in_element('hsp') )
1054 && # ncbi blast, works with 2.2.17
1055 m/Score\s*=\s*(\S+)\s*bits\s* # Bit score
1056 (?:\((\d+)\))?, # Missing for BLAT pseudo-BLAST fmt
1057 \s*Expect(?:\((\d+\+?)\))?\s*=\s*([^,\s]+) # E-value
1060 { # parse NCBI blast HSP
1061 if( !$self->{'_seen_hsp_features'} ) {
1062 $self->in_element('hsp')
1063 && $self->end_element( { 'Name' => 'Hsp' } );
1064 $self->start_element( { 'Name' => 'Hsp' } );
1067 # Some data clean-up so e-value will appear numeric to perl
1068 my ( $bits, $score, $n, $evalue ) = ( $1, $2, $3, $4 );
1069 $evalue =~ s/^e/1e/i;
1070 $self->element(
1072 'Name' => 'Hsp_score',
1073 'Data' => $score
1076 $self->element(
1078 'Name' => 'Hsp_bit-score',
1079 'Data' => $bits
1082 $self->element(
1084 'Name' => 'Hsp_evalue',
1085 'Data' => $evalue
1088 $self->element(
1090 'Name' => 'Hsp_n',
1091 'Data' => $n
1093 ) if defined $n;
1094 $score = '' unless defined $score; # deal with BLAT which
1095 # has no score only bits
1096 $self->debug("Got NCBI HSP score=$score, evalue $evalue\n");
1098 elsif (
1099 $self->in_element('hsp')
1100 && m/Identities\s*=\s*(\d+)\s*\/\s*(\d+)\s*[\d\%\(\)]+\s*
1101 (?:,\s*Positives\s*=\s*(\d+)\/(\d+)\s*[\d\%\(\)]+\s*)? # pos only valid for Protein alignments
1102 (?:\,\s*Gaps\s*=\s*(\d+)\/(\d+))? # Gaps
1103 /oxi
1106 $self->element(
1108 'Name' => 'Hsp_identity',
1109 'Data' => $1
1112 $self->element(
1114 'Name' => 'Hsp_align-len',
1115 'Data' => $2
1118 if ( defined $3 ) {
1119 $self->element(
1121 'Name' => 'Hsp_positive',
1122 'Data' => $3
1126 else {
1127 $self->element(
1129 'Name' => 'Hsp_positive',
1130 'Data' => $1
1134 if ( defined $6 ) {
1135 $self->element(
1137 'Name' => 'Hsp_gaps',
1138 'Data' => $5
1143 $self->{'_Query'} = { 'begin' => 0, 'end' => 0 };
1144 $self->{'_Sbjct'} = { 'begin' => 0, 'end' => 0 };
1146 if (/(Frame\s*=\s*.+)$/) {
1148 # handle wu-blast Frame listing on same line
1149 $self->_pushback($1);
1152 elsif ( $self->in_element('hsp')
1153 && /Strand\s*=\s*(Plus|Minus)\s*\/\s*(Plus|Minus)/i )
1156 # consume this event ( we infer strand from start/end)
1157 if (!defined($reporttype)) {
1158 $self->{'_reporttype'} = $reporttype = 'BLASTN';
1159 $bl2seq_fix = 1; # special case to resubmit the algorithm
1160 # reporttype
1162 next;
1164 elsif ( $self->in_element('hsp')
1165 && /Links\s*=\s*(\S+)/ox )
1167 $self->element(
1169 'Name' => 'Hsp_links',
1170 'Data' => $1
1174 elsif ( $self->in_element('hsp')
1175 && /Frame\s*=\s*([\+\-][1-3])\s*(\/\s*([\+\-][1-3]))?/ )
1178 # this is for bl2seq only
1179 unless ( defined $reporttype ) {
1180 $bl2seq_fix = 1;
1181 if ( $1 && $2 ) { $reporttype = 'TBLASTX' }
1182 else {
1183 $reporttype = 'BLASTX';
1185 # we can't distinguish between BLASTX and TBLASTN straight from the report }
1187 $self->{'_reporttype'} = $reporttype;
1190 my ( $queryframe, $hitframe );
1191 if ( $reporttype eq 'TBLASTX' ) {
1192 ( $queryframe, $hitframe ) = ( $1, $2 );
1193 $hitframe =~ s/\/\s*//g;
1195 elsif ( $reporttype eq 'TBLASTN' || $reporttype eq 'PSITBLASTN') {
1196 ( $hitframe, $queryframe ) = ( $1, 0 );
1198 elsif ( $reporttype eq 'BLASTX' || $reporttype eq 'RPS-BLAST(BLASTP)') {
1199 ( $queryframe, $hitframe ) = ( $1, 0 );
1200 # though NCBI doesn't report it, this is a special BLASTX-like
1201 # RPS-BLAST; should be handled differently
1202 if ($reporttype eq 'RPS-BLAST(BLASTP)') {
1203 $self->element(
1205 'Name' => 'BlastOutput_program',
1206 'Data' => 'RPS-BLAST(BLASTX)'
1211 $self->element(
1213 'Name' => 'Hsp_query-frame',
1214 'Data' => $queryframe
1218 $self->element(
1220 'Name' => 'Hsp_hit-frame',
1221 'Data' => $hitframe
1225 elsif (/^Parameters:/
1226 || /^\s+Database:\s+?/
1227 || /^\s+Subset/
1228 || /^\s*Lambda/
1229 || /^\s*Histogram/
1230 || ( $self->in_element('hsp') && /WARNING|NOTE/ ) )
1233 # Note: Lambda check was necessary to parse
1234 # t/data/ecoli_domains.rpsblast AND to parse bl2seq
1235 $self->debug("blast.pm: found parameters section \n");
1237 $self->in_element('hsp')
1238 && $self->end_element( { 'Name' => 'Hsp' } );
1239 $self->in_element('hit')
1240 && $self->end_element( { 'Name' => 'Hit' } );
1242 # This is for the case when we specify -b 0 (or B=0 for WU-BLAST)
1243 # and still want to construct minimal Hit objects
1244 $self->_cleanup_hits(\@hit_signifs) if scalar(@hit_signifs);
1245 $self->within_element('iteration')
1246 && $self->end_element( { 'Name' => 'Iteration' } );
1248 next if /^\s+Subset/;
1249 my $blast = (/^(\s+Database\:)|(\s*Lambda)/) ? 'ncbi' : 'wublast';
1250 if (/^\s*Histogram/) {
1251 $blast = 'btk';
1254 my $last = '';
1256 # default is that gaps are allowed
1257 $self->element(
1259 'Name' => 'Parameters_allowgaps',
1260 'Data' => 'yes'
1263 while ( defined( $_ = $self->_readline ) ) {
1264 if (
1265 /^((?:\S+)?BLAST[NPX]?)\s+(.+)$/i # NCBI BLAST, PSIBLAST
1266 # RPSBLAST, MEGABLAST
1267 || /^(P?GENEWISE|HFRAME|SWN|TSWN)\s+(.+)/i #Paracel BTK
1270 $self->_pushback($_);
1272 # let's handle this in the loop
1273 last;
1275 elsif (/^Query=/) {
1276 $self->_pushback($_);
1277 $self->_pushback($reportline) if $reportline;
1278 last PARSER;
1281 # here is where difference between wublast and ncbiblast
1282 # is better handled by different logic
1283 if ( /Number of Sequences:\s+([\d\,]+)/i
1284 || /of sequences in database:\s+(\-?[\d,]+)/i )
1286 my $c = $1;
1287 $c =~ s/\,//g;
1288 $self->element(
1290 'Name' => 'Statistics_db-len',
1291 'Data' => $c
1295 elsif (/letters in database:\s+(\-?[\d,]+)/i) {
1296 my $s = $1;
1297 $s =~ s/,//g;
1298 $self->element(
1300 'Name' => 'Statistics_db-let',
1301 'Data' => $s
1305 elsif ( $blast eq 'btk' ) {
1306 next;
1308 elsif ( $blast eq 'wublast' ) {
1310 # warn($_);
1311 if (/E=(\S+)/) {
1312 $self->element(
1314 'Name' => 'Parameters_expect',
1315 'Data' => $1
1319 elsif (/nogaps/) {
1320 $self->element(
1322 'Name' => 'Parameters_allowgaps',
1323 'Data' => 'no'
1327 elsif (/ctxfactor=(\S+)/) {
1328 $self->element(
1330 'Name' => 'Statistics_ctxfactor',
1331 'Data' => $1
1335 elsif (
1336 /(postsw|links|span[12]?|warnings|notes|gi|noseqs|qres|qype)/
1339 $self->element(
1341 'Name' => "Parameters_$1",
1342 'Data' => 'yes'
1346 elsif (/(\S+)=(\S+)/) {
1347 $self->element(
1349 'Name' => "Parameters_$1",
1350 'Data' => $2
1354 elsif ( $last =~ /(Frame|Strand)\s+MatID\s+Matrix name/i ) {
1355 my $firstgapinfo = 1;
1356 my $frame = undef;
1357 while ( defined($_) && !/^\s+$/ ) {
1358 s/^\s+//;
1359 s/\s+$//;
1360 if ( $firstgapinfo
1361 && s/Q=(\d+),R=(\d+)\s+//x )
1363 $firstgapinfo = 0;
1365 $self->element(
1367 'Name' => 'Parameters_gap-open',
1368 'Data' => $1
1371 $self->element(
1373 'Name' => 'Parameters_gap-extend',
1374 'Data' => $2
1377 my @fields = split;
1379 for my $type (
1380 qw(lambda_gapped
1381 kappa_gapped
1382 entropy_gapped)
1385 next if $type eq 'n/a';
1386 if ( !@fields ) {
1387 warn "fields is empty for $type\n";
1388 next;
1390 $self->element(
1392 'Name' =>
1393 "Statistics_frame$frame\_$type",
1394 'Data' => shift @fields
1399 else {
1400 my ( $frameo, $matid, $matrix, @fields ) =
1401 split;
1402 if ( !defined $frame ) {
1404 # keep some sort of default feature I guess
1405 # even though this is sort of wrong
1406 $self->element(
1408 'Name' => 'Parameters_matrix',
1409 'Data' => $matrix
1412 $self->element(
1414 'Name' => 'Statistics_lambda',
1415 'Data' => $fields[0]
1418 $self->element(
1420 'Name' => 'Statistics_kappa',
1421 'Data' => $fields[1]
1424 $self->element(
1426 'Name' => 'Statistics_entropy',
1427 'Data' => $fields[2]
1431 $frame = $frameo;
1432 my $ii = 0;
1433 for my $type (
1434 qw(lambda_used
1435 kappa_used
1436 entropy_used
1437 lambda_computed
1438 kappa_computed
1439 entropy_computed)
1442 my $f = $fields[$ii];
1443 next unless defined $f; # deal with n/a
1444 if ( $f eq 'same' ) {
1445 $f = $fields[ $ii - 3 ];
1447 $ii++;
1448 $self->element(
1450 'Name' =>
1451 "Statistics_frame$frame\_$type",
1452 'Data' => $f
1459 # get the next line
1460 $_ = $self->_readline;
1462 $last = $_;
1464 elsif ( $last =~ /(Frame|Strand)\s+MatID\s+Length/i ) {
1465 my $frame = undef;
1466 while ( defined($_) && !/^\s+/ ) {
1467 s/^\s+//;
1468 s/\s+$//;
1469 my @fields = split;
1470 if ( @fields <= 3 ) {
1471 for my $type (qw(X_gapped E2_gapped S2)) {
1472 last unless @fields;
1473 $self->element(
1475 'Name' =>
1476 "Statistics_frame$frame\_$type",
1477 'Data' => shift @fields
1482 else {
1484 for my $type (
1485 qw(length
1486 efflength
1487 E S W T X E2 S2)
1490 $self->element(
1492 'Name' =>
1493 "Statistics_frame$frame\_$type",
1494 'Data' => shift @fields
1499 $_ = $self->_readline;
1501 $last = $_;
1503 elsif (/(\S+\s+\S+)\s+DFA:\s+(\S+)\s+\((.+)\)/) {
1504 if ( $1 eq 'states in' ) {
1505 $self->element(
1507 'Name' => 'Statistics_DFA_states',
1508 'Data' => "$2 $3"
1512 elsif ( $1 eq 'size of' ) {
1513 $self->element(
1515 'Name' => 'Statistics_DFA_size',
1516 'Data' => "$2 $3"
1521 elsif (
1522 m/^\s+Time to generate neighborhood:\s+
1523 (\S+\s+\S+\s+\S+)/x
1526 $self->element(
1528 'Name' => 'Statistics_neighbortime',
1529 'Data' => $1
1533 elsif (/processors\s+used:\s+(\d+)/) {
1534 $self->element(
1536 'Name' => 'Statistics_noprocessors',
1537 'Data' => $1
1541 elsif (
1542 m/^\s+(\S+)\s+cpu\s+time:\s+# cputype
1543 (\S+\s+\S+\s+\S+) # cputime
1544 \s+Elapsed:\s+(\S+)/x
1547 my $cputype = lc($1);
1548 $self->element(
1550 'Name' => "Statistics_$cputype\_cputime",
1551 'Data' => $2
1554 $self->element(
1556 'Name' => "Statistics_$cputype\_actualtime",
1557 'Data' => $3
1561 elsif (/^\s+Start:/) {
1562 my ( $junk, $start, $stime, $end, $etime ) =
1563 split( /\s+(Start|End)\:\s+/, $_ );
1564 chomp($stime);
1565 $self->element(
1567 'Name' => 'Statistics_starttime',
1568 'Data' => $stime
1571 chomp($etime);
1572 $self->element(
1574 'Name' => 'Statistics_endtime',
1575 'Data' => $etime
1579 elsif (/^\s+Database:\s+(.+)$/) {
1580 $self->element(
1582 'Name' => 'Parameters_full_dbpath',
1583 'Data' => $1
1588 elsif (/^\s+Posted:\s+(.+)/) {
1589 my $d = $1;
1590 chomp($d);
1591 $self->element(
1593 'Name' => 'Statistics_posted_date',
1594 'Data' => $d
1599 elsif ( $blast eq 'ncbi' ) {
1601 if (m/^Matrix:\s+(.+)\s*$/oxi) {
1602 $self->element(
1604 'Name' => 'Parameters_matrix',
1605 'Data' => $1
1609 elsif (/^Gapped/) {
1610 $gapped_stats = 1;
1612 elsif (/^Lambda/) {
1613 $_ = $self->_readline;
1614 s/^\s+//;
1615 my ( $lambda, $kappa, $entropy ) = split;
1616 if ($gapped_stats) {
1617 $self->element(
1619 'Name' => "Statistics_gapped_lambda",
1620 'Data' => $lambda
1623 $self->element(
1625 'Name' => "Statistics_gapped_kappa",
1626 'Data' => $kappa
1629 $self->element(
1631 'Name' => "Statistics_gapped_entropy",
1632 'Data' => $entropy
1636 else {
1637 $self->element(
1639 'Name' => "Statistics_lambda",
1640 'Data' => $lambda
1643 $self->element(
1645 'Name' => "Statistics_kappa",
1646 'Data' => $kappa
1649 $self->element(
1651 'Name' => "Statistics_entropy",
1652 'Data' => $entropy
1657 elsif (m/effective\s+search\s+space\s+used:\s+(\d+)/oxi) {
1658 $self->element(
1660 'Name' => 'Statistics_eff-spaceused',
1661 'Data' => $1
1665 elsif (m/effective\s+search\s+space:\s+(\d+)/oxi) {
1666 $self->element(
1668 'Name' => 'Statistics_eff-space',
1669 'Data' => $1
1673 elsif (
1674 m/Gap\s+Penalties:\s+Existence:\s+(\d+)\,
1675 \s+Extension:\s+(\d+)/ox
1678 $self->element(
1680 'Name' => 'Parameters_gap-open',
1681 'Data' => $1
1684 $self->element(
1686 'Name' => 'Parameters_gap-extend',
1687 'Data' => $2
1691 elsif (/effective\s+HSP\s+length:\s+(\d+)/) {
1692 $self->element(
1694 'Name' => 'Statistics_hsp-len',
1695 'Data' => $1
1699 elsif (/Number\s+of\s+HSP's\s+better\s+than\s+(\S+)\s+without\s+gapping:\s+(\d+)/) {
1700 $self->element(
1702 'Name' => 'Statistics_number_of_hsps_better_than_expect_value_cutoff_without_gapping',
1703 'Data' => $2
1707 elsif (/Number\s+of\s+HSP's\s+gapped:\s+(\d+)/) {
1708 $self->element(
1710 'Name' => 'Statistics_number_of_hsps_gapped',
1711 'Data' => $1
1715 elsif (/Number\s+of\s+HSP's\s+successfully\s+gapped:\s+(\d+)/) {
1716 $self->element(
1718 'Name' => 'Statistics_number_of_hsps_successfully_gapped',
1719 'Data' => $1
1723 elsif (/Length\s+adjustment:\s+(\d+)/) {
1724 $self->element(
1726 'Name' => 'Statistics_length_adjustment',
1727 'Data' => $1
1731 elsif (/effective\s+length\s+of\s+query:\s+([\d\,]+)/i) {
1732 my $c = $1;
1733 $c =~ s/\,//g;
1734 $self->element(
1736 'Name' => 'Statistics_query-len',
1737 'Data' => $c
1741 elsif (/effective\s+length\s+of\s+database:\s+([\d\,]+)/i) {
1742 my $c = $1;
1743 $c =~ s/\,//g;
1744 $self->element(
1746 'Name' => 'Statistics_eff-dblen',
1747 'Data' => $c
1751 elsif (
1752 /^(T|A|X1|X2|X3|S1|S2):\s+(\d+(\.\d+)?)\s+(?:\(\s*(\d+\.\d+) bits\))?/
1755 my $v = $2;
1756 chomp($v);
1757 $self->element(
1759 'Name' => "Statistics_$1",
1760 'Data' => $v
1763 if ( defined $4 ) {
1764 $self->element(
1766 'Name' => "Statistics_$1_bits",
1767 'Data' => $4
1772 elsif (
1773 m/frameshift\s+window\,
1774 \s+decay\s+const:\s+(\d+)\,\s+([\.\d]+)/x
1777 $self->element(
1779 'Name' => 'Statistics_framewindow',
1780 'Data' => $1
1783 $self->element(
1785 'Name' => 'Statistics_decay',
1786 'Data' => $2
1790 elsif (m/^Number\s+of\s+Hits\s+to\s+DB:\s+(\S+)/ox) {
1791 $self->element(
1793 'Name' => 'Statistics_hit_to_db',
1794 'Data' => $1
1798 elsif (m/^Number\s+of\s+extensions:\s+(\S+)/ox) {
1799 $self->element(
1801 'Name' => 'Statistics_num_extensions',
1802 'Data' => $1
1806 elsif (
1807 m/^Number\s+of\s+successful\s+extensions:\s+
1808 (\S+)/ox
1811 $self->element(
1813 'Name' => 'Statistics_num_suc_extensions',
1814 'Data' => $1
1818 elsif (
1819 m/^Number\s+of\s+sequences\s+better\s+than\s+
1820 (\S+):\s+(\d+)/ox
1823 $self->element(
1825 'Name' => 'Parameters_expect',
1826 'Data' => $1
1829 $self->element(
1831 'Name' => 'Statistics_seqs_better_than_cutoff',
1832 'Data' => $2
1836 elsif (/^\s+Posted\s+date:\s+(.+)/) {
1837 my $d = $1;
1838 chomp($d);
1839 $self->element(
1841 'Name' => 'Statistics_posted_date',
1842 'Data' => $d
1846 elsif ( !/^\s+$/ ) {
1847 #$self->debug( "unmatched stat $_");
1850 $last = $_;
1852 } elsif ( $self->in_element('hsp') ) {
1853 $self->debug("blast.pm: Processing HSP\n");
1854 # let's read 3 lines at a time;
1855 # bl2seq hackiness... Not sure I like
1856 $self->{'_reporttype'} ||= $DEFAULTREPORTTYPE;
1857 my %data = (
1858 'Query' => '',
1859 'Mid' => '',
1860 'Hit' => ''
1862 my $len;
1863 for ( my $i = 0 ; defined($_) && $i < 3 ; $i++ ) {
1864 # $self->debug("$i: $_") if $v;
1865 if ( ( $i == 0 && /^\s+$/) ||
1866 /^\s*(?:Lambda|Minus|Plus|Score)/i )
1868 $self->_pushback($_) if defined $_;
1869 $self->end_element( { 'Name' => 'Hsp' } );
1870 last;
1872 chomp;
1873 if (/^((Query|Sbjct):?\s+(\-?\d+)?\s*)(\S+)\s+(\-?\d+)?/) {
1874 my ( $full, $type, $start, $str, $end ) =
1875 ( $1, $2, $3, $4, $5 );
1877 if ( $str eq '-' ) {
1878 $i = 3 if $type eq 'Sbjct';
1880 else {
1881 $data{$type} = $str;
1883 $len = length($full);
1884 $self->{"\_$type"}->{'begin'} = $start
1885 unless $self->{"_$type"}->{'begin'};
1886 $self->{"\_$type"}->{'end'} = $end;
1887 } elsif (/^((Query|Sbjct):?\s+(\-?0+)\s*)/) {
1888 # Bug from NCBI's BLAST: unaligned output
1889 $_ = $self->_readline() for 0..1;
1890 last;
1891 } else {
1892 $self->throw("no data for midline $_")
1893 unless ( defined $_ && defined $len );
1894 $data{'Mid'} = substr( $_, $len );
1896 $_ = $self->_readline();
1898 $self->characters(
1900 'Name' => 'Hsp_qseq',
1901 'Data' => $data{'Query'}
1904 $self->characters(
1906 'Name' => 'Hsp_hseq',
1907 'Data' => $data{'Sbjct'}
1910 $self->characters(
1912 'Name' => 'Hsp_midline',
1913 'Data' => $data{'Mid'}
1917 else {
1918 #$self->debug("blast.pm: unrecognized line $_");
1922 $self->debug("blast.pm: End of BlastOutput\n");
1923 if ( $self->{'_seentop'} ) {
1924 $self->within_element('hsp')
1925 && $self->end_element( { 'Name' => 'Hsp' } );
1926 $self->within_element('hit')
1927 && $self->end_element( { 'Name' => 'Hit' } );
1928 # cleanup extra hits
1929 $self->_cleanup_hits(\@hit_signifs) if scalar(@hit_signifs);
1930 $self->within_element('iteration')
1931 && $self->end_element( { 'Name' => 'Iteration' } );
1932 if ($bl2seq_fix) {
1933 $self->element(
1935 'Name' => 'BlastOutput_program',
1936 'Data' => $reporttype
1940 $self->end_element( { 'Name' => 'BlastOutput' } );
1942 return $self->end_document();
1945 # Private method for internal use only.
1946 sub _start_blastoutput {
1947 my $self = shift;
1948 $self->start_element( { 'Name' => 'BlastOutput' } );
1949 $self->{'_seentop'} = 1;
1950 $self->{'_result_count'}++;
1951 $self->{'_handler_rc'} = undef;
1954 =head2 _will_handle
1956 Title : _will_handle
1957 Usage : Private method. For internal use only.
1958 if( $self->_will_handle($type) ) { ... }
1959 Function: Provides an optimized way to check whether or not an element of a
1960 given type is to be handled.
1961 Returns : Reference to EventHandler object if the element type is to be handled.
1962 undef if the element type is not to be handled.
1963 Args : string containing type of element.
1965 Optimizations:
1967 =over 2
1969 =item 1
1971 Using the cached pointer to the EventHandler to minimize repeated
1972 lookups.
1974 =item 2
1976 Caching the will_handle status for each type that is encountered so
1977 that it only need be checked by calling
1978 handler-E<gt>will_handle($type) once.
1980 =back
1982 This does not lead to a major savings by itself (only 5-10%). In
1983 combination with other optimizations, or for large parse jobs, the
1984 savings good be significant.
1986 To test against the unoptimized version, remove the parentheses from
1987 around the third term in the ternary " ? : " operator and add two
1988 calls to $self-E<gt>_eventHandler().
1990 =cut
1992 sub _will_handle {
1993 my ( $self, $type ) = @_;
1994 my $handler = $self->{'_handler_cache'};
1995 my $will_handle =
1996 defined( $self->{'_will_handle_cache'}->{$type} )
1997 ? $self->{'_will_handle_cache'}->{$type}
1998 : ( $self->{'_will_handle_cache'}->{$type} =
1999 $handler->will_handle($type) );
2001 return $will_handle ? $handler : undef;
2004 =head2 start_element
2006 Title : start_element
2007 Usage : $eventgenerator->start_element
2008 Function: Handles a start element event
2009 Returns : none
2010 Args : hashref with at least 2 keys 'Data' and 'Name'
2012 =cut
2014 sub start_element {
2015 my ( $self, $data ) = @_;
2017 # we currently don't care about attributes
2018 my $nm = $data->{'Name'};
2019 my $type = $MODEMAP{$nm};
2020 if ($type) {
2021 my $handler = $self->_will_handle($type);
2022 if ($handler) {
2023 my $func = sprintf( "start_%s", lc $type );
2024 $self->{'_handler_rc'} = $handler->$func( $data->{'Attributes'} );
2026 #else {
2027 #$self->debug( # changed 4/29/2006 to play nice with other event handlers
2028 # "Bio::SearchIO::InternalParserError ".
2029 # "\nCan't handle elements of type \'$type.\'"
2032 unshift @{ $self->{'_elements'} }, $type;
2033 if ( $type eq 'result' ) {
2034 $self->{'_values'} = {};
2035 $self->{'_result'} = undef;
2036 } else {
2037 # cleanup some things
2038 if ( defined $self->{'_values'} ) {
2039 foreach my $k (
2040 grep { /^\U$type\-/ }
2041 keys %{ $self->{'_values'} }
2044 delete $self->{'_values'}->{$k};
2051 =head2 end_element
2053 Title : end_element
2054 Usage : $eventgenerator->end_element
2055 Function: Handles an end element event
2056 Returns : hashref with an element's worth of data
2057 Args : hashref with at least 2 keys 'Data' and 'Name'
2060 =cut
2062 sub end_element {
2063 my ( $self, $data ) = @_;
2065 my $nm = $data->{'Name'};
2066 my $type;
2067 my $rc;
2068 # cache these (TODO: we should probably cache all cross-report data)
2069 if ( $nm eq 'BlastOutput_program' ) {
2070 if ( $self->{'_last_data'} =~ /(t?blast[npx])/i ) {
2071 $self->{'_reporttype'} = uc $1;
2073 $self->{'_reporttype'} ||= $DEFAULTREPORTTYPE;
2076 if ( $nm eq 'BlastOutput_version' ) {
2077 $self->{'_reportversion'} = $self->{'_last_data'};
2080 # Hsps are sort of weird, in that they end when another
2081 # object begins so have to detect this in end_element for now
2082 if ( $nm eq 'Hsp' ) {
2083 foreach (qw(Hsp_qseq Hsp_midline Hsp_hseq Hsp_features)) {
2084 $self->element(
2086 'Name' => $_,
2087 'Data' => $self->{'_last_hspdata'}->{$_}
2089 ) if defined $self->{'_last_hspdata'}->{$_};
2091 $self->{'_last_hspdata'} = {};
2092 $self->element(
2094 'Name' => 'Hsp_query-from',
2095 'Data' => $self->{'_Query'}->{'begin'}
2098 $self->element(
2100 'Name' => 'Hsp_query-to',
2101 'Data' => $self->{'_Query'}->{'end'}
2105 $self->element(
2107 'Name' => 'Hsp_hit-from',
2108 'Data' => $self->{'_Sbjct'}->{'begin'}
2111 $self->element(
2113 'Name' => 'Hsp_hit-to',
2114 'Data' => $self->{'_Sbjct'}->{'end'}
2118 # } elsif( $nm eq 'Iteration' ) {
2119 # Nothing special needs to be done here.
2121 if ( $type = $MODEMAP{$nm} ) {
2122 my $handler = $self->_will_handle($type);
2123 if ($handler) {
2124 my $func = sprintf( "end_%s", lc $type );
2125 $rc = $handler->$func( $self->{'_reporttype'}, $self->{'_values'} );
2127 shift @{ $self->{'_elements'} };
2130 elsif ( $MAPPING{$nm} ) {
2131 if ( ref( $MAPPING{$nm} ) =~ /hash/i ) {
2133 # this is where we shove in the data from the
2134 # hashref info about params or statistics
2135 my $key = ( keys %{ $MAPPING{$nm} } )[0];
2136 $self->{'_values'}->{$key}->{ $MAPPING{$nm}->{$key} } =
2137 $self->{'_last_data'};
2139 else {
2140 $self->{'_values'}->{ $MAPPING{$nm} } = $self->{'_last_data'};
2143 else {
2144 #$self->debug("blast.pm: unknown nm $nm, ignoring\n");
2146 $self->{'_last_data'} = ''; # remove read data if we are at
2147 # end of an element
2148 $self->{'_result'} = $rc if ( defined $type && $type eq 'result' );
2149 $self->{'_seen_hsp_features'} = 0;
2150 return $rc;
2153 =head2 element
2155 Title : element
2156 Usage : $eventhandler->element({'Name' => $name, 'Data' => $str});
2157 Function: Convenience method that calls start_element, characters, end_element
2158 Returns : none
2159 Args : Hash ref with the keys 'Name' and 'Data'
2162 =cut
2164 sub element {
2165 my ( $self, $data ) = @_;
2166 # Note that start element isn't needed for character data
2167 # Not too SAX-y, though
2168 #$self->start_element($data);
2169 $self->characters($data);
2170 $self->end_element($data);
2173 =head2 characters
2175 Title : characters
2176 Usage : $eventgenerator->characters($str)
2177 Function: Send a character events
2178 Returns : none
2179 Args : string
2182 =cut
2184 sub characters {
2185 my ( $self, $data ) = @_;
2186 if ( $self->in_element('hsp')
2187 && $data->{'Name'} =~ /^Hsp\_(qseq|hseq|midline)$/ )
2189 $self->{'_last_hspdata'}->{ $data->{'Name'} } .= $data->{'Data'}
2190 if defined $data->{'Data'};
2192 return unless ( defined $data->{'Data'} && $data->{'Data'} !~ /^\s+$/ );
2193 $self->{'_last_data'} = $data->{'Data'};
2196 =head2 within_element
2198 Title : within_element
2199 Usage : if( $eventgenerator->within_element($element) ) {}
2200 Function: Test if we are within a particular element
2201 This is different than 'in' because within can be tested
2202 for a whole block.
2203 Returns : boolean
2204 Args : string element name
2206 See Also: L<in_element>
2208 =cut
2210 sub within_element {
2211 my ( $self, $name ) = @_;
2212 return 0
2213 if ( !defined $name && !defined $self->{'_elements'}
2214 || scalar @{ $self->{'_elements'} } == 0 );
2215 foreach ( @{ $self->{'_elements'} } ) {
2216 if ( $_ eq $name ) {
2217 return 1;
2220 return 0;
2223 =head2 in_element
2225 Title : in_element
2226 Usage : if( $eventgenerator->in_element($element) ) {}
2227 Function: Test if we are in a particular element
2228 This is different than 'within_element' because within
2229 can be tested for a whole block.
2230 Returns : boolean
2231 Args : string element name
2233 See Also: L<within_element>
2235 =cut
2237 sub in_element {
2238 my ( $self, $name ) = @_;
2239 return 0 if !defined $self->{'_elements'}->[0];
2240 return ( $self->{'_elements'}->[0] eq $name );
2243 =head2 start_document
2245 Title : start_document
2246 Usage : $eventgenerator->start_document
2247 Function: Handle a start document event
2248 Returns : none
2249 Args : none
2252 =cut
2254 sub start_document {
2255 my ($self) = @_;
2256 $self->{'_lasttype'} = '';
2257 $self->{'_values'} = {};
2258 $self->{'_result'} = undef;
2259 $self->{'_elements'} = [];
2262 =head2 end_document
2264 Title : end_document
2265 Usage : $eventgenerator->end_document
2266 Function: Handles an end document event
2267 Returns : Bio::Search::Result::ResultI object
2268 Args : none
2271 =cut
2273 sub end_document {
2274 my ( $self, @args ) = @_;
2276 #$self->debug("blast.pm: end_document\n");
2277 return $self->{'_result'};
2280 sub write_result {
2281 my ( $self, $blast, @args ) = @_;
2283 if ( not defined( $self->writer ) ) {
2284 $self->warn("Writer not defined. Using a $DEFAULT_BLAST_WRITER_CLASS");
2285 $self->writer( $DEFAULT_BLAST_WRITER_CLASS->new() );
2287 $self->SUPER::write_result( $blast, @args );
2290 sub result_count {
2291 my $self = shift;
2292 return $self->{'_result_count'};
2295 sub report_count { shift->result_count }
2297 =head2 inclusion_threshold
2299 Title : inclusion_threshold
2300 Usage : my $incl_thresh = $isreb->inclusion_threshold;
2301 : $isreb->inclusion_threshold(1e-5);
2302 Function: Get/Set the e-value threshold for inclusion in the PSI-BLAST
2303 score matrix model (blastpgp) that was used for generating the reports
2304 being parsed.
2305 Returns : number (real)
2306 Default value: $Bio::SearchIO::IteratedSearchResultEventBuilder::DEFAULT_INCLUSION_THRESHOLD
2307 Args : number (real) (e.g., 0.0001 or 1e-4 )
2309 =cut
2311 =head2 max_significance
2313 Usage : $obj->max_significance();
2314 Purpose : Set/Get the P or Expect value used as significance screening cutoff.
2315 This is the value of the -signif parameter supplied to new().
2316 Hits with P or E-value above this are skipped.
2317 Returns : Scientific notation number with this format: 1.0e-05.
2318 Argument : Scientific notation number or float (when setting)
2319 Comments : Screening of significant hits uses the data provided on the
2320 : description line. For NCBI BLAST1 and WU-BLAST, this data
2321 : is P-value. for NCBI BLAST2 it is an Expect value.
2323 =cut
2325 =head2 signif
2327 Synonym for L<max_significance()|max_significance>
2329 =cut
2331 =head2 min_score
2333 Usage : $obj->min_score();
2334 Purpose : Set/Get the Blast score used as screening cutoff.
2335 This is the value of the -score parameter supplied to new().
2336 Hits with scores below this are skipped.
2337 Returns : Integer or scientific notation number.
2338 Argument : Integer or scientific notation number (when setting)
2339 Comments : Screening of significant hits uses the data provided on the
2340 : description line.
2342 =cut
2344 =head2 min_query_length
2346 Usage : $obj->min_query_length();
2347 Purpose : Gets the query sequence length used as screening criteria.
2348 This is the value of the -min_query_len parameter supplied to new().
2349 Hits with sequence length below this are skipped.
2350 Returns : Integer
2351 Argument : n/a
2353 =cut
2355 =head2 best_hit_only
2357 Title : best_hit_only
2358 Usage : print "only getting best hit.\n" if $obj->best_hit_only;
2359 Purpose : Set/Get the indicator for whether or not to process only
2360 : the best BlastHit.
2361 Returns : Boolean (1 | 0)
2362 Argument : Boolean (1 | 0) (when setting)
2364 =cut
2366 =head2 check_all_hits
2368 Title : check_all_hits
2369 Usage : print "checking all hits.\n" if $obj->check_all_hits;
2370 Purpose : Set/Get the indicator for whether or not to process all hits.
2371 : If false, the parser will stop processing hits after the
2372 : the first non-significance hit or the first hit that fails
2373 : any hit filter.
2374 Returns : Boolean (1 | 0)
2375 Argument : Boolean (1 | 0) (when setting)
2377 =cut
2379 # general private method used to make minimal hits from leftover
2380 # data in the hit table
2382 sub _cleanup_hits {
2383 my ($self, $hits) = @_;
2384 while ( my $v = shift @{ $hits }) {
2385 next unless defined $v;
2386 $self->start_element( { 'Name' => 'Hit' } );
2387 my $id = $v->[2];
2388 my $desc = $v->[3];
2389 $self->element(
2391 'Name' => 'Hit_id',
2392 'Data' => $id
2395 my ($gi, $acc, $version ) = $self->_get_seq_identifiers($id);
2396 $self->element(
2398 'Name' => 'Hit_accession',
2399 'Data' => $acc
2402 if ( defined $v ) {
2403 $self->element(
2405 'Name' => 'Hit_signif',
2406 'Data' => $v->[0]
2409 if (exists $self->{'_wublast'}) {
2410 $self->element(
2412 'Name' => 'Hit_score',
2413 'Data' => $v->[1]
2416 } else {
2417 $self->element(
2419 'Name' => 'Hit_bits',
2420 'Data' => $v->[1]
2426 $self->element(
2428 'Name' => 'Hit_def',
2429 'Data' => $desc
2432 $self->end_element( { 'Name' => 'Hit' } );
2439 __END__
2441 Developer Notes
2442 ---------------
2444 The following information is added in hopes of increasing the
2445 maintainability of this code. It runs the risk of becoming obsolete as
2446 the code gets updated. As always, double check against the actual
2447 source. If you find any discrepencies, please correct them.
2448 [ This documentation added on 3 Jun 2003. ]
2450 The logic is the brainchild of Jason Stajich, documented by Steve
2451 Chervitz. Jason: please check it over and modify as you see fit.
2453 Question:
2454 Elmo wants to know: How does this module unmarshall data from the input stream?
2455 (i.e., how does information from a raw input file get added to
2456 the correct Bioperl object?)
2458 Answer:
2460 This answer is specific to SearchIO::blast, but may apply to other
2461 SearchIO.pm subclasses as well. The following description gives the
2462 basic idea. The actual processing is a little more complex for
2463 certain types of data (HSP, Report Parameters).
2465 You can think of blast::next_result() as faking a SAX XML parser,
2466 making a non-XML document behave like its XML. The overhead to do this
2467 is quite substantial (~650 lines of code instead of ~80 in
2468 blastxml.pm).
2470 0. First, add a key => value pair for the datum of interest to %MAPPING
2471 Example:
2472 'Foo_bar' => 'Foo-bar',
2474 1. next_result() collects the datum of interest from the input stream,
2475 and calls element().
2476 Example:
2477 $self->element({ 'Name' => 'Foo_bar',
2478 'Data' => $foobar});
2480 2. The element() method is a convenience method that calls start_element(),
2481 characters(), and end_element().
2483 3. start_element() checks to see if the event handler can handle a start_xxx(),
2484 where xxx = the 'Name' parameter passed into element(), and calls start_xxx()
2485 if so. Otherwise, start_element() does not do anything.
2487 Data that will have such an event handler are defined in %MODEMAP.
2488 Typically, there are only handler methods for the main parts of
2489 the search result (e.g., Result, Iteration, Hit, HSP),
2490 which have corresponding Bioperl modules. So in this example,
2491 there was an earlier call such as $self->element({'Name'=>'Foo'})
2492 and the Foo_bar datum is meant to ultimately go into a Foo object.
2494 The start_foo() method in the handler will typically do any
2495 data initialization necessary to prepare for creating a new Foo object.
2496 Example: SearchResultEventBuilder::start_result()
2498 4. characters() takes the value of the 'Data' key from the hashref argument in
2499 the elements() call and saves it in a local data member:
2500 Example:
2501 $self->{'_last_data'} = $data->{'Data'};
2503 5. end_element() is like start_element() in that it does the check for whether
2504 the event handler can handle end_xxx() and if so, calls it, passing in
2505 the data collected from all of the characters() calls that occurred
2506 since the start_xxx() call.
2508 If there isn't any special handler for the data type specified by 'Name',
2509 end_element() will place the data saved by characters() into another
2510 local data member that saves it in a hash with a key defined by %MAPPING.
2511 Example:
2512 $nm = $data->{'Name'};
2513 $self->{'_values'}->{$MAPPING{$nm}} = $self->{'_last_data'};
2515 In this case, $MAPPING{$nm} is 'Foo-bar'.
2517 end_element() finishes by resetting the local data member used by
2518 characters(). (i.e., $self->{'_last_data'} = '';)
2520 6. When the next_result() method encounters the end of the Foo element in the
2521 input stream. It will invoke $self->end_element({'Name'=>'Foo'}).
2522 end_element() then sends all of the data in the $self->{'_values'} hash.
2523 Note that $self->{'_values'} is cleaned out during start_element(),
2524 keeping it at a resonable size.
2526 In the event handler, the end_foo() method takes the hash from end_element()
2527 and creates a new hash containing the same data, but having keys lacking
2528 the 'Foo' prefix (e.g., 'Foo-bar' becomes '-bar'). The handler's end_foo()
2529 method then creates the Foo object, passing in this new hash as an argument.
2530 Example: SearchResultEventBuilder::end_result()
2532 7. Objects created from the data in the search result are managed by
2533 the event handler which adds them to a ResultI object (using API methods
2534 for that object). The ResultI object gets passed back to
2535 SearchIO::end_element() when it calls end_result().
2537 The ResultI object is then saved in an internal data member of the
2538 SearchIO object, which returns it at the end of next_result()
2539 by calling end_document().
2541 (Technical Note: All objects created by end_xxx() methods in the event
2542 handler are returned to SearchIO::end_element(), but the SearchIO object
2543 only cares about the ResultI objects.)
2545 (Sesame Street aficionados note: This answer was NOT given by Mr. Noodle ;-P)