changes all issue tracking in preparation for switch to github issues
[bioperl-live.git] / Bio / SearchIO / blast.pm
blob6daa6aedd59ef820a700f7113492e925506ab671
2 # BioPerl module for Bio::SearchIO::blast
4 # Please direct questions and support issues to <bioperl-l@bioperl.org>
6 # Cared for by Jason Stajich <jason@bioperl.org>
8 # Copyright Jason Stajich
10 # You may distribute this module under the same terms as perl itself
12 # POD documentation - main docs before the code
14 # 20030409 - sac
15 # PSI-BLAST full parsing support. Rollout of new
16 # model which will remove Steve's old psiblast driver
17 # 20030424 - jason
18 # Megablast parsing fix as reported by Neil Saunders
19 # 20030427 - jason
20 # Support bl2seq parsing
21 # 20031124 - jason
22 # Parse more blast statistics, lambda, entropy, etc
23 # from WU-BLAST in frame-specific manner
24 # 20060216 - cjf - fixed blast parsing for BLAST v2.2.13 output
25 # 20071104 - dmessina - added support for WUBLAST -echofilter
26 # 20071121 - cjf - fixed several bugs (bugs 2391, 2399, 2409)
28 =head1 NAME
30 Bio::SearchIO::blast - Event generator for event based parsing of
31 blast reports
33 =head1 SYNOPSIS
35 # Do not use this object directly - it is used as part of the
36 # Bio::SearchIO system.
38 use Bio::SearchIO;
39 my $searchio = Bio::SearchIO->new(-format => 'blast',
40 -file => 't/data/ecolitst.bls');
41 while( my $result = $searchio->next_result ) {
42 while( my $hit = $result->next_hit ) {
43 while( my $hsp = $hit->next_hsp ) {
44 # ...
49 =head1 DESCRIPTION
51 This object encapsulated the necessary methods for generating events
52 suitable for building Bio::Search objects from a BLAST report file.
53 Read the L<Bio::SearchIO> for more information about how to use this.
55 This driver can parse:
57 =over 4
59 =item *
61 NCBI produced plain text BLAST reports from blastall, this also
62 includes PSIBLAST, PSITBLASTN, RPSBLAST, and bl2seq reports. NCBI XML
63 BLAST output is parsed with the blastxml SearchIO driver
65 =item *
67 WU-BLAST all reports
69 =item *
71 Jim Kent's BLAST-like output from his programs (BLASTZ, BLAT)
73 =item *
75 BLAST-like output from Paracel BTK output
77 =back
79 =head2 bl2seq parsing
81 Since I cannot differentiate between BLASTX and TBLASTN since bl2seq
82 doesn't report the algorithm used - I assume it is BLASTX by default -
83 you can supply the program type with -report_type in the SearchIO
84 constructor i.e.
86 my $parser = Bio::SearchIO->new(-format => 'blast',
87 -file => 'bl2seq.tblastn.report',
88 -report_type => 'tblastn');
90 This only really affects where the frame and strand information are
91 put - they will always be on the $hsp-E<gt>query instead of on the
92 $hsp-E<gt>hit part of the feature pair for blastx and tblastn bl2seq
93 produced reports. Hope that's clear...
95 =head1 FEEDBACK
97 =head2 Mailing Lists
99 User feedback is an integral part of the evolution of this and other
100 Bioperl modules. Send your comments and suggestions preferably to
101 the Bioperl mailing list. Your participation is much appreciated.
103 bioperl-l@bioperl.org - General discussion
104 http://bioperl.org/wiki/Mailing_lists - About the mailing lists
106 =head2 Support
108 Please direct usage questions or support issues to the mailing list:
110 I<bioperl-l@bioperl.org>
112 rather than to the module maintainer directly. Many experienced and
113 reponsive experts will be able look at the problem and quickly
114 address it. Please include a thorough description of the problem
115 with code and data examples if at all possible.
117 =head2 Reporting Bugs
119 Report bugs to the Bioperl bug tracking system to help us keep track
120 of the bugs and their resolution. Bug reports can be submitted via the
121 web:
123 https://github.com/bioperl/bioperl-live/issues
125 =head1 AUTHOR - Jason Stajich
127 Email Jason Stajich jason-at-bioperl.org
129 =head1 CONTRIBUTORS
131 Steve Chervitz sac-at-bioperl.org
133 =head1 APPENDIX
135 The rest of the documentation details each of the object methods.
136 Internal methods are usually preceded with a _
138 =cut
140 # Let the code begin...'
142 package Bio::SearchIO::blast;
144 use Bio::SearchIO::IteratedSearchResultEventBuilder;
145 use strict;
146 use vars qw(%MAPPING %MODEMAP
147 $DEFAULT_BLAST_WRITER_CLASS
148 $MAX_HSP_OVERLAP
149 $DEFAULT_SIGNIF
150 $DEFAULT_SCORE
151 $DEFAULTREPORTTYPE
155 use base qw(Bio::SearchIO);
156 use Data::Dumper;
158 BEGIN {
160 # mapping of NCBI Blast terms to Bioperl hash keys
161 %MODEMAP = (
162 'BlastOutput' => 'result',
163 'Iteration' => 'iteration',
164 'Hit' => 'hit',
165 'Hsp' => 'hsp'
168 # This should really be done more intelligently, like with
169 # XSLT
171 %MAPPING = (
172 'Hsp_bit-score' => 'HSP-bits',
173 'Hsp_score' => 'HSP-score',
174 'Hsp_evalue' => 'HSP-evalue',
175 'Hsp_n', => 'HSP-n',
176 'Hsp_pvalue' => 'HSP-pvalue',
177 'Hsp_query-from' => 'HSP-query_start',
178 'Hsp_query-to' => 'HSP-query_end',
179 'Hsp_hit-from' => 'HSP-hit_start',
180 'Hsp_hit-to' => 'HSP-hit_end',
181 'Hsp_positive' => 'HSP-conserved',
182 'Hsp_identity' => 'HSP-identical',
183 'Hsp_gaps' => 'HSP-hsp_gaps',
184 'Hsp_hitgaps' => 'HSP-hit_gaps',
185 'Hsp_querygaps' => 'HSP-query_gaps',
186 'Hsp_qseq' => 'HSP-query_seq',
187 'Hsp_hseq' => 'HSP-hit_seq',
188 'Hsp_midline' => 'HSP-homology_seq',
189 'Hsp_align-len' => 'HSP-hsp_length',
190 'Hsp_query-frame' => 'HSP-query_frame',
191 'Hsp_hit-frame' => 'HSP-hit_frame',
192 'Hsp_links' => 'HSP-links',
193 'Hsp_group' => 'HSP-hsp_group',
194 'Hsp_features' => 'HSP-hit_features',
196 'Hit_id' => 'HIT-name',
197 'Hit_len' => 'HIT-length',
198 'Hit_accession' => 'HIT-accession',
199 'Hit_def' => 'HIT-description',
200 'Hit_signif' => 'HIT-significance',
201 # For NCBI blast, the description line contains bits.
202 # For WU-blast, the description line contains score.
203 'Hit_score' => 'HIT-score',
204 'Hit_bits' => 'HIT-bits',
206 'Iteration_iter-num' => 'ITERATION-number',
207 'Iteration_converged' => 'ITERATION-converged',
209 'BlastOutput_program' => 'RESULT-algorithm_name',
210 'BlastOutput_version' => 'RESULT-algorithm_version',
211 'BlastOutput_algorithm-reference' => 'RESULT-algorithm_reference',
212 'BlastOutput_rid' => 'RESULT-rid',
213 'BlastOutput_query-def' => 'RESULT-query_name',
214 'BlastOutput_query-len' => 'RESULT-query_length',
215 'BlastOutput_query-acc' => 'RESULT-query_accession',
216 'BlastOutput_query-gi' => 'RESULT-query_gi',
217 'BlastOutput_querydesc' => 'RESULT-query_description',
218 'BlastOutput_db' => 'RESULT-database_name',
219 'BlastOutput_db-len' => 'RESULT-database_entries',
220 'BlastOutput_db-let' => 'RESULT-database_letters',
221 'BlastOutput_inclusion-threshold' => 'RESULT-inclusion_threshold',
223 'Parameters_matrix' => { 'RESULT-parameters' => 'matrix' },
224 'Parameters_expect' => { 'RESULT-parameters' => 'expect' },
225 'Parameters_include' => { 'RESULT-parameters' => 'include' },
226 'Parameters_sc-match' => { 'RESULT-parameters' => 'match' },
227 'Parameters_sc-mismatch' => { 'RESULT-parameters' => 'mismatch' },
228 'Parameters_gap-open' => { 'RESULT-parameters' => 'gapopen' },
229 'Parameters_gap-extend' => { 'RESULT-parameters' => 'gapext' },
230 'Parameters_filter' => { 'RESULT-parameters' => 'filter' },
231 'Parameters_allowgaps' => { 'RESULT-parameters' => 'allowgaps' },
232 'Parameters_full_dbpath' => { 'RESULT-parameters' => 'full_dbpath' },
233 'Statistics_db-len' => { 'RESULT-statistics' => 'dbentries' },
234 'Statistics_db-let' => { 'RESULT-statistics' => 'dbletters' },
235 'Statistics_hsp-len' =>
236 { 'RESULT-statistics' => 'effective_hsplength' },
237 'Statistics_query-len' => { 'RESULT-statistics' => 'querylength' },
238 'Statistics_eff-space' => { 'RESULT-statistics' => 'effectivespace' },
239 'Statistics_eff-spaceused' =>
240 { 'RESULT-statistics' => 'effectivespaceused' },
241 'Statistics_eff-dblen' =>
242 { 'RESULT-statistics' => 'effectivedblength' },
243 'Statistics_kappa' => { 'RESULT-statistics' => 'kappa' },
244 'Statistics_lambda' => { 'RESULT-statistics' => 'lambda' },
245 'Statistics_entropy' => { 'RESULT-statistics' => 'entropy' },
246 'Statistics_gapped_kappa' => { 'RESULT-statistics' => 'kappa_gapped' },
247 'Statistics_gapped_lambda' =>
248 { 'RESULT-statistics' => 'lambda_gapped' },
249 'Statistics_gapped_entropy' =>
250 { 'RESULT-statistics' => 'entropy_gapped' },
252 'Statistics_framewindow' =>
253 { 'RESULT-statistics' => 'frameshiftwindow' },
254 'Statistics_decay' => { 'RESULT-statistics' => 'decayconst' },
256 'Statistics_hit_to_db' => { 'RESULT-statistics' => 'Hits_to_DB' },
257 'Statistics_num_suc_extensions' =>
258 { 'RESULT-statistics' => 'num_successful_extensions' },
259 'Statistics_length_adjustment' => { 'RESULT-statistics' => 'length_adjustment' },
260 'Statistics_number_of_hsps_better_than_expect_value_cutoff_without_gapping' =>
261 { 'RESULT-statistics' => 'number_of_hsps_better_than_expect_value_cutoff_without_gapping' },
262 'Statistics_number_of_hsps_gapped' => { 'RESULT-statistics' => 'number_of_hsps_gapped' },
263 'Statistics_number_of_hsps_successfully_gapped' => { 'RESULT-statistics' => 'number_of_hsps_successfully_gapped' },
265 # WU-BLAST stats
266 'Statistics_DFA_states' => { 'RESULT-statistics' => 'num_dfa_states' },
267 'Statistics_DFA_size' => { 'RESULT-statistics' => 'dfa_size' },
268 'Statistics_noprocessors' =>
269 { 'RESULT-statistics' => 'no_of_processors' },
270 'Statistics_neighbortime' =>
271 { 'RESULT-statistics' => 'neighborhood_generate_time' },
272 'Statistics_starttime' => { 'RESULT-statistics' => 'start_time' },
273 'Statistics_endtime' => { 'RESULT-statistics' => 'end_time' },
276 # add WU-BLAST Frame-Based Statistics
277 for my $frame ( 0 .. 3 ) {
278 for my $strand ( '+', '-' ) {
279 for my $ind (
280 qw(length efflength E S W T X X_gapped E2
281 E2_gapped S2)
284 $MAPPING{"Statistics_frame$strand$frame\_$ind"} =
285 { 'RESULT-statistics' => "Frame$strand$frame\_$ind" };
287 for my $val (qw(lambda kappa entropy )) {
288 for my $type (qw(used computed gapped)) {
289 my $key = "Statistics_frame$strand$frame\_$val\_$type";
290 my $val =
291 { 'RESULT-statistics' =>
292 "Frame$strand$frame\_$val\_$type" };
293 $MAPPING{$key} = $val;
299 # add Statistics
300 for my $stats (
301 qw(T A X1 X2 X3 S1 S2 X1_bits X2_bits X3_bits
302 S1_bits S2_bits num_extensions
303 num_successful_extensions
304 seqs_better_than_cutoff
305 posted_date
306 search_cputime total_cputime
307 search_actualtime total_actualtime
308 no_of_processors ctxfactor)
311 my $key = "Statistics_$stats";
312 my $val = { 'RESULT-statistics' => $stats };
313 $MAPPING{$key} = $val;
316 # add WU-BLAST Parameters
317 for my $param (
318 qw(span span1 span2 links warnings notes hspsepsmax
319 hspsepqmax topcomboN topcomboE postsw cpus wordmask
320 filter sort_by_pvalue sort_by_count sort_by_highscore
321 sort_by_totalscore sort_by_subjectlength noseqs gi qtype
322 qres V B Z Y M N)
325 my $key = "Parameters_$param";
326 my $val = { 'RESULT-parameters' => $param };
327 $MAPPING{$key} = $val;
330 $DEFAULT_BLAST_WRITER_CLASS = 'Bio::SearchIO::Writer::HitTableWriter';
331 $MAX_HSP_OVERLAP = 2; # Used when tiling multiple HSPs.
332 $DEFAULTREPORTTYPE = 'BLASTP'; # for bl2seq
335 =head2 new
337 Title : new
338 Usage : my $obj = Bio::SearchIO::blast->new(%args);
339 Function: Builds a new Bio::SearchIO::blast object
340 Returns : Bio::SearchIO::blast
341 Args : Key-value pairs:
342 -fh/-file => filehandle/filename to BLAST file
343 -format => 'blast'
344 -report_type => 'blastx', 'tblastn', etc -- only for bl2seq
345 reports when you want to distinguish between
346 tblastn and blastx reports (this only controls
347 where the frame information is put - on the query
348 or subject object.
349 -inclusion_threshold => e-value threshold for inclusion in the
350 PSI-BLAST score matrix model (blastpgp)
351 -signif => float or scientific notation number to be used
352 as a P- or Expect value cutoff
353 -score => integer or scientific notation number to be used
354 as a blast score value cutoff
355 -bits => integer or scientific notation number to be used
356 as a bit score value cutoff
357 -hit_filter => reference to a function to be used for
358 filtering hits based on arbitrary criteria.
359 All hits of each BLAST report must satisfy
360 this criteria to be retained.
361 If a hit fails this test, it is ignored.
362 This function should take a
363 Bio::Search::Hit::BlastHit.pm object as its first
364 argument and return true
365 if the hit should be retained.
366 Sample filter function:
367 -hit_filter => sub { $hit = shift;
368 $hit->gaps == 0; },
369 (Note: -filt_func is synonymous with -hit_filter)
370 -overlap => integer. The amount of overlap to permit between
371 adjacent HSPs when tiling HSPs. A reasonable value is 2.
372 Default = $Bio::SearchIO::blast::MAX_HSP_OVERLAP.
374 The following criteria are not yet supported:
375 (these are probably best applied within this module rather than in the
376 event handler since they would permit the parser to take some shortcuts.)
378 -check_all_hits => boolean. Check all hits for significance against
379 significance criteria. Default = false.
380 If false, stops processing hits after the first
381 non-significant hit or the first hit that fails
382 the hit_filter call. This speeds parsing,
383 taking advantage of the fact that the hits
384 are processed in the order they appear in the report.
385 -min_query_len => integer to be used as a minimum for query sequence length.
386 Reports with query sequences below this length will
387 not be processed. Default = no minimum length.
388 -best => boolean. Only process the best hit of each report;
389 default = false.
391 =cut
393 sub _initialize {
394 my ( $self, @args ) = @_;
395 $self->SUPER::_initialize(@args);
397 # Blast reports require a specialized version of the SREB due to the
398 # possibility of iterations (PSI-BLAST). Forwarding all arguments to it. An
399 # issue here is that we want to set new default object factories if none are
400 # supplied.
402 my $handler = Bio::SearchIO::IteratedSearchResultEventBuilder->new(@args);
403 $self->attach_EventHandler($handler);
405 # 2006-04-26 move this to the attach_handler function in this module so we
406 # can really reset the handler
407 # Optimization: caching
408 # the EventHandler since it is used a lot during the parse.
410 # $self->{'_handler_cache'} = $handler;
412 my ($rpttype ) = $self->_rearrange(
415 REPORT_TYPE)
417 @args
419 defined $rpttype && ( $self->{'_reporttype'} = $rpttype );
422 sub attach_EventHandler {
423 my ($self,$handler) = @_;
425 $self->SUPER::attach_EventHandler($handler);
427 # Optimization: caching the EventHandler since it is used a lot
428 # during the parse.
430 $self->{'_handler_cache'} = $handler;
431 return;
434 =head2 next_result
436 Title : next_result
437 Usage : my $hit = $searchio->next_result;
438 Function: Returns the next Result from a search
439 Returns : Bio::Search::Result::ResultI object
440 Args : none
442 =cut
444 sub next_result {
445 my ($self) = @_;
446 my $v = $self->verbose;
447 my $data = '';
448 my $flavor = '';
449 $self->{'_seentop'} = 0; # start next report at top
451 my ( $reporttype, $seenquery, $reportline, $reportversion );
452 my ( $seeniteration, $found_again );
453 my $incl_threshold = $self->inclusion_threshold;
454 my $bl2seq_fix;
455 $self->start_document(); # let the fun begin...
456 my (@hit_signifs);
457 my $gapped_stats = 0; # for switching between gapped/ungapped
458 # lambda, K, H
459 local $_ = "\n"; #consistency
460 PARSER:
461 while ( defined( $_ = $self->_readline ) ) {
462 next if (/^\s+$/); # skip empty lines
463 next if (/CPU time:/);
464 next if (/^>\s*$/);
465 next if (/[*]+\s+No hits found\s+[*]+/);
466 if (
467 /^((?:\S+?)?BLAST[NPX]?)\s+(.+)$/i # NCBI BLAST, PSIBLAST
468 # RPSBLAST, MEGABLAST
469 || /^(P?GENEWISE|HFRAME|SWN|TSWN)\s+(.+)/i #Paracel BTK
472 ($reporttype, $reportversion) = ($1, $2);
473 # need to keep track of whether this is WU-BLAST
474 if ($reportversion && $reportversion =~ m{WashU$}) {
475 $self->{'_wublast'}++;
477 $self->debug("blast.pm: Start of new report: $reporttype, $reportversion\n");
478 if ( $self->{'_seentop'} ) {
479 # This handles multi-result input streams
480 $self->_pushback($_);
481 last PARSER;
483 $self->_start_blastoutput;
484 if ($reporttype =~ /RPS-BLAST/) {
485 $reporttype .= '(BLASTP)'; # default RPS-BLAST type
487 $reportline = $_; # to fix the fact that RPS-BLAST output is wrong
488 $self->element(
490 'Name' => 'BlastOutput_program',
491 'Data' => $reporttype
495 $self->element(
497 'Name' => 'BlastOutput_version',
498 'Data' => $reportversion
501 $self->element(
503 'Name' => 'BlastOutput_inclusion-threshold',
504 'Data' => $incl_threshold
508 # parse the BLAST algorithm reference
509 elsif(/^Reference:\s+(.*)$/) {
510 # want to preserve newlines for the BLAST algorithm reference
511 my $algorithm_reference = "$1\n";
512 $_ = $self->_readline;
513 # while the current line, does not match an empty line, a RID:, a
514 # Database:, or a query definition line (Query=) we are still
515 # looking at the algorithm_reference, append it to what we parsed so
516 # far
517 while($_ !~ /^$/ && $_ !~ /^RID:/ && $_ !~ /^Database:/ && $_ !~ /^Query=/) {
518 $algorithm_reference .= "$_";
519 $_ = $self->_readline;
521 # if we exited the while loop, we saw an empty line, a RID:, or a
522 # Database:, so push it back
523 $self->_pushback($_);
524 $self->element(
526 'Name' => 'BlastOutput_algorithm-reference',
527 'Data' => $algorithm_reference
531 # parse BLAST RID (Request ID)
532 elsif(/^RID:\s+(.*)$/) {
533 my $rid = $1;
534 $self->element(
536 'Name' => 'BlastOutput_rid',
537 'Data' => $rid
541 # added Windows workaround for bug 1985
542 elsif (/^(Searching|Results from round)/) {
543 next unless $1 =~ /Results from round/;
544 $self->debug("blast.pm: Possible psi blast iterations found...\n");
546 $self->in_element('hsp')
547 && $self->end_element( { 'Name' => 'Hsp' } );
548 $self->in_element('hit')
549 && $self->end_element( { 'Name' => 'Hit' } );
550 if ( defined $seeniteration ) {
551 $self->within_element('iteration')
552 && $self->end_element( { 'Name' => 'Iteration' } );
553 $self->start_element( { 'Name' => 'Iteration' } );
555 else {
556 $self->start_element( { 'Name' => 'Iteration' } );
558 $seeniteration = 1;
560 elsif (/^Query=\s*(.*)$/) {
561 my $q = $1;
562 $self->debug("blast.pm: Query= found...$_\n");
563 my $size = 0;
564 if ( defined $seenquery ) {
565 $self->_pushback($_);
566 $self->_pushback($reportline) if $reportline;
567 last PARSER;
569 if ( !defined $reporttype ) {
570 $self->_start_blastoutput;
571 if ( defined $seeniteration ) {
572 $self->in_element('iteration')
573 && $self->end_element( { 'Name' => 'Iteration' } );
574 $self->start_element( { 'Name' => 'Iteration' } );
576 else {
577 $self->start_element( { 'Name' => 'Iteration' } );
579 $seeniteration = 1;
581 $seenquery = $q;
582 $_ = $self->_readline;
583 while ( defined($_) ) {
584 if (/^Database:/) {
585 $self->_pushback($_);
586 last;
588 # below line fixes length issue with BLAST v2.2.13; still works
589 # with BLAST v2.2.12
590 if ( /\((\-?[\d,]+)\s+letters.*\)/ || /^Length=(\-?[\d,]+)/ ) {
591 $size = $1;
592 $size =~ s/,//g;
593 last;
595 else {
596 # bug 2391
597 $q .= ($q =~ /\w$/ && $_ =~ /^\w/) ? " $_" : $_;
598 $q =~ s/\s+/ /g; # this catches the newline as well
599 $q =~ s/^ | $//g;
602 $_ = $self->_readline;
604 chomp($q);
605 my ( $nm, $desc ) = split( /\s+/, $q, 2 );
606 $self->element(
608 'Name' => 'BlastOutput_query-def',
609 'Data' => $nm
611 ) if $nm;
612 $self->element(
614 'Name' => 'BlastOutput_query-len',
615 'Data' => $size
618 defined $desc && $desc =~ s/\s+$//;
619 $self->element(
621 'Name' => 'BlastOutput_querydesc',
622 'Data' => $desc
625 my ( $gi, $acc, $version ) = $self->_get_seq_identifiers($nm);
626 $version = defined($version) && length($version) ? ".$version" : "";
627 $self->element(
629 'Name' => 'BlastOutput_query-acc',
630 'Data' => "$acc$version"
632 ) if $acc;
634 # these elements are dropped with some multiquery reports; add
635 # back here
636 $self->element(
638 'Name' => 'BlastOutput_db-len',
639 'Data' => $self->{'_blsdb_length'}
641 ) if $self->{'_blsdb_length'};
642 $self->element(
644 'Name' => 'BlastOutput_db-let',
645 'Data' => $self->{'_blsdb_letters'}
647 ) if $self->{'_blsdb_letters'};
648 $self->element(
650 'Name' => 'BlastOutput_db',
651 'Data' => $self->{'_blsdb'}
653 ) if $self->{'_blsdb_letters'};
655 # added check for WU-BLAST -echofilter option (bug 2388)
656 elsif (/^>Unfiltered[+-]1$/) {
657 # skip all of the lines of unfiltered sequence
658 while($_ !~ /^Database:/) {
659 $self->debug("Bypassing features line: $_");
660 $_ = $self->_readline;
662 $self->_pushback($_);
664 elsif (/Sequences producing significant alignments:/) {
665 $self->debug("blast.pm: Processing NCBI-BLAST descriptions\n");
666 $flavor = 'ncbi';
668 # PSI-BLAST parsing needs to be fixed to specifically look
669 # for old vs new per iteration, as sorting based on duplication
670 # leads to bugs, see bug 1986
672 # The next line is not necessarily whitespace in psiblast reports.
673 # Also note that we must look for the end of this section by testing
674 # for a line with a leading >. Blank lines occur with this section
675 # for psiblast.
676 if ( !$self->in_element('iteration') ) {
677 $self->start_element( { 'Name' => 'Iteration' } );
680 # changed 8/28/2008 to exit hit table if blank line is found after an
681 # appropriate line
682 my $h_regex;
683 my $seen_block;
684 DESCLINE:
685 while ( defined( my $descline = $self->_readline() ) ) {
686 if ($descline =~ m{^\s*$}) {
687 last DESCLINE if $seen_block;
688 next DESCLINE;
690 # any text match is part of block...
691 $seen_block++;
692 # GCG multiline oddness...
693 if ($descline =~ /^(\S+)\s+Begin:\s\d+\s+End:\s+\d+/xms) {
694 my ($id, $nextline) = ($1, $self->_readline);
695 $nextline =~ s{^!}{};
696 $descline = "$id $nextline";
698 # NCBI style hit table (no N)
699 if ($descline =~ /(?<!cor) # negative lookahead
700 (\d*\.?(?:[\+\-eE]+)?\d+) # number (float or scientific notation)
701 \s+ # space
702 (\d*\.?(?:[\+\-eE]+)?\d+) # number (float or scientific notation)
703 \s*$/xms) {
705 my ( $score, $evalue ) = ($1, $2);
707 # Some data clean-up so e-value will appear numeric to perl
708 $evalue =~ s/^e/1e/i;
710 # This to handle no-HSP case
711 my @line = split ' ',$descline;
713 # we want to throw away the score, evalue
714 pop @line, pop @line;
716 # and N if it is present (of course they are not
717 # really in that order, but it doesn't matter
718 if ($3) { pop @line }
720 # add the last 2 entries s.t. we can reconstruct
721 # a minimal Hit object at the end of the day
722 push @hit_signifs, [ $evalue, $score, shift @line, join( ' ', @line ) ];
723 } elsif ($descline =~ /^CONVERGED/i) {
724 $self->element(
726 'Name' => 'Iteration_converged',
727 'Data' => 1
730 } else {
731 $self->_pushback($descline); # Catch leading > (end of section)
732 last DESCLINE;
736 elsif (/Sequences producing High-scoring Segment Pairs:/) {
738 # This block is for WU-BLAST, so we don't have to check for psi-blast stuff
739 # skip the next line
740 $self->debug("blast.pm: Processing WU-BLAST descriptions\n");
741 $_ = $self->_readline();
742 $flavor = 'wu';
744 if ( !$self->in_element('iteration') ) {
745 $self->start_element( { 'Name' => 'Iteration' } );
748 while ( defined( $_ = $self->_readline() )
749 && !/^\s+$/ )
751 my @line = split;
752 pop @line; # throw away first number which is for 'N'col
754 # add the last 2 entries to array s.t. we can reconstruct
755 # a minimal Hit object at the end of the day
756 push @hit_signifs,
757 [ pop @line, pop @line, shift @line, join( ' ', @line ) ];
761 elsif (/^Database:\s*(.+?)\s*$/) {
763 $self->debug("blast.pm: Database: $1\n");
764 my $db = $1;
765 while ( defined( $_ = $self->_readline ) ) {
766 if (
767 /^\s+(\-?[\d\,]+|\S+)\s+sequences\;
768 \s+(\-?[\d,]+|\S+)\s+ # Deal with NCBI 2.2.8 OSX problems
769 total\s+letters/ox
772 my ( $s, $l ) = ( $1, $2 );
773 $s =~ s/,//g;
774 $l =~ s/,//g;
775 $self->element(
777 'Name' => 'BlastOutput_db-len',
778 'Data' => $s
781 $self->element(
783 'Name' => 'BlastOutput_db-let',
784 'Data' => $l
787 # cache for next round in cases with multiple queries
788 $self->{'_blsdb'} = $db;
789 $self->{'_blsdb_length'} = $s;
790 $self->{'_blsdb_letters'} = $l;
791 last;
793 else {
794 chomp;
795 $db .= $_;
798 $self->element(
800 'Name' => 'BlastOutput_db',
801 'Data' => $db
806 # move inside of a hit
807 elsif (/^(?:Subject=|>)\s*(\S+)\s*(.*)?/) {
808 chomp;
810 $self->debug("blast.pm: Hit: $1\n");
811 $self->in_element('hsp')
812 && $self->end_element( { 'Name' => 'Hsp' } );
813 $self->in_element('hit')
814 && $self->end_element( { 'Name' => 'Hit' } );
816 # special case when bl2seq reports don't have a leading
817 # Query=
818 if ( !$self->within_element('result') ) {
819 $self->_start_blastoutput;
820 $self->start_element( { 'Name' => 'Iteration' } );
822 elsif ( !$self->within_element('iteration') ) {
823 $self->start_element( { 'Name' => 'Iteration' } );
825 $self->start_element( { 'Name' => 'Hit' } );
826 my $id = $1;
827 my $restofline = $2;
829 $self->debug("Starting a hit: $1 $2\n");
830 $self->element(
832 'Name' => 'Hit_id',
833 'Data' => $id
836 my ($gi, $acc, $version ) = $self->_get_seq_identifiers($id);
837 $self->element(
839 'Name' => 'Hit_accession',
840 'Data' => $acc
843 # add hit significance (from the hit table)
844 # this is where Bug 1986 went awry
846 # Changed for Bug2409; hit->significance and hit->score/bits derived
847 # from HSPs, not hit table unless necessary
849 HITTABLE:
850 while (my $v = shift @hit_signifs) {
851 my $tableid = $v->[2];
852 if ($tableid !~ m{\Q$id\E}) {
853 $self->debug("Hit table ID $tableid doesn't match current hit id $id, checking next hit table entry...\n");
854 next HITTABLE;
855 } else {
856 last HITTABLE;
859 while ( defined( $_ = $self->_readline() ) ) {
860 next if (/^\s+$/);
861 chomp;
862 if (/Length\s*=\s*([\d,]+)/) {
863 my $l = $1;
864 $l =~ s/\,//g;
865 $self->element(
867 'Name' => 'Hit_len',
868 'Data' => $l
871 last;
873 else {
874 if ($restofline !~ /\s$/) { # bug #3235
875 s/^\s(?!\s)/\x01/; #new line to concatenate desc lines with <soh>
877 $restofline .= ($restofline =~ /\w$/ && $_ =~ /^\w/) ? " $_" : $_;
878 $restofline =~ s/\s+/ /g; # this catches the newline as well
879 $restofline =~ s/^ | $//g;
882 $restofline =~ s/\s+/ /g;
883 $self->element(
885 'Name' => 'Hit_def',
886 'Data' => $restofline
890 elsif (/\s+(Plus|Minus) Strand HSPs:/i) {
891 next;
893 elsif (
894 ( $self->in_element('hit') || $self->in_element('hsp') )
895 && # paracel genewise BTK
896 m/Score\s*=\s*(\S+)\s*bits\s* # Bit score
897 (?:\((\d+)\))?, # Raw score
898 \s+Log\-Length\sScore\s*=\s*(\d+) # Log-Length score
902 $self->in_element('hsp')
903 && $self->end_element( { 'Name' => 'Hsp' } );
904 $self->start_element( { 'Name' => 'Hsp' } );
906 $self->debug( "Got paracel genewise HSP score=$1\n");
908 # Some data clean-up so e-value will appear numeric to perl
909 my ( $bits, $score, $evalue ) = ( $1, $2, $3 );
910 $evalue =~ s/^e/1e/i;
911 $self->element(
913 'Name' => 'Hsp_score',
914 'Data' => $score
917 $self->element(
919 'Name' => 'Hsp_bit-score',
920 'Data' => $bits
923 $self->element(
925 'Name' => 'Hsp_evalue',
926 'Data' => $evalue
930 elsif (
931 ( $self->in_element('hit') || $self->in_element('hsp') )
932 && # paracel hframe BTK
933 m/Score\s*=\s*([^,\s]+), # Raw score
934 \s*Expect\s*=\s*([^,\s]+), # E-value
935 \s*P(?:\(\S+\))?\s*=\s*([^,\s]+) # P-value
939 $self->in_element('hsp')
940 && $self->end_element( { 'Name' => 'Hsp' } );
941 $self->start_element( { 'Name' => 'Hsp' } );
943 $self->debug( "Got paracel hframe HSP score=$1\n");
945 # Some data clean-up so e-value will appear numeric to perl
946 my ( $score, $evalue, $pvalue ) = ( $1, $2, $3 );
947 $evalue = "1$evalue" if $evalue =~ /^e/;
948 $pvalue = "1$pvalue" if $pvalue =~ /^e/;
950 $self->element(
952 'Name' => 'Hsp_score',
953 'Data' => $score
956 $self->element(
958 'Name' => 'Hsp_evalue',
959 'Data' => $evalue
962 $self->element(
964 'Name' => 'Hsp_pvalue',
965 'Data' => $pvalue
969 elsif (
970 ( $self->in_element('hit') || $self->in_element('hsp') )
971 && # wublast
972 m/Score\s*=\s*(\S+)\s* # Bit score
973 \(([\d\.]+)\s*bits\), # Raw score
974 \s*Expect\s*=\s*([^,\s]+), # E-value
975 \s*(?:Sum)?\s* # SUM
976 P(?:\(\d+\))?\s*=\s*([^,\s]+) # P-value
977 (?:\s*,\s+Group\s*\=\s*(\d+))? # HSP Group
980 { # wu-blast HSP parse
981 $self->in_element('hsp')
982 && $self->end_element( { 'Name' => 'Hsp' } );
983 $self->start_element( { 'Name' => 'Hsp' } );
985 # Some data clean-up so e-value will appear numeric to perl
986 my ( $score, $bits, $evalue, $pvalue, $group ) =
987 ( $1, $2, $3, $4, $5 );
988 $evalue =~ s/^e/1e/i;
989 $pvalue =~ s/^e/1e/i;
991 $self->element(
993 'Name' => 'Hsp_score',
994 'Data' => $score
997 $self->element(
999 'Name' => 'Hsp_bit-score',
1000 'Data' => $bits
1003 $self->element(
1005 'Name' => 'Hsp_evalue',
1006 'Data' => $evalue
1009 $self->element(
1011 'Name' => 'Hsp_pvalue',
1012 'Data' => $pvalue
1016 if ( defined $group ) {
1017 $self->element(
1019 'Name' => 'Hsp_group',
1020 'Data' => $group
1026 elsif (
1027 ( $self->in_element('hit') || $self->in_element('hsp') )
1028 && # ncbi blast, works with 2.2.17
1029 m/^\sFeatures\s\w+\sthis\spart/xmso
1030 # If the line begins with "Features in/flanking this part of subject sequence:"
1033 $self->in_element('hsp')
1034 && $self->end_element( { 'Name' => 'Hsp' } );
1035 my $featline;
1036 $_ = $self->_readline;
1037 while($_ !~ /^\s*$/) {
1038 chomp;
1039 $featline .= $_;
1040 $_ = $self->_readline;
1042 $self->_pushback($_);
1043 $featline =~ s{(?:^\s+|\s+^)}{}g;
1044 $featline =~ s{\n}{;}g;
1045 $self->start_element( { 'Name' => 'Hsp' } );
1046 $self->element(
1048 'Name' => 'Hsp_features',
1049 'Data' => $featline
1052 $self->{'_seen_hsp_features'} = 1;
1054 elsif (
1055 ( $self->in_element('hit') || $self->in_element('hsp') )
1056 && # ncbi blast, works with 2.2.17
1057 m/Score\s*=\s*(\S+)\s*bits\s* # Bit score
1058 (?:\((\d+)\))?, # Missing for BLAT pseudo-BLAST fmt
1059 \s*Expect(?:\((\d+\+?)\))?\s*=\s*([^,\s]+) # E-value
1062 { # parse NCBI blast HSP
1063 if( !$self->{'_seen_hsp_features'} ) {
1064 $self->in_element('hsp')
1065 && $self->end_element( { 'Name' => 'Hsp' } );
1066 $self->start_element( { 'Name' => 'Hsp' } );
1069 # Some data clean-up so e-value will appear numeric to perl
1070 my ( $bits, $score, $n, $evalue ) = ( $1, $2, $3, $4 );
1071 $evalue =~ s/^e/1e/i;
1072 $self->element(
1074 'Name' => 'Hsp_score',
1075 'Data' => $score
1078 $self->element(
1080 'Name' => 'Hsp_bit-score',
1081 'Data' => $bits
1084 $self->element(
1086 'Name' => 'Hsp_evalue',
1087 'Data' => $evalue
1090 $self->element(
1092 'Name' => 'Hsp_n',
1093 'Data' => $n
1095 ) if defined $n;
1096 $score = '' unless defined $score; # deal with BLAT which
1097 # has no score only bits
1098 $self->debug("Got NCBI HSP score=$score, evalue $evalue\n");
1100 elsif (
1101 $self->in_element('hsp')
1102 && m/Identities\s*=\s*(\d+)\s*\/\s*(\d+)\s*[\d\%\(\)]+\s*
1103 (?:,\s*Positives\s*=\s*(\d+)\/(\d+)\s*[\d\%\(\)]+\s*)? # pos only valid for Protein alignments
1104 (?:\,\s*Gaps\s*=\s*(\d+)\/(\d+))? # Gaps
1105 /oxi
1108 $self->element(
1110 'Name' => 'Hsp_identity',
1111 'Data' => $1
1114 $self->element(
1116 'Name' => 'Hsp_align-len',
1117 'Data' => $2
1120 if ( defined $3 ) {
1121 $self->element(
1123 'Name' => 'Hsp_positive',
1124 'Data' => $3
1128 else {
1129 $self->element(
1131 'Name' => 'Hsp_positive',
1132 'Data' => $1
1136 if ( defined $6 ) {
1137 $self->element(
1139 'Name' => 'Hsp_gaps',
1140 'Data' => $5
1145 $self->{'_Query'} = { 'begin' => 0, 'end' => 0 };
1146 $self->{'_Sbjct'} = { 'begin' => 0, 'end' => 0 };
1148 if (/(Frame\s*=\s*.+)$/) {
1150 # handle wu-blast Frame listing on same line
1151 $self->_pushback($1);
1154 elsif ( $self->in_element('hsp')
1155 && /Strand\s*=\s*(Plus|Minus)\s*\/\s*(Plus|Minus)/i )
1158 # consume this event ( we infer strand from start/end)
1159 if (!defined($reporttype)) {
1160 $self->{'_reporttype'} = $reporttype = 'BLASTN';
1161 $bl2seq_fix = 1; # special case to resubmit the algorithm
1162 # reporttype
1164 next;
1166 elsif ( $self->in_element('hsp')
1167 && /Links\s*=\s*(\S+)/ox )
1169 $self->element(
1171 'Name' => 'Hsp_links',
1172 'Data' => $1
1176 elsif ( $self->in_element('hsp')
1177 && /Frame\s*=\s*([\+\-][1-3])\s*(\/\s*([\+\-][1-3]))?/ )
1179 my $frame1 = $1 || 0;
1180 my $frame2 = $2 || 0;
1181 # this is for bl2seq only
1182 if ( not defined $reporttype ) {
1183 $bl2seq_fix = 1;
1184 if ( $frame1 && $frame2 ) {
1185 $reporttype = 'TBLASTX'
1187 else {
1188 # We can distinguish between BLASTX and TBLASTN from the report
1189 # (and assign $frame1 properly) by using the start/end from query.
1190 # If the report is BLASTX, the coordinates distance from query
1191 # will be 3 times the length of the alignment shown (coordinates in nt,
1192 # alignment in aa); if not then subject is the nucleotide sequence (TBLASTN).
1193 # Will have to fast-forward to query alignment line and then go back.
1194 my $fh = $self->_fh;
1195 my $file_pos = tell $fh;
1197 my $a_position = '';
1198 my $ali_length = '';
1199 my $b_position = '';
1200 while (my $line = <$fh>) {
1201 if ($line =~ m/^(?:Query|Sbjct):?\s+(\-?\d+)?\s*(\S+)\s+(\-?\d+)?/) {
1202 $a_position = $1;
1203 my $alignment = $2;
1204 $b_position = $3;
1206 use Bio::LocatableSeq;
1207 my $gap_symbols = $Bio::LocatableSeq::GAP_SYMBOLS;
1208 $alignment =~ s/[$gap_symbols]//g;
1209 $ali_length = length($alignment);
1210 last;
1213 my $coord_length = ($a_position < $b_position) ? ($b_position - $a_position + 1)
1214 : ($a_position - $b_position + 1);
1215 ($coord_length == ($ali_length * 3)) ? ($reporttype = 'BLASTX') : ($reporttype = 'TBLASTN');
1217 # Rewind filehandle to its original position to continue parsing
1218 seek $fh, $file_pos, 0;
1220 $self->{'_reporttype'} = $reporttype;
1223 my ( $queryframe, $hitframe );
1224 if ( $reporttype eq 'TBLASTX' ) {
1225 ( $queryframe, $hitframe ) = ( $frame1, $frame2 );
1226 $hitframe =~ s/\/\s*//g;
1228 elsif ( $reporttype eq 'TBLASTN' || $reporttype eq 'PSITBLASTN') {
1229 ( $hitframe, $queryframe ) = ( $frame1, 0 );
1231 elsif ( $reporttype eq 'BLASTX' || $reporttype eq 'RPS-BLAST(BLASTP)') {
1232 ( $queryframe, $hitframe ) = ( $frame1, 0 );
1233 # though NCBI doesn't report it, this is a special BLASTX-like
1234 # RPS-BLAST; should be handled differently
1235 if ($reporttype eq 'RPS-BLAST(BLASTP)') {
1236 $self->element(
1238 'Name' => 'BlastOutput_program',
1239 'Data' => 'RPS-BLAST(BLASTX)'
1244 $self->element(
1246 'Name' => 'Hsp_query-frame',
1247 'Data' => $queryframe
1251 $self->element(
1253 'Name' => 'Hsp_hit-frame',
1254 'Data' => $hitframe
1258 elsif (/^Parameters:/
1259 || /^\s+Database:\s+?/
1260 || /^\s+Subset/
1261 || /^\s*Lambda/
1262 || /^\s*Histogram/
1263 || ( $self->in_element('hsp') && /WARNING|NOTE/ ) )
1266 # Note: Lambda check was necessary to parse
1267 # t/data/ecoli_domains.rpsblast AND to parse bl2seq
1268 $self->debug("blast.pm: found parameters section \n");
1270 $self->in_element('hsp')
1271 && $self->end_element( { 'Name' => 'Hsp' } );
1272 $self->in_element('hit')
1273 && $self->end_element( { 'Name' => 'Hit' } );
1275 # This is for the case when we specify -b 0 (or B=0 for WU-BLAST)
1276 # and still want to construct minimal Hit objects
1277 $self->_cleanup_hits(\@hit_signifs) if scalar(@hit_signifs);
1278 $self->within_element('iteration')
1279 && $self->end_element( { 'Name' => 'Iteration' } );
1281 next if /^\s+Subset/;
1282 my $blast = (/^(\s+Database\:)|(\s*Lambda)/) ? 'ncbi' : 'wublast';
1283 if (/^\s*Histogram/) {
1284 $blast = 'btk';
1287 my $last = '';
1289 # default is that gaps are allowed
1290 $self->element(
1292 'Name' => 'Parameters_allowgaps',
1293 'Data' => 'yes'
1296 while ( defined( $_ = $self->_readline ) ) {
1297 # If Lambda/Kappa/Entropy numbers appear first at this point,
1298 # pushback and add the header line to process it correctly
1299 if (/^\s+[\d+\.]+\s+[\d+\.]+\s+[\d+\.]/ and $last eq '') {
1300 $self->_pushback($_);
1301 $self->_pushback("Lambda K H\n");
1302 next;
1304 elsif (
1305 /^((?:\S+)?BLAST[NPX]?)\s+(.+)$/i # NCBI BLAST, PSIBLAST
1306 # RPSBLAST, MEGABLAST
1307 || /^(P?GENEWISE|HFRAME|SWN|TSWN)\s+(.+)/i #Paracel BTK
1310 $self->_pushback($_);
1312 # let's handle this in the loop
1313 last;
1315 elsif (/^Query=/) {
1316 $self->_pushback($_);
1317 $self->_pushback($reportline) if $reportline;
1318 last PARSER;
1321 # here is where difference between wublast and ncbiblast
1322 # is better handled by different logic
1323 if ( /Number of Sequences:\s+([\d\,]+)/i
1324 || /of sequences in database:\s+(\-?[\d,]+)/i )
1326 my $c = $1;
1327 $c =~ s/\,//g;
1328 $self->element(
1330 'Name' => 'Statistics_db-len',
1331 'Data' => $c
1335 elsif (/letters in database:\s+(\-?[\d,]+)/i) {
1336 my $s = $1;
1337 $s =~ s/,//g;
1338 $self->element(
1340 'Name' => 'Statistics_db-let',
1341 'Data' => $s
1345 elsif ( $blast eq 'btk' ) {
1346 next;
1348 elsif ( $blast eq 'wublast' ) {
1350 # warn($_);
1351 if (/E=(\S+)/) {
1352 $self->element(
1354 'Name' => 'Parameters_expect',
1355 'Data' => $1
1359 elsif (/nogaps/) {
1360 $self->element(
1362 'Name' => 'Parameters_allowgaps',
1363 'Data' => 'no'
1367 elsif (/ctxfactor=(\S+)/) {
1368 $self->element(
1370 'Name' => 'Statistics_ctxfactor',
1371 'Data' => $1
1375 elsif (
1376 /(postsw|links|span[12]?|warnings|notes|gi|noseqs|qres|qype)/
1379 $self->element(
1381 'Name' => "Parameters_$1",
1382 'Data' => 'yes'
1386 elsif (/(\S+)=(\S+)/) {
1387 $self->element(
1389 'Name' => "Parameters_$1",
1390 'Data' => $2
1394 elsif ( $last =~ /(Frame|Strand)\s+MatID\s+Matrix name/i ) {
1395 my $firstgapinfo = 1;
1396 my $frame = undef;
1397 while ( defined($_) && !/^\s+$/ ) {
1398 s/^\s+//;
1399 s/\s+$//;
1400 if ( $firstgapinfo
1401 && s/Q=(\d+),R=(\d+)\s+//x )
1403 $firstgapinfo = 0;
1405 $self->element(
1407 'Name' => 'Parameters_gap-open',
1408 'Data' => $1
1411 $self->element(
1413 'Name' => 'Parameters_gap-extend',
1414 'Data' => $2
1417 my @fields = split;
1419 for my $type (
1420 qw(lambda_gapped
1421 kappa_gapped
1422 entropy_gapped)
1425 next if $type eq 'n/a';
1426 if ( !@fields ) {
1427 warn "fields is empty for $type\n";
1428 next;
1430 $self->element(
1432 'Name' =>
1433 "Statistics_frame$frame\_$type",
1434 'Data' => shift @fields
1439 else {
1440 my ( $frameo, $matid, $matrix, @fields ) =
1441 split;
1442 if ( !defined $frame ) {
1444 # keep some sort of default feature I guess
1445 # even though this is sort of wrong
1446 $self->element(
1448 'Name' => 'Parameters_matrix',
1449 'Data' => $matrix
1452 $self->element(
1454 'Name' => 'Statistics_lambda',
1455 'Data' => $fields[0]
1458 $self->element(
1460 'Name' => 'Statistics_kappa',
1461 'Data' => $fields[1]
1464 $self->element(
1466 'Name' => 'Statistics_entropy',
1467 'Data' => $fields[2]
1471 $frame = $frameo;
1472 my $ii = 0;
1473 for my $type (
1474 qw(lambda_used
1475 kappa_used
1476 entropy_used
1477 lambda_computed
1478 kappa_computed
1479 entropy_computed)
1482 my $f = $fields[$ii];
1483 next unless defined $f; # deal with n/a
1484 if ( $f eq 'same' ) {
1485 $f = $fields[ $ii - 3 ];
1487 $ii++;
1488 $self->element(
1490 'Name' =>
1491 "Statistics_frame$frame\_$type",
1492 'Data' => $f
1499 # get the next line
1500 $_ = $self->_readline;
1502 $last = $_;
1504 elsif ( $last =~ /(Frame|Strand)\s+MatID\s+Length/i ) {
1505 my $frame = undef;
1506 while ( defined($_) && !/^\s+/ ) {
1507 s/^\s+//;
1508 s/\s+$//;
1509 my @fields = split;
1510 if ( @fields <= 3 ) {
1511 for my $type (qw(X_gapped E2_gapped S2)) {
1512 last unless @fields;
1513 $self->element(
1515 'Name' =>
1516 "Statistics_frame$frame\_$type",
1517 'Data' => shift @fields
1522 else {
1524 for my $type (
1525 qw(length
1526 efflength
1527 E S W T X E2 S2)
1530 $self->element(
1532 'Name' =>
1533 "Statistics_frame$frame\_$type",
1534 'Data' => shift @fields
1539 $_ = $self->_readline;
1541 $last = $_;
1543 elsif (/(\S+\s+\S+)\s+DFA:\s+(\S+)\s+\((.+)\)/) {
1544 if ( $1 eq 'states in' ) {
1545 $self->element(
1547 'Name' => 'Statistics_DFA_states',
1548 'Data' => "$2 $3"
1552 elsif ( $1 eq 'size of' ) {
1553 $self->element(
1555 'Name' => 'Statistics_DFA_size',
1556 'Data' => "$2 $3"
1561 elsif (
1562 m/^\s+Time to generate neighborhood:\s+
1563 (\S+\s+\S+\s+\S+)/x
1566 $self->element(
1568 'Name' => 'Statistics_neighbortime',
1569 'Data' => $1
1573 elsif (/processors\s+used:\s+(\d+)/) {
1574 $self->element(
1576 'Name' => 'Statistics_noprocessors',
1577 'Data' => $1
1581 elsif (
1582 m/^\s+(\S+)\s+cpu\s+time:\s+ # cputype
1583 (\S+\s+\S+\s+\S+) # cputime
1584 \s+Elapsed:\s+(\S+)/x
1587 my $cputype = lc($1);
1588 $self->element(
1590 'Name' => "Statistics_$cputype\_cputime",
1591 'Data' => $2
1594 $self->element(
1596 'Name' => "Statistics_$cputype\_actualtime",
1597 'Data' => $3
1601 elsif (/^\s+Start:/) {
1602 my ( $junk, $start, $stime, $end, $etime ) =
1603 split( /\s+(Start|End)\:\s+/, $_ );
1604 chomp($stime);
1605 $self->element(
1607 'Name' => 'Statistics_starttime',
1608 'Data' => $stime
1611 chomp($etime);
1612 $self->element(
1614 'Name' => 'Statistics_endtime',
1615 'Data' => $etime
1619 elsif (/^\s+Database:\s+(.+)$/) {
1620 $self->element(
1622 'Name' => 'Parameters_full_dbpath',
1623 'Data' => $1
1628 elsif (/^\s+Posted:\s+(.+)/) {
1629 my $d = $1;
1630 chomp($d);
1631 $self->element(
1633 'Name' => 'Statistics_posted_date',
1634 'Data' => $d
1639 elsif ( $blast eq 'ncbi' ) {
1641 if (m/^Matrix:\s+(.+)\s*$/oxi) {
1642 $self->element(
1644 'Name' => 'Parameters_matrix',
1645 'Data' => $1
1649 elsif (/^Gapped/) {
1650 $gapped_stats = 1;
1652 elsif (/^Lambda/) {
1653 $_ = $self->_readline;
1654 s/^\s+//;
1655 my ( $lambda, $kappa, $entropy ) = split;
1656 if ($gapped_stats) {
1657 $self->element(
1659 'Name' => "Statistics_gapped_lambda",
1660 'Data' => $lambda
1663 $self->element(
1665 'Name' => "Statistics_gapped_kappa",
1666 'Data' => $kappa
1669 $self->element(
1671 'Name' => "Statistics_gapped_entropy",
1672 'Data' => $entropy
1676 else {
1677 $self->element(
1679 'Name' => "Statistics_lambda",
1680 'Data' => $lambda
1683 $self->element(
1685 'Name' => "Statistics_kappa",
1686 'Data' => $kappa
1689 $self->element(
1691 'Name' => "Statistics_entropy",
1692 'Data' => $entropy
1697 elsif (m/effective\s+search\s+space\s+used:\s+(\d+)/oxi) {
1698 $self->element(
1700 'Name' => 'Statistics_eff-spaceused',
1701 'Data' => $1
1705 elsif (m/effective\s+search\s+space:\s+(\d+)/oxi) {
1706 $self->element(
1708 'Name' => 'Statistics_eff-space',
1709 'Data' => $1
1713 elsif (
1714 m/Gap\s+Penalties:\s+Existence:\s+(\d+)\,
1715 \s+Extension:\s+(\d+)/ox
1718 $self->element(
1720 'Name' => 'Parameters_gap-open',
1721 'Data' => $1
1724 $self->element(
1726 'Name' => 'Parameters_gap-extend',
1727 'Data' => $2
1731 elsif (/effective\s+HSP\s+length:\s+(\d+)/) {
1732 $self->element(
1734 'Name' => 'Statistics_hsp-len',
1735 'Data' => $1
1739 elsif (/Number\s+of\s+HSP's\s+better\s+than\s+(\S+)\s+without\s+gapping:\s+(\d+)/) {
1740 $self->element(
1742 'Name' => 'Statistics_number_of_hsps_better_than_expect_value_cutoff_without_gapping',
1743 'Data' => $2
1747 elsif (/Number\s+of\s+HSP's\s+gapped:\s+(\d+)/) {
1748 $self->element(
1750 'Name' => 'Statistics_number_of_hsps_gapped',
1751 'Data' => $1
1755 elsif (/Number\s+of\s+HSP's\s+successfully\s+gapped:\s+(\d+)/) {
1756 $self->element(
1758 'Name' => 'Statistics_number_of_hsps_successfully_gapped',
1759 'Data' => $1
1763 elsif (/Length\s+adjustment:\s+(\d+)/) {
1764 $self->element(
1766 'Name' => 'Statistics_length_adjustment',
1767 'Data' => $1
1771 elsif (/effective\s+length\s+of\s+query:\s+([\d\,]+)/i) {
1772 my $c = $1;
1773 $c =~ s/\,//g;
1774 $self->element(
1776 'Name' => 'Statistics_query-len',
1777 'Data' => $c
1781 elsif (/effective\s+length\s+of\s+database:\s+([\d\,]+)/i) {
1782 my $c = $1;
1783 $c =~ s/\,//g;
1784 $self->element(
1786 'Name' => 'Statistics_eff-dblen',
1787 'Data' => $c
1791 elsif (
1792 /^(T|A|X1|X2|X3|S1|S2):\s+(\d+(\.\d+)?)\s+(?:\(\s*(\d+\.\d+) bits\))?/
1795 my $v = $2;
1796 chomp($v);
1797 $self->element(
1799 'Name' => "Statistics_$1",
1800 'Data' => $v
1803 if ( defined $4 ) {
1804 $self->element(
1806 'Name' => "Statistics_$1_bits",
1807 'Data' => $4
1812 elsif (
1813 m/frameshift\s+window\,
1814 \s+decay\s+const:\s+(\d+)\,\s+([\.\d]+)/x
1817 $self->element(
1819 'Name' => 'Statistics_framewindow',
1820 'Data' => $1
1823 $self->element(
1825 'Name' => 'Statistics_decay',
1826 'Data' => $2
1830 elsif (m/^Number\s+of\s+Hits\s+to\s+DB:\s+(\S+)/ox) {
1831 $self->element(
1833 'Name' => 'Statistics_hit_to_db',
1834 'Data' => $1
1838 elsif (m/^Number\s+of\s+extensions:\s+(\S+)/ox) {
1839 $self->element(
1841 'Name' => 'Statistics_num_extensions',
1842 'Data' => $1
1846 elsif (
1847 m/^Number\s+of\s+successful\s+extensions:\s+
1848 (\S+)/ox
1851 $self->element(
1853 'Name' => 'Statistics_num_suc_extensions',
1854 'Data' => $1
1858 elsif (
1859 m/^Number\s+of\s+sequences\s+better\s+than\s+
1860 (\S+):\s+(\d+)/ox
1863 $self->element(
1865 'Name' => 'Parameters_expect',
1866 'Data' => $1
1869 $self->element(
1871 'Name' => 'Statistics_seqs_better_than_cutoff',
1872 'Data' => $2
1876 elsif (/^\s+Posted\s+date:\s+(.+)/) {
1877 my $d = $1;
1878 chomp($d);
1879 $self->element(
1881 'Name' => 'Statistics_posted_date',
1882 'Data' => $d
1886 elsif ( !/^\s+$/ ) {
1887 #$self->debug( "unmatched stat $_");
1890 $last = $_;
1892 } elsif ( $self->in_element('hsp') ) {
1893 $self->debug("blast.pm: Processing HSP\n");
1894 # let's read 3 lines at a time;
1895 # bl2seq hackiness... Not sure I like
1896 $self->{'_reporttype'} ||= $DEFAULTREPORTTYPE;
1897 my %data = (
1898 'Query' => '',
1899 'Mid' => '',
1900 'Hit' => ''
1902 my $len;
1903 for ( my $i = 0 ; defined($_) && $i < 3 ; $i++ ) {
1904 # $self->debug("$i: $_") if $v;
1905 if ( ( $i == 0 && /^\s+$/) ||
1906 /^\s*(?:Lambda|Minus|Plus|Score)/i
1908 $self->_pushback($_) if defined $_;
1909 $self->end_element( { 'Name' => 'Hsp' } );
1910 last;
1912 chomp;
1913 if (/^((Query|Sbjct):?\s+(\-?\d+)?\s*)(\S+)\s+(\-?\d+)?/) {
1914 my ( $full, $type, $start, $str, $end ) =
1915 ( $1, $2, $3, $4, $5 );
1917 if ( $str eq '-' ) {
1918 $i = 3 if $type eq 'Sbjct';
1920 else {
1921 $data{$type} = $str;
1923 $len = length($full);
1924 $self->{"\_$type"}->{'begin'} = $start
1925 unless $self->{"_$type"}->{'begin'};
1926 $self->{"\_$type"}->{'end'} = $end;
1927 } elsif (/^((Query|Sbjct):?\s+(\-?0+)\s*)/) {
1928 # Bug from NCBI's BLAST: unaligned output
1929 $_ = $self->_readline() for 0..1;
1930 last;
1931 } else {
1932 $self->throw("no data for midline $_")
1933 unless ( defined $_ && defined $len );
1934 $data{'Mid'} = substr( $_, $len );
1936 $_ = $self->_readline();
1938 $self->characters(
1940 'Name' => 'Hsp_qseq',
1941 'Data' => $data{'Query'}
1944 $self->characters(
1946 'Name' => 'Hsp_hseq',
1947 'Data' => $data{'Sbjct'}
1950 $self->characters(
1952 'Name' => 'Hsp_midline',
1953 'Data' => $data{'Mid'}
1957 else {
1958 #$self->debug("blast.pm: unrecognized line $_");
1962 $self->debug("blast.pm: End of BlastOutput\n");
1963 if ( $self->{'_seentop'} ) {
1964 $self->within_element('hsp')
1965 && $self->end_element( { 'Name' => 'Hsp' } );
1966 $self->within_element('hit')
1967 && $self->end_element( { 'Name' => 'Hit' } );
1968 # cleanup extra hits
1969 $self->_cleanup_hits(\@hit_signifs) if scalar(@hit_signifs);
1970 $self->within_element('iteration')
1971 && $self->end_element( { 'Name' => 'Iteration' } );
1972 if ($bl2seq_fix) {
1973 $self->element(
1975 'Name' => 'BlastOutput_program',
1976 'Data' => $reporttype
1980 $self->end_element( { 'Name' => 'BlastOutput' } );
1982 return $self->end_document();
1985 # Private method for internal use only.
1986 sub _start_blastoutput {
1987 my $self = shift;
1988 $self->start_element( { 'Name' => 'BlastOutput' } );
1989 $self->{'_seentop'} = 1;
1990 $self->{'_result_count'}++;
1991 $self->{'_handler_rc'} = undef;
1994 =head2 _will_handle
1996 Title : _will_handle
1997 Usage : Private method. For internal use only.
1998 if( $self->_will_handle($type) ) { ... }
1999 Function: Provides an optimized way to check whether or not an element of a
2000 given type is to be handled.
2001 Returns : Reference to EventHandler object if the element type is to be handled.
2002 undef if the element type is not to be handled.
2003 Args : string containing type of element.
2005 Optimizations:
2007 =over 2
2009 =item 1
2011 Using the cached pointer to the EventHandler to minimize repeated
2012 lookups.
2014 =item 2
2016 Caching the will_handle status for each type that is encountered so
2017 that it only need be checked by calling
2018 handler-E<gt>will_handle($type) once.
2020 =back
2022 This does not lead to a major savings by itself (only 5-10%). In
2023 combination with other optimizations, or for large parse jobs, the
2024 savings good be significant.
2026 To test against the unoptimized version, remove the parentheses from
2027 around the third term in the ternary " ? : " operator and add two
2028 calls to $self-E<gt>_eventHandler().
2030 =cut
2032 sub _will_handle {
2033 my ( $self, $type ) = @_;
2034 my $handler = $self->{'_handler_cache'};
2035 my $will_handle =
2036 defined( $self->{'_will_handle_cache'}->{$type} )
2037 ? $self->{'_will_handle_cache'}->{$type}
2038 : ( $self->{'_will_handle_cache'}->{$type} =
2039 $handler->will_handle($type) );
2041 return $will_handle ? $handler : undef;
2044 =head2 start_element
2046 Title : start_element
2047 Usage : $eventgenerator->start_element
2048 Function: Handles a start element event
2049 Returns : none
2050 Args : hashref with at least 2 keys 'Data' and 'Name'
2052 =cut
2054 sub start_element {
2055 my ( $self, $data ) = @_;
2057 # we currently don't care about attributes
2058 my $nm = $data->{'Name'};
2059 my $type = $MODEMAP{$nm};
2060 if ($type) {
2061 my $handler = $self->_will_handle($type);
2062 if ($handler) {
2063 my $func = sprintf( "start_%s", lc $type );
2064 $self->{'_handler_rc'} = $handler->$func( $data->{'Attributes'} );
2066 #else {
2067 #$self->debug( # changed 4/29/2006 to play nice with other event handlers
2068 # "Bio::SearchIO::InternalParserError ".
2069 # "\nCan't handle elements of type \'$type.\'"
2072 unshift @{ $self->{'_elements'} }, $type;
2073 if ( $type eq 'result' ) {
2074 $self->{'_values'} = {};
2075 $self->{'_result'} = undef;
2076 } else {
2077 # cleanup some things
2078 if ( defined $self->{'_values'} ) {
2079 foreach my $k (
2080 grep { /^\U$type\-/ }
2081 keys %{ $self->{'_values'} }
2084 delete $self->{'_values'}->{$k};
2091 =head2 end_element
2093 Title : end_element
2094 Usage : $eventgenerator->end_element
2095 Function: Handles an end element event
2096 Returns : hashref with an element's worth of data
2097 Args : hashref with at least 2 keys 'Data' and 'Name'
2100 =cut
2102 sub end_element {
2103 my ( $self, $data ) = @_;
2105 my $nm = $data->{'Name'};
2106 my $type;
2107 my $rc;
2108 # cache these (TODO: we should probably cache all cross-report data)
2109 if ( $nm eq 'BlastOutput_program' ) {
2110 if ( $self->{'_last_data'} =~ /(t?blast[npx])/i ) {
2111 $self->{'_reporttype'} = uc $1;
2113 $self->{'_reporttype'} ||= $DEFAULTREPORTTYPE;
2116 if ( $nm eq 'BlastOutput_version' ) {
2117 $self->{'_reportversion'} = $self->{'_last_data'};
2120 # Hsps are sort of weird, in that they end when another
2121 # object begins so have to detect this in end_element for now
2122 if ( $nm eq 'Hsp' ) {
2123 foreach (qw(Hsp_qseq Hsp_midline Hsp_hseq Hsp_features)) {
2124 $self->element(
2126 'Name' => $_,
2127 'Data' => $self->{'_last_hspdata'}->{$_}
2129 ) if defined $self->{'_last_hspdata'}->{$_};
2131 $self->{'_last_hspdata'} = {};
2132 $self->element(
2134 'Name' => 'Hsp_query-from',
2135 'Data' => $self->{'_Query'}->{'begin'}
2138 $self->element(
2140 'Name' => 'Hsp_query-to',
2141 'Data' => $self->{'_Query'}->{'end'}
2145 $self->element(
2147 'Name' => 'Hsp_hit-from',
2148 'Data' => $self->{'_Sbjct'}->{'begin'}
2151 $self->element(
2153 'Name' => 'Hsp_hit-to',
2154 'Data' => $self->{'_Sbjct'}->{'end'}
2158 # } elsif( $nm eq 'Iteration' ) {
2159 # Nothing special needs to be done here.
2161 if ( $type = $MODEMAP{$nm} ) {
2162 my $handler = $self->_will_handle($type);
2163 if ($handler) {
2164 my $func = sprintf( "end_%s", lc $type );
2165 $rc = $handler->$func( $self->{'_reporttype'}, $self->{'_values'} );
2167 shift @{ $self->{'_elements'} };
2170 elsif ( $MAPPING{$nm} ) {
2171 if ( ref( $MAPPING{$nm} ) =~ /hash/i ) {
2173 # this is where we shove in the data from the
2174 # hashref info about params or statistics
2175 my $key = ( keys %{ $MAPPING{$nm} } )[0];
2176 $self->{'_values'}->{$key}->{ $MAPPING{$nm}->{$key} } =
2177 $self->{'_last_data'};
2179 else {
2180 $self->{'_values'}->{ $MAPPING{$nm} } = $self->{'_last_data'};
2183 else {
2184 #$self->debug("blast.pm: unknown nm $nm, ignoring\n");
2186 $self->{'_last_data'} = ''; # remove read data if we are at
2187 # end of an element
2188 $self->{'_result'} = $rc if ( defined $type && $type eq 'result' );
2189 $self->{'_seen_hsp_features'} = 0;
2190 return $rc;
2193 =head2 element
2195 Title : element
2196 Usage : $eventhandler->element({'Name' => $name, 'Data' => $str});
2197 Function: Convenience method that calls start_element, characters, end_element
2198 Returns : none
2199 Args : Hash ref with the keys 'Name' and 'Data'
2202 =cut
2204 sub element {
2205 my ( $self, $data ) = @_;
2206 # Note that start element isn't needed for character data
2207 # Not too SAX-y, though
2208 #$self->start_element($data);
2209 $self->characters($data);
2210 $self->end_element($data);
2213 =head2 characters
2215 Title : characters
2216 Usage : $eventgenerator->characters($str)
2217 Function: Send a character events
2218 Returns : none
2219 Args : string
2222 =cut
2224 sub characters {
2225 my ( $self, $data ) = @_;
2226 if ( $self->in_element('hsp')
2227 && $data->{'Name'} =~ /^Hsp\_(qseq|hseq|midline)$/ )
2229 $self->{'_last_hspdata'}->{ $data->{'Name'} } .= $data->{'Data'}
2230 if defined $data->{'Data'};
2232 return unless ( defined $data->{'Data'} && $data->{'Data'} !~ /^\s+$/ );
2233 $self->{'_last_data'} = $data->{'Data'};
2236 =head2 within_element
2238 Title : within_element
2239 Usage : if( $eventgenerator->within_element($element) ) {}
2240 Function: Test if we are within a particular element
2241 This is different than 'in' because within can be tested
2242 for a whole block.
2243 Returns : boolean
2244 Args : string element name
2246 See Also: L<in_element>
2248 =cut
2250 sub within_element {
2251 my ( $self, $name ) = @_;
2252 return 0
2253 if ( !defined $name && !defined $self->{'_elements'}
2254 || scalar @{ $self->{'_elements'} } == 0 );
2255 foreach ( @{ $self->{'_elements'} } ) {
2256 if ( $_ eq $name ) {
2257 return 1;
2260 return 0;
2263 =head2 in_element
2265 Title : in_element
2266 Usage : if( $eventgenerator->in_element($element) ) {}
2267 Function: Test if we are in a particular element
2268 This is different than 'within_element' because within
2269 can be tested for a whole block.
2270 Returns : boolean
2271 Args : string element name
2273 See Also: L<within_element>
2275 =cut
2277 sub in_element {
2278 my ( $self, $name ) = @_;
2279 return 0 if !defined $self->{'_elements'}->[0];
2280 return ( $self->{'_elements'}->[0] eq $name );
2283 =head2 start_document
2285 Title : start_document
2286 Usage : $eventgenerator->start_document
2287 Function: Handle a start document event
2288 Returns : none
2289 Args : none
2292 =cut
2294 sub start_document {
2295 my ($self) = @_;
2296 $self->{'_lasttype'} = '';
2297 $self->{'_values'} = {};
2298 $self->{'_result'} = undef;
2299 $self->{'_elements'} = [];
2302 =head2 end_document
2304 Title : end_document
2305 Usage : $eventgenerator->end_document
2306 Function: Handles an end document event
2307 Returns : Bio::Search::Result::ResultI object
2308 Args : none
2311 =cut
2313 sub end_document {
2314 my ( $self, @args ) = @_;
2316 #$self->debug("blast.pm: end_document\n");
2317 return $self->{'_result'};
2320 sub write_result {
2321 my ( $self, $blast, @args ) = @_;
2323 if ( not defined( $self->writer ) ) {
2324 $self->warn("Writer not defined. Using a $DEFAULT_BLAST_WRITER_CLASS");
2325 $self->writer( $DEFAULT_BLAST_WRITER_CLASS->new() );
2327 $self->SUPER::write_result( $blast, @args );
2330 sub result_count {
2331 my $self = shift;
2332 return $self->{'_result_count'};
2335 sub report_count { shift->result_count }
2337 =head2 inclusion_threshold
2339 Title : inclusion_threshold
2340 Usage : my $incl_thresh = $isreb->inclusion_threshold;
2341 : $isreb->inclusion_threshold(1e-5);
2342 Function: Get/Set the e-value threshold for inclusion in the PSI-BLAST
2343 score matrix model (blastpgp) that was used for generating the reports
2344 being parsed.
2345 Returns : number (real)
2346 Default value: $Bio::SearchIO::IteratedSearchResultEventBuilder::DEFAULT_INCLUSION_THRESHOLD
2347 Args : number (real) (e.g., 0.0001 or 1e-4 )
2349 =cut
2351 =head2 max_significance
2353 Usage : $obj->max_significance();
2354 Purpose : Set/Get the P or Expect value used as significance screening cutoff.
2355 This is the value of the -signif parameter supplied to new().
2356 Hits with P or E-value above this are skipped.
2357 Returns : Scientific notation number with this format: 1.0e-05.
2358 Argument : Scientific notation number or float (when setting)
2359 Comments : Screening of significant hits uses the data provided on the
2360 : description line. For NCBI BLAST1 and WU-BLAST, this data
2361 : is P-value. for NCBI BLAST2 it is an Expect value.
2363 =cut
2365 =head2 signif
2367 Synonym for L<max_significance()|max_significance>
2369 =cut
2371 =head2 min_score
2373 Usage : $obj->min_score();
2374 Purpose : Set/Get the Blast score used as screening cutoff.
2375 This is the value of the -score parameter supplied to new().
2376 Hits with scores below this are skipped.
2377 Returns : Integer or scientific notation number.
2378 Argument : Integer or scientific notation number (when setting)
2379 Comments : Screening of significant hits uses the data provided on the
2380 : description line.
2382 =cut
2384 =head2 min_query_length
2386 Usage : $obj->min_query_length();
2387 Purpose : Gets the query sequence length used as screening criteria.
2388 This is the value of the -min_query_len parameter supplied to new().
2389 Hits with sequence length below this are skipped.
2390 Returns : Integer
2391 Argument : n/a
2393 =cut
2395 =head2 best_hit_only
2397 Title : best_hit_only
2398 Usage : print "only getting best hit.\n" if $obj->best_hit_only;
2399 Purpose : Set/Get the indicator for whether or not to process only
2400 : the best BlastHit.
2401 Returns : Boolean (1 | 0)
2402 Argument : Boolean (1 | 0) (when setting)
2404 =cut
2406 =head2 check_all_hits
2408 Title : check_all_hits
2409 Usage : print "checking all hits.\n" if $obj->check_all_hits;
2410 Purpose : Set/Get the indicator for whether or not to process all hits.
2411 : If false, the parser will stop processing hits after the
2412 : the first non-significance hit or the first hit that fails
2413 : any hit filter.
2414 Returns : Boolean (1 | 0)
2415 Argument : Boolean (1 | 0) (when setting)
2417 =cut
2419 # general private method used to make minimal hits from leftover
2420 # data in the hit table
2422 sub _cleanup_hits {
2423 my ($self, $hits) = @_;
2424 while ( my $v = shift @{ $hits }) {
2425 next unless defined $v;
2426 $self->start_element( { 'Name' => 'Hit' } );
2427 my $id = $v->[2];
2428 my $desc = $v->[3];
2429 $self->element(
2431 'Name' => 'Hit_id',
2432 'Data' => $id
2435 my ($gi, $acc, $version ) = $self->_get_seq_identifiers($id);
2436 $self->element(
2438 'Name' => 'Hit_accession',
2439 'Data' => $acc
2442 if ( defined $v ) {
2443 $self->element(
2445 'Name' => 'Hit_signif',
2446 'Data' => $v->[0]
2449 if (exists $self->{'_wublast'}) {
2450 $self->element(
2452 'Name' => 'Hit_score',
2453 'Data' => $v->[1]
2456 } else {
2457 $self->element(
2459 'Name' => 'Hit_bits',
2460 'Data' => $v->[1]
2466 $self->element(
2468 'Name' => 'Hit_def',
2469 'Data' => $desc
2472 $self->end_element( { 'Name' => 'Hit' } );
2479 __END__
2481 Developer Notes
2482 ---------------
2484 The following information is added in hopes of increasing the
2485 maintainability of this code. It runs the risk of becoming obsolete as
2486 the code gets updated. As always, double check against the actual
2487 source. If you find any discrepencies, please correct them.
2488 [ This documentation added on 3 Jun 2003. ]
2490 The logic is the brainchild of Jason Stajich, documented by Steve
2491 Chervitz. Jason: please check it over and modify as you see fit.
2493 Question:
2494 Elmo wants to know: How does this module unmarshall data from the input stream?
2495 (i.e., how does information from a raw input file get added to
2496 the correct Bioperl object?)
2498 Answer:
2500 This answer is specific to SearchIO::blast, but may apply to other
2501 SearchIO.pm subclasses as well. The following description gives the
2502 basic idea. The actual processing is a little more complex for
2503 certain types of data (HSP, Report Parameters).
2505 You can think of blast::next_result() as faking a SAX XML parser,
2506 making a non-XML document behave like its XML. The overhead to do this
2507 is quite substantial (~650 lines of code instead of ~80 in
2508 blastxml.pm).
2510 0. First, add a key => value pair for the datum of interest to %MAPPING
2511 Example:
2512 'Foo_bar' => 'Foo-bar',
2514 1. next_result() collects the datum of interest from the input stream,
2515 and calls element().
2516 Example:
2517 $self->element({ 'Name' => 'Foo_bar',
2518 'Data' => $foobar});
2520 2. The element() method is a convenience method that calls start_element(),
2521 characters(), and end_element().
2523 3. start_element() checks to see if the event handler can handle a start_xxx(),
2524 where xxx = the 'Name' parameter passed into element(), and calls start_xxx()
2525 if so. Otherwise, start_element() does not do anything.
2527 Data that will have such an event handler are defined in %MODEMAP.
2528 Typically, there are only handler methods for the main parts of
2529 the search result (e.g., Result, Iteration, Hit, HSP),
2530 which have corresponding Bioperl modules. So in this example,
2531 there was an earlier call such as $self->element({'Name'=>'Foo'})
2532 and the Foo_bar datum is meant to ultimately go into a Foo object.
2534 The start_foo() method in the handler will typically do any
2535 data initialization necessary to prepare for creating a new Foo object.
2536 Example: SearchResultEventBuilder::start_result()
2538 4. characters() takes the value of the 'Data' key from the hashref argument in
2539 the elements() call and saves it in a local data member:
2540 Example:
2541 $self->{'_last_data'} = $data->{'Data'};
2543 5. end_element() is like start_element() in that it does the check for whether
2544 the event handler can handle end_xxx() and if so, calls it, passing in
2545 the data collected from all of the characters() calls that occurred
2546 since the start_xxx() call.
2548 If there isn't any special handler for the data type specified by 'Name',
2549 end_element() will place the data saved by characters() into another
2550 local data member that saves it in a hash with a key defined by %MAPPING.
2551 Example:
2552 $nm = $data->{'Name'};
2553 $self->{'_values'}->{$MAPPING{$nm}} = $self->{'_last_data'};
2555 In this case, $MAPPING{$nm} is 'Foo-bar'.
2557 end_element() finishes by resetting the local data member used by
2558 characters(). (i.e., $self->{'_last_data'} = '';)
2560 6. When the next_result() method encounters the end of the Foo element in the
2561 input stream. It will invoke $self->end_element({'Name'=>'Foo'}).
2562 end_element() then sends all of the data in the $self->{'_values'} hash.
2563 Note that $self->{'_values'} is cleaned out during start_element(),
2564 keeping it at a resonable size.
2566 In the event handler, the end_foo() method takes the hash from end_element()
2567 and creates a new hash containing the same data, but having keys lacking
2568 the 'Foo' prefix (e.g., 'Foo-bar' becomes '-bar'). The handler's end_foo()
2569 method then creates the Foo object, passing in this new hash as an argument.
2570 Example: SearchResultEventBuilder::end_result()
2572 7. Objects created from the data in the search result are managed by
2573 the event handler which adds them to a ResultI object (using API methods
2574 for that object). The ResultI object gets passed back to
2575 SearchIO::end_element() when it calls end_result().
2577 The ResultI object is then saved in an internal data member of the
2578 SearchIO object, which returns it at the end of next_result()
2579 by calling end_document().
2581 (Technical Note: All objects created by end_xxx() methods in the event
2582 handler are returned to SearchIO::end_element(), but the SearchIO object
2583 only cares about the ResultI objects.)
2585 (Sesame Street aficionados note: This answer was NOT given by Mr. Noodle ;-P)