Hmmer3: Final fix for Bug #3369. Eliminated the removal of
[bioperl-live.git] / Bio / SearchIO / blast.pm
blobb06df23ff02bb11561a95bccab1833d2271f2408
2 # BioPerl module for Bio::SearchIO::blast
4 # Please direct questions and support issues to <bioperl-l@bioperl.org>
6 # Cared for by Jason Stajich <jason@bioperl.org>
8 # Copyright Jason Stajich
10 # You may distribute this module under the same terms as perl itself
12 # POD documentation - main docs before the code
14 # 20030409 - sac
15 # PSI-BLAST full parsing support. Rollout of new
16 # model which will remove Steve's old psiblast driver
17 # 20030424 - jason
18 # Megablast parsing fix as reported by Neil Saunders
19 # 20030427 - jason
20 # Support bl2seq parsing
21 # 20031124 - jason
22 # Parse more blast statistics, lambda, entropy, etc
23 # from WU-BLAST in frame-specific manner
24 # 20060216 - cjf - fixed blast parsing for BLAST v2.2.13 output
25 # 20071104 - dmessina - added support for WUBLAST -echofilter
26 # 20071121 - cjf - fixed several bugs (bugs 2391, 2399, 2409)
28 =head1 NAME
30 Bio::SearchIO::blast - Event generator for event based parsing of
31 blast reports
33 =head1 SYNOPSIS
35 # Do not use this object directly - it is used as part of the
36 # Bio::SearchIO system.
38 use Bio::SearchIO;
39 my $searchio = Bio::SearchIO->new(-format => 'blast',
40 -file => 't/data/ecolitst.bls');
41 while( my $result = $searchio->next_result ) {
42 while( my $hit = $result->next_hit ) {
43 while( my $hsp = $hit->next_hsp ) {
44 # ...
49 =head1 DESCRIPTION
51 This object encapsulated the necessary methods for generating events
52 suitable for building Bio::Search objects from a BLAST report file.
53 Read the L<Bio::SearchIO> for more information about how to use this.
55 This driver can parse:
57 =over 4
59 =item *
61 NCBI produced plain text BLAST reports from blastall, this also
62 includes PSIBLAST, PSITBLASTN, RPSBLAST, and bl2seq reports. NCBI XML
63 BLAST output is parsed with the blastxml SearchIO driver
65 =item *
67 WU-BLAST all reports
69 =item *
71 Jim Kent's BLAST-like output from his programs (BLASTZ, BLAT)
73 =item *
75 BLAST-like output from Paracel BTK output
77 =back
79 =head2 bl2seq parsing
81 Since I cannot differentiate between BLASTX and TBLASTN since bl2seq
82 doesn't report the algorithm used - I assume it is BLASTX by default -
83 you can supply the program type with -report_type in the SearchIO
84 constructor i.e.
86 my $parser = Bio::SearchIO->new(-format => 'blast',
87 -file => 'bl2seq.tblastn.report',
88 -report_type => 'tblastn');
90 This only really affects where the frame and strand information are
91 put - they will always be on the $hsp-E<gt>query instead of on the
92 $hsp-E<gt>hit part of the feature pair for blastx and tblastn bl2seq
93 produced reports. Hope that's clear...
95 =head1 FEEDBACK
97 =head2 Mailing Lists
99 User feedback is an integral part of the evolution of this and other
100 Bioperl modules. Send your comments and suggestions preferably to
101 the Bioperl mailing list. Your participation is much appreciated.
103 bioperl-l@bioperl.org - General discussion
104 http://bioperl.org/wiki/Mailing_lists - About the mailing lists
106 =head2 Support
108 Please direct usage questions or support issues to the mailing list:
110 I<bioperl-l@bioperl.org>
112 rather than to the module maintainer directly. Many experienced and
113 reponsive experts will be able look at the problem and quickly
114 address it. Please include a thorough description of the problem
115 with code and data examples if at all possible.
117 =head2 Reporting Bugs
119 Report bugs to the Bioperl bug tracking system to help us keep track
120 of the bugs and their resolution. Bug reports can be submitted via the
121 web:
123 https://redmine.open-bio.org/projects/bioperl/
125 =head1 AUTHOR - Jason Stajich
127 Email Jason Stajich jason-at-bioperl.org
129 =head1 CONTRIBUTORS
131 Steve Chervitz sac-at-bioperl.org
133 =head1 APPENDIX
135 The rest of the documentation details each of the object methods.
136 Internal methods are usually preceded with a _
138 =cut
140 # Let the code begin...'
142 package Bio::SearchIO::blast;
144 use Bio::SearchIO::IteratedSearchResultEventBuilder;
145 use strict;
146 use vars qw(%MAPPING %MODEMAP
147 $DEFAULT_BLAST_WRITER_CLASS
148 $MAX_HSP_OVERLAP
149 $DEFAULT_SIGNIF
150 $DEFAULT_SCORE
151 $DEFAULTREPORTTYPE
155 use base qw(Bio::SearchIO);
156 use Data::Dumper;
158 BEGIN {
160 # mapping of NCBI Blast terms to Bioperl hash keys
161 %MODEMAP = (
162 'BlastOutput' => 'result',
163 'Iteration' => 'iteration',
164 'Hit' => 'hit',
165 'Hsp' => 'hsp'
168 # This should really be done more intelligently, like with
169 # XSLT
171 %MAPPING = (
172 'Hsp_bit-score' => 'HSP-bits',
173 'Hsp_score' => 'HSP-score',
174 'Hsp_evalue' => 'HSP-evalue',
175 'Hsp_n', => 'HSP-n',
176 'Hsp_pvalue' => 'HSP-pvalue',
177 'Hsp_query-from' => 'HSP-query_start',
178 'Hsp_query-to' => 'HSP-query_end',
179 'Hsp_hit-from' => 'HSP-hit_start',
180 'Hsp_hit-to' => 'HSP-hit_end',
181 'Hsp_positive' => 'HSP-conserved',
182 'Hsp_identity' => 'HSP-identical',
183 'Hsp_gaps' => 'HSP-hsp_gaps',
184 'Hsp_hitgaps' => 'HSP-hit_gaps',
185 'Hsp_querygaps' => 'HSP-query_gaps',
186 'Hsp_qseq' => 'HSP-query_seq',
187 'Hsp_hseq' => 'HSP-hit_seq',
188 'Hsp_midline' => 'HSP-homology_seq',
189 'Hsp_align-len' => 'HSP-hsp_length',
190 'Hsp_query-frame' => 'HSP-query_frame',
191 'Hsp_hit-frame' => 'HSP-hit_frame',
192 'Hsp_links' => 'HSP-links',
193 'Hsp_group' => 'HSP-hsp_group',
194 'Hsp_features' => 'HSP-hit_features',
196 'Hit_id' => 'HIT-name',
197 'Hit_len' => 'HIT-length',
198 'Hit_accession' => 'HIT-accession',
199 'Hit_def' => 'HIT-description',
200 'Hit_signif' => 'HIT-significance',
201 # For NCBI blast, the description line contains bits.
202 # For WU-blast, the description line contains score.
203 'Hit_score' => 'HIT-score',
204 'Hit_bits' => 'HIT-bits',
206 'Iteration_iter-num' => 'ITERATION-number',
207 'Iteration_converged' => 'ITERATION-converged',
209 'BlastOutput_program' => 'RESULT-algorithm_name',
210 'BlastOutput_version' => 'RESULT-algorithm_version',
211 'BlastOutput_algorithm-reference' => 'RESULT-algorithm_reference',
212 'BlastOutput_rid' => 'RESULT-rid',
213 'BlastOutput_query-def' => 'RESULT-query_name',
214 'BlastOutput_query-len' => 'RESULT-query_length',
215 'BlastOutput_query-acc' => 'RESULT-query_accession',
216 'BlastOutput_query-gi' => 'RESULT-query_gi',
217 'BlastOutput_querydesc' => 'RESULT-query_description',
218 'BlastOutput_db' => 'RESULT-database_name',
219 'BlastOutput_db-len' => 'RESULT-database_entries',
220 'BlastOutput_db-let' => 'RESULT-database_letters',
221 'BlastOutput_inclusion-threshold' => 'RESULT-inclusion_threshold',
223 'Parameters_matrix' => { 'RESULT-parameters' => 'matrix' },
224 'Parameters_expect' => { 'RESULT-parameters' => 'expect' },
225 'Parameters_include' => { 'RESULT-parameters' => 'include' },
226 'Parameters_sc-match' => { 'RESULT-parameters' => 'match' },
227 'Parameters_sc-mismatch' => { 'RESULT-parameters' => 'mismatch' },
228 'Parameters_gap-open' => { 'RESULT-parameters' => 'gapopen' },
229 'Parameters_gap-extend' => { 'RESULT-parameters' => 'gapext' },
230 'Parameters_filter' => { 'RESULT-parameters' => 'filter' },
231 'Parameters_allowgaps' => { 'RESULT-parameters' => 'allowgaps' },
232 'Parameters_full_dbpath' => { 'RESULT-parameters' => 'full_dbpath' },
233 'Statistics_db-len' => { 'RESULT-statistics' => 'dbentries' },
234 'Statistics_db-let' => { 'RESULT-statistics' => 'dbletters' },
235 'Statistics_hsp-len' =>
236 { 'RESULT-statistics' => 'effective_hsplength' },
237 'Statistics_query-len' => { 'RESULT-statistics' => 'querylength' },
238 'Statistics_eff-space' => { 'RESULT-statistics' => 'effectivespace' },
239 'Statistics_eff-spaceused' =>
240 { 'RESULT-statistics' => 'effectivespaceused' },
241 'Statistics_eff-dblen' =>
242 { 'RESULT-statistics' => 'effectivedblength' },
243 'Statistics_kappa' => { 'RESULT-statistics' => 'kappa' },
244 'Statistics_lambda' => { 'RESULT-statistics' => 'lambda' },
245 'Statistics_entropy' => { 'RESULT-statistics' => 'entropy' },
246 'Statistics_gapped_kappa' => { 'RESULT-statistics' => 'kappa_gapped' },
247 'Statistics_gapped_lambda' =>
248 { 'RESULT-statistics' => 'lambda_gapped' },
249 'Statistics_gapped_entropy' =>
250 { 'RESULT-statistics' => 'entropy_gapped' },
252 'Statistics_framewindow' =>
253 { 'RESULT-statistics' => 'frameshiftwindow' },
254 'Statistics_decay' => { 'RESULT-statistics' => 'decayconst' },
256 'Statistics_hit_to_db' => { 'RESULT-statistics' => 'Hits_to_DB' },
257 'Statistics_num_suc_extensions' =>
258 { 'RESULT-statistics' => 'num_successful_extensions' },
259 'Statistics_length_adjustment' => { 'RESULT-statistics' => 'length_adjustment' },
260 'Statistics_number_of_hsps_better_than_expect_value_cutoff_without_gapping' =>
261 { 'RESULT-statistics' => 'number_of_hsps_better_than_expect_value_cutoff_without_gapping' },
262 'Statistics_number_of_hsps_gapped' => { 'RESULT-statistics' => 'number_of_hsps_gapped' },
263 'Statistics_number_of_hsps_successfully_gapped' => { 'RESULT-statistics' => 'number_of_hsps_successfully_gapped' },
265 # WU-BLAST stats
266 'Statistics_DFA_states' => { 'RESULT-statistics' => 'num_dfa_states' },
267 'Statistics_DFA_size' => { 'RESULT-statistics' => 'dfa_size' },
268 'Statistics_noprocessors' =>
269 { 'RESULT-statistics' => 'no_of_processors' },
270 'Statistics_neighbortime' =>
271 { 'RESULT-statistics' => 'neighborhood_generate_time' },
272 'Statistics_starttime' => { 'RESULT-statistics' => 'start_time' },
273 'Statistics_endtime' => { 'RESULT-statistics' => 'end_time' },
276 # add WU-BLAST Frame-Based Statistics
277 for my $frame ( 0 .. 3 ) {
278 for my $strand ( '+', '-' ) {
279 for my $ind (
280 qw(length efflength E S W T X X_gapped E2
281 E2_gapped S2)
284 $MAPPING{"Statistics_frame$strand$frame\_$ind"} =
285 { 'RESULT-statistics' => "Frame$strand$frame\_$ind" };
287 for my $val (qw(lambda kappa entropy )) {
288 for my $type (qw(used computed gapped)) {
289 my $key = "Statistics_frame$strand$frame\_$val\_$type";
290 my $val =
291 { 'RESULT-statistics' =>
292 "Frame$strand$frame\_$val\_$type" };
293 $MAPPING{$key} = $val;
299 # add Statistics
300 for my $stats (
301 qw(T A X1 X2 X3 S1 S2 X1_bits X2_bits X3_bits
302 S1_bits S2_bits num_extensions
303 num_successful_extensions
304 seqs_better_than_cutoff
305 posted_date
306 search_cputime total_cputime
307 search_actualtime total_actualtime
308 no_of_processors ctxfactor)
311 my $key = "Statistics_$stats";
312 my $val = { 'RESULT-statistics' => $stats };
313 $MAPPING{$key} = $val;
316 # add WU-BLAST Parameters
317 for my $param (
318 qw(span span1 span2 links warnings notes hspsepsmax
319 hspsepqmax topcomboN topcomboE postsw cpus wordmask
320 filter sort_by_pvalue sort_by_count sort_by_highscore
321 sort_by_totalscore sort_by_subjectlength noseqs gi qtype
322 qres V B Z Y M N)
325 my $key = "Parameters_$param";
326 my $val = { 'RESULT-parameters' => $param };
327 $MAPPING{$key} = $val;
330 $DEFAULT_BLAST_WRITER_CLASS = 'Bio::SearchIO::Writer::HitTableWriter';
331 $MAX_HSP_OVERLAP = 2; # Used when tiling multiple HSPs.
332 $DEFAULTREPORTTYPE = 'BLASTP'; # for bl2seq
335 =head2 new
337 Title : new
338 Usage : my $obj = Bio::SearchIO::blast->new(%args);
339 Function: Builds a new Bio::SearchIO::blast object
340 Returns : Bio::SearchIO::blast
341 Args : Key-value pairs:
342 -fh/-file => filehandle/filename to BLAST file
343 -format => 'blast'
344 -report_type => 'blastx', 'tblastn', etc -- only for bl2seq
345 reports when you want to distinguish between
346 tblastn and blastx reports (this only controls
347 where the frame information is put - on the query
348 or subject object.
349 -inclusion_threshold => e-value threshold for inclusion in the
350 PSI-BLAST score matrix model (blastpgp)
351 -signif => float or scientific notation number to be used
352 as a P- or Expect value cutoff
353 -score => integer or scientific notation number to be used
354 as a blast score value cutoff
355 -bits => integer or scientific notation number to be used
356 as a bit score value cutoff
357 -hit_filter => reference to a function to be used for
358 filtering hits based on arbitrary criteria.
359 All hits of each BLAST report must satisfy
360 this criteria to be retained.
361 If a hit fails this test, it is ignored.
362 This function should take a
363 Bio::Search::Hit::BlastHit.pm object as its first
364 argument and return true
365 if the hit should be retained.
366 Sample filter function:
367 -hit_filter => sub { $hit = shift;
368 $hit->gaps == 0; },
369 (Note: -filt_func is synonymous with -hit_filter)
370 -overlap => integer. The amount of overlap to permit between
371 adjacent HSPs when tiling HSPs. A reasonable value is 2.
372 Default = $Bio::SearchIO::blast::MAX_HSP_OVERLAP.
374 The following criteria are not yet supported:
375 (these are probably best applied within this module rather than in the
376 event handler since they would permit the parser to take some shortcuts.)
378 -check_all_hits => boolean. Check all hits for significance against
379 significance criteria. Default = false.
380 If false, stops processing hits after the first
381 non-significant hit or the first hit that fails
382 the hit_filter call. This speeds parsing,
383 taking advantage of the fact that the hits
384 are processed in the order they appear in the report.
385 -min_query_len => integer to be used as a minimum for query sequence length.
386 Reports with query sequences below this length will
387 not be processed. Default = no minimum length.
388 -best => boolean. Only process the best hit of each report;
389 default = false.
391 =cut
393 sub _initialize {
394 my ( $self, @args ) = @_;
395 $self->SUPER::_initialize(@args);
397 # Blast reports require a specialized version of the SREB due to the
398 # possibility of iterations (PSI-BLAST). Forwarding all arguments to it. An
399 # issue here is that we want to set new default object factories if none are
400 # supplied.
402 my $handler = Bio::SearchIO::IteratedSearchResultEventBuilder->new(@args);
403 $self->attach_EventHandler($handler);
405 # 2006-04-26 move this to the attach_handler function in this module so we
406 # can really reset the handler
407 # Optimization: caching
408 # the EventHandler since it is used a lot during the parse.
410 # $self->{'_handler_cache'} = $handler;
412 my ($rpttype ) = $self->_rearrange(
415 REPORT_TYPE)
417 @args
419 defined $rpttype && ( $self->{'_reporttype'} = $rpttype );
422 sub attach_EventHandler {
423 my ($self,$handler) = @_;
425 $self->SUPER::attach_EventHandler($handler);
427 # Optimization: caching the EventHandler since it is used a lot
428 # during the parse.
430 $self->{'_handler_cache'} = $handler;
431 return;
434 =head2 next_result
436 Title : next_result
437 Usage : my $hit = $searchio->next_result;
438 Function: Returns the next Result from a search
439 Returns : Bio::Search::Result::ResultI object
440 Args : none
442 =cut
444 sub next_result {
445 my ($self) = @_;
446 my $v = $self->verbose;
447 my $data = '';
448 my $flavor = '';
449 $self->{'_seentop'} = 0; # start next report at top
451 my ( $reporttype, $seenquery, $reportline, $reportversion );
452 my ( $seeniteration, $found_again );
453 my $incl_threshold = $self->inclusion_threshold;
454 my $bl2seq_fix;
455 $self->start_document(); # let the fun begin...
456 my (@hit_signifs);
457 my $gapped_stats = 0; # for switching between gapped/ungapped
458 # lambda, K, H
459 local $_ = "\n"; #consistency
460 PARSER:
461 while ( defined( $_ = $self->_readline ) ) {
462 next if (/^\s+$/); # skip empty lines
463 next if (/CPU time:/);
464 next if (/^>\s*$/);
465 next if (/[*]+\s+No hits found\s+[*]+/);
466 if (
467 /^((?:\S+?)?BLAST[NPX]?)\s+(.+)$/i # NCBI BLAST, PSIBLAST
468 # RPSBLAST, MEGABLAST
469 || /^(P?GENEWISE|HFRAME|SWN|TSWN)\s+(.+)/i #Paracel BTK
472 ($reporttype, $reportversion) = ($1, $2);
473 # need to keep track of whether this is WU-BLAST
474 if ($reportversion && $reportversion =~ m{WashU$}) {
475 $self->{'_wublast'}++;
477 $self->debug("blast.pm: Start of new report: $reporttype, $reportversion\n");
478 if ( $self->{'_seentop'} ) {
479 # This handles multi-result input streams
480 $self->_pushback($_);
481 last PARSER;
483 $self->_start_blastoutput;
484 if ($reporttype =~ /RPS-BLAST/) {
485 $reporttype .= '(BLASTP)'; # default RPS-BLAST type
487 $reportline = $_; # to fix the fact that RPS-BLAST output is wrong
488 $self->element(
490 'Name' => 'BlastOutput_program',
491 'Data' => $reporttype
495 $self->element(
497 'Name' => 'BlastOutput_version',
498 'Data' => $reportversion
501 $self->element(
503 'Name' => 'BlastOutput_inclusion-threshold',
504 'Data' => $incl_threshold
508 # parse the BLAST algorithm reference
509 elsif(/^Reference:\s+(.*)$/) {
510 # want to preserve newlines for the BLAST algorithm reference
511 my $algorithm_reference = "$1\n";
512 $_ = $self->_readline;
513 # while the current line, does not match an empty line, a RID:, a
514 # Database:, or a query definition line (Query=) we are still
515 # looking at the algorithm_reference, append it to what we parsed so
516 # far
517 while($_ !~ /^$/ && $_ !~ /^RID:/ && $_ !~ /^Database:/ && $_ !~ /^Query=/) {
518 $algorithm_reference .= "$_";
519 $_ = $self->_readline;
521 # if we exited the while loop, we saw an empty line, a RID:, or a
522 # Database:, so push it back
523 $self->_pushback($_);
524 $self->element(
526 'Name' => 'BlastOutput_algorithm-reference',
527 'Data' => $algorithm_reference
531 # parse BLAST RID (Request ID)
532 elsif(/^RID:\s+(.*)$/) {
533 my $rid = $1;
534 $self->element(
536 'Name' => 'BlastOutput_rid',
537 'Data' => $rid
541 # added Windows workaround for bug 1985
542 elsif (/^(Searching|Results from round)/) {
543 next unless $1 =~ /Results from round/;
544 $self->debug("blast.pm: Possible psi blast iterations found...\n");
546 $self->in_element('hsp')
547 && $self->end_element( { 'Name' => 'Hsp' } );
548 $self->in_element('hit')
549 && $self->end_element( { 'Name' => 'Hit' } );
550 if ( defined $seeniteration ) {
551 $self->within_element('iteration')
552 && $self->end_element( { 'Name' => 'Iteration' } );
553 $self->start_element( { 'Name' => 'Iteration' } );
555 else {
556 $self->start_element( { 'Name' => 'Iteration' } );
558 $seeniteration = 1;
560 elsif (/^Query=\s*(.*)$/) {
561 my $q = $1;
562 $self->debug("blast.pm: Query= found...$_\n");
563 my $size = 0;
564 if ( defined $seenquery ) {
565 $self->_pushback($_);
566 $self->_pushback($reportline) if $reportline;
567 last PARSER;
569 if ( !defined $reporttype ) {
570 $self->_start_blastoutput;
571 if ( defined $seeniteration ) {
572 $self->in_element('iteration')
573 && $self->end_element( { 'Name' => 'Iteration' } );
574 $self->start_element( { 'Name' => 'Iteration' } );
576 else {
577 $self->start_element( { 'Name' => 'Iteration' } );
579 $seeniteration = 1;
581 $seenquery = $q;
582 $_ = $self->_readline;
583 while ( defined($_) ) {
584 if (/^Database:/) {
585 $self->_pushback($_);
586 last;
588 # below line fixes length issue with BLAST v2.2.13; still works
589 # with BLAST v2.2.12
590 if ( /\((\-?[\d,]+)\s+letters.*\)/ || /^Length=(\-?[\d,]+)/ ) {
591 $size = $1;
592 $size =~ s/,//g;
593 last;
595 else {
596 # bug 2391
597 $q .= ($q =~ /\w$/ && $_ =~ /^\w/) ? " $_" : $_;
598 $q =~ s/\s+/ /g; # this catches the newline as well
599 $q =~ s/^ | $//g;
602 $_ = $self->_readline;
604 chomp($q);
605 my ( $nm, $desc ) = split( /\s+/, $q, 2 );
606 $self->element(
608 'Name' => 'BlastOutput_query-def',
609 'Data' => $nm
611 ) if $nm;
612 $self->element(
614 'Name' => 'BlastOutput_query-len',
615 'Data' => $size
618 defined $desc && $desc =~ s/\s+$//;
619 $self->element(
621 'Name' => 'BlastOutput_querydesc',
622 'Data' => $desc
625 my ( $gi, $acc, $version ) = $self->_get_seq_identifiers($nm);
626 $version = defined($version) && length($version) ? ".$version" : "";
627 $self->element(
629 'Name' => 'BlastOutput_query-acc',
630 'Data' => "$acc$version"
632 ) if $acc;
634 # these elements are dropped with some multiquery reports; add
635 # back here
636 $self->element(
638 'Name' => 'BlastOutput_db-len',
639 'Data' => $self->{'_blsdb_length'}
641 ) if $self->{'_blsdb_length'};
642 $self->element(
644 'Name' => 'BlastOutput_db-let',
645 'Data' => $self->{'_blsdb_letters'}
647 ) if $self->{'_blsdb_letters'};
648 $self->element(
650 'Name' => 'BlastOutput_db',
651 'Data' => $self->{'_blsdb'}
653 ) if $self->{'_blsdb_letters'};
655 # added check for WU-BLAST -echofilter option (bug 2388)
656 elsif (/^>Unfiltered[+-]1$/) {
657 # skip all of the lines of unfiltered sequence
658 while($_ !~ /^Database:/) {
659 $self->debug("Bypassing features line: $_");
660 $_ = $self->_readline;
662 $self->_pushback($_);
664 elsif (/Sequences producing significant alignments:/) {
665 $self->debug("blast.pm: Processing NCBI-BLAST descriptions\n");
666 $flavor = 'ncbi';
668 # PSI-BLAST parsing needs to be fixed to specifically look
669 # for old vs new per iteration, as sorting based on duplication
670 # leads to bugs, see bug 1986
672 # The next line is not necessarily whitespace in psiblast reports.
673 # Also note that we must look for the end of this section by testing
674 # for a line with a leading >. Blank lines occur with this section
675 # for psiblast.
676 if ( !$self->in_element('iteration') ) {
677 $self->start_element( { 'Name' => 'Iteration' } );
680 # changed 8/28/2008 to exit hit table if blank line is found after an
681 # appropriate line
682 my $h_regex;
683 my $seen_block;
684 DESCLINE:
685 while ( defined( my $descline = $self->_readline() ) ) {
686 if ($descline =~ m{^\s*$}) {
687 last DESCLINE if $seen_block;
688 next DESCLINE;
690 # any text match is part of block...
691 $seen_block++;
692 # GCG multiline oddness...
693 if ($descline =~ /^(\S+)\s+Begin:\s\d+\s+End:\s+\d+/xms) {
694 my ($id, $nextline) = ($1, $self->_readline);
695 $nextline =~ s{^!}{};
696 $descline = "$id $nextline";
698 # NCBI style hit table (no N)
699 if ($descline =~ /(?<!cor) # negative lookahead
700 (\d*\.?(?:[\+\-eE]+)?\d+) # number (float or scientific notation)
701 \s+ # space
702 (\d*\.?(?:[\+\-eE]+)?\d+) # number (float or scientific notation)
703 \s*$/xms) {
705 my ( $score, $evalue ) = ($1, $2);
707 # Some data clean-up so e-value will appear numeric to perl
708 $evalue =~ s/^e/1e/i;
710 # This to handle no-HSP case
711 my @line = split ' ',$descline;
713 # we want to throw away the score, evalue
714 pop @line, pop @line;
716 # and N if it is present (of course they are not
717 # really in that order, but it doesn't matter
718 if ($3) { pop @line }
720 # add the last 2 entries s.t. we can reconstruct
721 # a minimal Hit object at the end of the day
722 push @hit_signifs, [ $evalue, $score, shift @line, join( ' ', @line ) ];
723 } elsif ($descline =~ /^CONVERGED/i) {
724 $self->element(
726 'Name' => 'Iteration_converged',
727 'Data' => 1
730 } else {
731 $self->_pushback($descline); # Catch leading > (end of section)
732 last DESCLINE;
736 elsif (/Sequences producing High-scoring Segment Pairs:/) {
738 # This block is for WU-BLAST, so we don't have to check for psi-blast stuff
739 # skip the next line
740 $self->debug("blast.pm: Processing WU-BLAST descriptions\n");
741 $_ = $self->_readline();
742 $flavor = 'wu';
744 if ( !$self->in_element('iteration') ) {
745 $self->start_element( { 'Name' => 'Iteration' } );
748 while ( defined( $_ = $self->_readline() )
749 && !/^\s+$/ )
751 my @line = split;
752 pop @line; # throw away first number which is for 'N'col
754 # add the last 2 entries to array s.t. we can reconstruct
755 # a minimal Hit object at the end of the day
756 push @hit_signifs,
757 [ pop @line, pop @line, shift @line, join( ' ', @line ) ];
761 elsif (/^Database:\s*(.+?)\s*$/) {
763 $self->debug("blast.pm: Database: $1\n");
764 my $db = $1;
765 while ( defined( $_ = $self->_readline ) ) {
766 if (
767 /^\s+(\-?[\d\,]+|\S+)\s+sequences\;
768 \s+(\-?[\d,]+|\S+)\s+ # Deal with NCBI 2.2.8 OSX problems
769 total\s+letters/ox
772 my ( $s, $l ) = ( $1, $2 );
773 $s =~ s/,//g;
774 $l =~ s/,//g;
775 $self->element(
777 'Name' => 'BlastOutput_db-len',
778 'Data' => $s
781 $self->element(
783 'Name' => 'BlastOutput_db-let',
784 'Data' => $l
787 # cache for next round in cases with multiple queries
788 $self->{'_blsdb'} = $db;
789 $self->{'_blsdb_length'} = $s;
790 $self->{'_blsdb_letters'} = $l;
791 last;
793 else {
794 chomp;
795 $db .= $_;
798 $self->element(
800 'Name' => 'BlastOutput_db',
801 'Data' => $db
806 # move inside of a hit
807 elsif (/^>\s*(\S+)\s*(.*)?/) {
808 chomp;
810 $self->debug("blast.pm: Hit: $1\n");
811 $self->in_element('hsp')
812 && $self->end_element( { 'Name' => 'Hsp' } );
813 $self->in_element('hit')
814 && $self->end_element( { 'Name' => 'Hit' } );
816 # special case when bl2seq reports don't have a leading
817 # Query=
818 if ( !$self->within_element('result') ) {
819 $self->_start_blastoutput;
820 $self->start_element( { 'Name' => 'Iteration' } );
822 elsif ( !$self->within_element('iteration') ) {
823 $self->start_element( { 'Name' => 'Iteration' } );
825 $self->start_element( { 'Name' => 'Hit' } );
826 my $id = $1;
827 my $restofline = $2;
829 $self->debug("Starting a hit: $1 $2\n");
830 $self->element(
832 'Name' => 'Hit_id',
833 'Data' => $id
836 my ($gi, $acc, $version ) = $self->_get_seq_identifiers($id);
837 $self->element(
839 'Name' => 'Hit_accession',
840 'Data' => $acc
843 # add hit significance (from the hit table)
844 # this is where Bug 1986 went awry
846 # Changed for Bug2409; hit->significance and hit->score/bits derived
847 # from HSPs, not hit table unless necessary
849 HITTABLE:
850 while (my $v = shift @hit_signifs) {
851 my $tableid = $v->[2];
852 if ($tableid !~ m{\Q$id\E}) {
853 $self->debug("Hit table ID $tableid doesn't match current hit id $id, checking next hit table entry...\n");
854 next HITTABLE;
855 } else {
856 last HITTABLE;
859 while ( defined( $_ = $self->_readline() ) ) {
860 next if (/^\s+$/);
861 chomp;
862 if (/Length\s*=\s*([\d,]+)/) {
863 my $l = $1;
864 $l =~ s/\,//g;
865 $self->element(
867 'Name' => 'Hit_len',
868 'Data' => $l
871 last;
873 else {
874 if ($restofline !~ /\s$/) { # bug #3235
875 s/^\s(?!\s)/\x01/; #new line to concatenate desc lines with <soh>
877 $restofline .= $_;
880 $restofline =~ s/\s+/ /g;
881 $self->element(
883 'Name' => 'Hit_def',
884 'Data' => $restofline
888 elsif (/\s+(Plus|Minus) Strand HSPs:/i) {
889 next;
891 elsif (
892 ( $self->in_element('hit') || $self->in_element('hsp') )
893 && # paracel genewise BTK
894 m/Score\s*=\s*(\S+)\s*bits\s* # Bit score
895 (?:\((\d+)\))?, # Raw score
896 \s+Log\-Length\sScore\s*=\s*(\d+) # Log-Length score
900 $self->in_element('hsp')
901 && $self->end_element( { 'Name' => 'Hsp' } );
902 $self->start_element( { 'Name' => 'Hsp' } );
904 $self->debug( "Got paracel genewise HSP score=$1\n");
906 # Some data clean-up so e-value will appear numeric to perl
907 my ( $bits, $score, $evalue ) = ( $1, $2, $3 );
908 $evalue =~ s/^e/1e/i;
909 $self->element(
911 'Name' => 'Hsp_score',
912 'Data' => $score
915 $self->element(
917 'Name' => 'Hsp_bit-score',
918 'Data' => $bits
921 $self->element(
923 'Name' => 'Hsp_evalue',
924 'Data' => $evalue
928 elsif (
929 ( $self->in_element('hit') || $self->in_element('hsp') )
930 && # paracel hframe BTK
931 m/Score\s*=\s*([^,\s]+), # Raw score
932 \s*Expect\s*=\s*([^,\s]+), # E-value
933 \s*P(?:\(\S+\))?\s*=\s*([^,\s]+) # P-value
937 $self->in_element('hsp')
938 && $self->end_element( { 'Name' => 'Hsp' } );
939 $self->start_element( { 'Name' => 'Hsp' } );
941 $self->debug( "Got paracel hframe HSP score=$1\n");
943 # Some data clean-up so e-value will appear numeric to perl
944 my ( $score, $evalue, $pvalue ) = ( $1, $2, $3 );
945 $evalue = "1$evalue" if $evalue =~ /^e/;
946 $pvalue = "1$pvalue" if $pvalue =~ /^e/;
948 $self->element(
950 'Name' => 'Hsp_score',
951 'Data' => $score
954 $self->element(
956 'Name' => 'Hsp_evalue',
957 'Data' => $evalue
960 $self->element(
962 'Name' => 'Hsp_pvalue',
963 'Data' => $pvalue
967 elsif (
968 ( $self->in_element('hit') || $self->in_element('hsp') )
969 && # wublast
970 m/Score\s*=\s*(\S+)\s* # Bit score
971 \(([\d\.]+)\s*bits\), # Raw score
972 \s*Expect\s*=\s*([^,\s]+), # E-value
973 \s*(?:Sum)?\s* # SUM
974 P(?:\(\d+\))?\s*=\s*([^,\s]+) # P-value
975 (?:\s*,\s+Group\s*\=\s*(\d+))? # HSP Group
978 { # wu-blast HSP parse
979 $self->in_element('hsp')
980 && $self->end_element( { 'Name' => 'Hsp' } );
981 $self->start_element( { 'Name' => 'Hsp' } );
983 # Some data clean-up so e-value will appear numeric to perl
984 my ( $score, $bits, $evalue, $pvalue, $group ) =
985 ( $1, $2, $3, $4, $5 );
986 $evalue =~ s/^e/1e/i;
987 $pvalue =~ s/^e/1e/i;
989 $self->element(
991 'Name' => 'Hsp_score',
992 'Data' => $score
995 $self->element(
997 'Name' => 'Hsp_bit-score',
998 'Data' => $bits
1001 $self->element(
1003 'Name' => 'Hsp_evalue',
1004 'Data' => $evalue
1007 $self->element(
1009 'Name' => 'Hsp_pvalue',
1010 'Data' => $pvalue
1014 if ( defined $group ) {
1015 $self->element(
1017 'Name' => 'Hsp_group',
1018 'Data' => $group
1024 elsif (
1025 ( $self->in_element('hit') || $self->in_element('hsp') )
1026 && # ncbi blast, works with 2.2.17
1027 m/^\sFeatures\s\w+\sthis\spart/xmso
1028 # If the line begins with "Features in/flanking this part of subject sequence:"
1031 $self->in_element('hsp')
1032 && $self->end_element( { 'Name' => 'Hsp' } );
1033 my $featline;
1034 $_ = $self->_readline;
1035 while($_ !~ /^\s*$/) {
1036 chomp;
1037 $featline .= $_;
1038 $_ = $self->_readline;
1040 $self->_pushback($_);
1041 $featline =~ s{(?:^\s+|\s+^)}{}g;
1042 $featline =~ s{\n}{;}g;
1043 $self->start_element( { 'Name' => 'Hsp' } );
1044 $self->element(
1046 'Name' => 'Hsp_features',
1047 'Data' => $featline
1050 $self->{'_seen_hsp_features'} = 1;
1052 elsif (
1053 ( $self->in_element('hit') || $self->in_element('hsp') )
1054 && # ncbi blast, works with 2.2.17
1055 m/Score\s*=\s*(\S+)\s*bits\s* # Bit score
1056 (?:\((\d+)\))?, # Missing for BLAT pseudo-BLAST fmt
1057 \s*Expect(?:\((\d+\+?)\))?\s*=\s*([^,\s]+) # E-value
1060 { # parse NCBI blast HSP
1061 if( !$self->{'_seen_hsp_features'} ) {
1062 $self->in_element('hsp')
1063 && $self->end_element( { 'Name' => 'Hsp' } );
1064 $self->start_element( { 'Name' => 'Hsp' } );
1067 # Some data clean-up so e-value will appear numeric to perl
1068 my ( $bits, $score, $n, $evalue ) = ( $1, $2, $3, $4 );
1069 $evalue =~ s/^e/1e/i;
1070 $self->element(
1072 'Name' => 'Hsp_score',
1073 'Data' => $score
1076 $self->element(
1078 'Name' => 'Hsp_bit-score',
1079 'Data' => $bits
1082 $self->element(
1084 'Name' => 'Hsp_evalue',
1085 'Data' => $evalue
1088 $self->element(
1090 'Name' => 'Hsp_n',
1091 'Data' => $n
1093 ) if defined $n;
1094 $score = '' unless defined $score; # deal with BLAT which
1095 # has no score only bits
1096 $self->debug("Got NCBI HSP score=$score, evalue $evalue\n");
1098 elsif (
1099 $self->in_element('hsp')
1100 && m/Identities\s*=\s*(\d+)\s*\/\s*(\d+)\s*[\d\%\(\)]+\s*
1101 (?:,\s*Positives\s*=\s*(\d+)\/(\d+)\s*[\d\%\(\)]+\s*)? # pos only valid for Protein alignments
1102 (?:\,\s*Gaps\s*=\s*(\d+)\/(\d+))? # Gaps
1103 /oxi
1106 $self->element(
1108 'Name' => 'Hsp_identity',
1109 'Data' => $1
1112 $self->element(
1114 'Name' => 'Hsp_align-len',
1115 'Data' => $2
1118 if ( defined $3 ) {
1119 $self->element(
1121 'Name' => 'Hsp_positive',
1122 'Data' => $3
1126 else {
1127 $self->element(
1129 'Name' => 'Hsp_positive',
1130 'Data' => $1
1134 if ( defined $6 ) {
1135 $self->element(
1137 'Name' => 'Hsp_gaps',
1138 'Data' => $5
1143 $self->{'_Query'} = { 'begin' => 0, 'end' => 0 };
1144 $self->{'_Sbjct'} = { 'begin' => 0, 'end' => 0 };
1146 if (/(Frame\s*=\s*.+)$/) {
1148 # handle wu-blast Frame listing on same line
1149 $self->_pushback($1);
1152 elsif ( $self->in_element('hsp')
1153 && /Strand\s*=\s*(Plus|Minus)\s*\/\s*(Plus|Minus)/i )
1156 # consume this event ( we infer strand from start/end)
1157 if (!defined($reporttype)) {
1158 $self->{'_reporttype'} = $reporttype = 'BLASTN';
1159 $bl2seq_fix = 1; # special case to resubmit the algorithm
1160 # reporttype
1162 next;
1164 elsif ( $self->in_element('hsp')
1165 && /Links\s*=\s*(\S+)/ox )
1167 $self->element(
1169 'Name' => 'Hsp_links',
1170 'Data' => $1
1174 elsif ( $self->in_element('hsp')
1175 && /Frame\s*=\s*([\+\-][1-3])\s*(\/\s*([\+\-][1-3]))?/ )
1177 my $frame1 = $1 || 0;
1178 my $frame2 = $2 || 0;
1179 # this is for bl2seq only
1180 if ( not defined $reporttype ) {
1181 $bl2seq_fix = 1;
1182 if ( $frame1 && $frame2 ) {
1183 $reporttype = 'TBLASTX'
1185 else {
1186 # We can distinguish between BLASTX and TBLASTN from the report
1187 # (and assign $frame1 properly) by using the start/end from query.
1188 # If the report is BLASTX, the coordinates distance from query
1189 # will be 3 times the length of the alignment shown (coordinates in nt,
1190 # alignment in aa); if not then subject is the nucleotide sequence (TBLASTN).
1191 # Will have to fast-forward to query alignment line and then go back.
1192 my $fh = $self->_fh;
1193 my $file_pos = tell $fh;
1195 my $a_position = '';
1196 my $ali_length = '';
1197 my $b_position = '';
1198 while (my $line = <$fh>) {
1199 if ($line =~ m/^(?:Query|Sbjct):?\s+(\-?\d+)?\s*(\S+)\s+(\-?\d+)?/) {
1200 $a_position = $1;
1201 my $alignment = $2;
1202 $b_position = $3;
1204 use Bio::LocatableSeq;
1205 my $gap_symbols = $Bio::LocatableSeq::GAP_SYMBOLS;
1206 $alignment =~ s/[$gap_symbols]//g;
1207 $ali_length = length($alignment);
1208 last;
1211 my $coord_length = ($a_position < $b_position) ? ($b_position - $a_position + 1)
1212 : ($a_position - $b_position + 1);
1213 ($coord_length == ($ali_length * 3)) ? ($reporttype = 'BLASTX') : ($reporttype = 'TBLASTN');
1215 # Rewind filehandle to its original position to continue parsing
1216 seek $fh, $file_pos, 0;
1218 $self->{'_reporttype'} = $reporttype;
1221 my ( $queryframe, $hitframe );
1222 if ( $reporttype eq 'TBLASTX' ) {
1223 ( $queryframe, $hitframe ) = ( $frame1, $frame2 );
1224 $hitframe =~ s/\/\s*//g;
1226 elsif ( $reporttype eq 'TBLASTN' || $reporttype eq 'PSITBLASTN') {
1227 ( $hitframe, $queryframe ) = ( $frame1, 0 );
1229 elsif ( $reporttype eq 'BLASTX' || $reporttype eq 'RPS-BLAST(BLASTP)') {
1230 ( $queryframe, $hitframe ) = ( $frame1, 0 );
1231 # though NCBI doesn't report it, this is a special BLASTX-like
1232 # RPS-BLAST; should be handled differently
1233 if ($reporttype eq 'RPS-BLAST(BLASTP)') {
1234 $self->element(
1236 'Name' => 'BlastOutput_program',
1237 'Data' => 'RPS-BLAST(BLASTX)'
1242 $self->element(
1244 'Name' => 'Hsp_query-frame',
1245 'Data' => $queryframe
1249 $self->element(
1251 'Name' => 'Hsp_hit-frame',
1252 'Data' => $hitframe
1256 elsif (/^Parameters:/
1257 || /^\s+Database:\s+?/
1258 || /^\s+Subset/
1259 || /^\s*Lambda/
1260 || /^\s*Histogram/
1261 || ( $self->in_element('hsp') && /WARNING|NOTE/ ) )
1264 # Note: Lambda check was necessary to parse
1265 # t/data/ecoli_domains.rpsblast AND to parse bl2seq
1266 $self->debug("blast.pm: found parameters section \n");
1268 $self->in_element('hsp')
1269 && $self->end_element( { 'Name' => 'Hsp' } );
1270 $self->in_element('hit')
1271 && $self->end_element( { 'Name' => 'Hit' } );
1273 # This is for the case when we specify -b 0 (or B=0 for WU-BLAST)
1274 # and still want to construct minimal Hit objects
1275 $self->_cleanup_hits(\@hit_signifs) if scalar(@hit_signifs);
1276 $self->within_element('iteration')
1277 && $self->end_element( { 'Name' => 'Iteration' } );
1279 next if /^\s+Subset/;
1280 my $blast = (/^(\s+Database\:)|(\s*Lambda)/) ? 'ncbi' : 'wublast';
1281 if (/^\s*Histogram/) {
1282 $blast = 'btk';
1285 my $last = '';
1287 # default is that gaps are allowed
1288 $self->element(
1290 'Name' => 'Parameters_allowgaps',
1291 'Data' => 'yes'
1294 while ( defined( $_ = $self->_readline ) ) {
1295 if (
1296 /^((?:\S+)?BLAST[NPX]?)\s+(.+)$/i # NCBI BLAST, PSIBLAST
1297 # RPSBLAST, MEGABLAST
1298 || /^(P?GENEWISE|HFRAME|SWN|TSWN)\s+(.+)/i #Paracel BTK
1301 $self->_pushback($_);
1303 # let's handle this in the loop
1304 last;
1306 elsif (/^Query=/) {
1307 $self->_pushback($_);
1308 $self->_pushback($reportline) if $reportline;
1309 last PARSER;
1312 # here is where difference between wublast and ncbiblast
1313 # is better handled by different logic
1314 if ( /Number of Sequences:\s+([\d\,]+)/i
1315 || /of sequences in database:\s+(\-?[\d,]+)/i )
1317 my $c = $1;
1318 $c =~ s/\,//g;
1319 $self->element(
1321 'Name' => 'Statistics_db-len',
1322 'Data' => $c
1326 elsif (/letters in database:\s+(\-?[\d,]+)/i) {
1327 my $s = $1;
1328 $s =~ s/,//g;
1329 $self->element(
1331 'Name' => 'Statistics_db-let',
1332 'Data' => $s
1336 elsif ( $blast eq 'btk' ) {
1337 next;
1339 elsif ( $blast eq 'wublast' ) {
1341 # warn($_);
1342 if (/E=(\S+)/) {
1343 $self->element(
1345 'Name' => 'Parameters_expect',
1346 'Data' => $1
1350 elsif (/nogaps/) {
1351 $self->element(
1353 'Name' => 'Parameters_allowgaps',
1354 'Data' => 'no'
1358 elsif (/ctxfactor=(\S+)/) {
1359 $self->element(
1361 'Name' => 'Statistics_ctxfactor',
1362 'Data' => $1
1366 elsif (
1367 /(postsw|links|span[12]?|warnings|notes|gi|noseqs|qres|qype)/
1370 $self->element(
1372 'Name' => "Parameters_$1",
1373 'Data' => 'yes'
1377 elsif (/(\S+)=(\S+)/) {
1378 $self->element(
1380 'Name' => "Parameters_$1",
1381 'Data' => $2
1385 elsif ( $last =~ /(Frame|Strand)\s+MatID\s+Matrix name/i ) {
1386 my $firstgapinfo = 1;
1387 my $frame = undef;
1388 while ( defined($_) && !/^\s+$/ ) {
1389 s/^\s+//;
1390 s/\s+$//;
1391 if ( $firstgapinfo
1392 && s/Q=(\d+),R=(\d+)\s+//x )
1394 $firstgapinfo = 0;
1396 $self->element(
1398 'Name' => 'Parameters_gap-open',
1399 'Data' => $1
1402 $self->element(
1404 'Name' => 'Parameters_gap-extend',
1405 'Data' => $2
1408 my @fields = split;
1410 for my $type (
1411 qw(lambda_gapped
1412 kappa_gapped
1413 entropy_gapped)
1416 next if $type eq 'n/a';
1417 if ( !@fields ) {
1418 warn "fields is empty for $type\n";
1419 next;
1421 $self->element(
1423 'Name' =>
1424 "Statistics_frame$frame\_$type",
1425 'Data' => shift @fields
1430 else {
1431 my ( $frameo, $matid, $matrix, @fields ) =
1432 split;
1433 if ( !defined $frame ) {
1435 # keep some sort of default feature I guess
1436 # even though this is sort of wrong
1437 $self->element(
1439 'Name' => 'Parameters_matrix',
1440 'Data' => $matrix
1443 $self->element(
1445 'Name' => 'Statistics_lambda',
1446 'Data' => $fields[0]
1449 $self->element(
1451 'Name' => 'Statistics_kappa',
1452 'Data' => $fields[1]
1455 $self->element(
1457 'Name' => 'Statistics_entropy',
1458 'Data' => $fields[2]
1462 $frame = $frameo;
1463 my $ii = 0;
1464 for my $type (
1465 qw(lambda_used
1466 kappa_used
1467 entropy_used
1468 lambda_computed
1469 kappa_computed
1470 entropy_computed)
1473 my $f = $fields[$ii];
1474 next unless defined $f; # deal with n/a
1475 if ( $f eq 'same' ) {
1476 $f = $fields[ $ii - 3 ];
1478 $ii++;
1479 $self->element(
1481 'Name' =>
1482 "Statistics_frame$frame\_$type",
1483 'Data' => $f
1490 # get the next line
1491 $_ = $self->_readline;
1493 $last = $_;
1495 elsif ( $last =~ /(Frame|Strand)\s+MatID\s+Length/i ) {
1496 my $frame = undef;
1497 while ( defined($_) && !/^\s+/ ) {
1498 s/^\s+//;
1499 s/\s+$//;
1500 my @fields = split;
1501 if ( @fields <= 3 ) {
1502 for my $type (qw(X_gapped E2_gapped S2)) {
1503 last unless @fields;
1504 $self->element(
1506 'Name' =>
1507 "Statistics_frame$frame\_$type",
1508 'Data' => shift @fields
1513 else {
1515 for my $type (
1516 qw(length
1517 efflength
1518 E S W T X E2 S2)
1521 $self->element(
1523 'Name' =>
1524 "Statistics_frame$frame\_$type",
1525 'Data' => shift @fields
1530 $_ = $self->_readline;
1532 $last = $_;
1534 elsif (/(\S+\s+\S+)\s+DFA:\s+(\S+)\s+\((.+)\)/) {
1535 if ( $1 eq 'states in' ) {
1536 $self->element(
1538 'Name' => 'Statistics_DFA_states',
1539 'Data' => "$2 $3"
1543 elsif ( $1 eq 'size of' ) {
1544 $self->element(
1546 'Name' => 'Statistics_DFA_size',
1547 'Data' => "$2 $3"
1552 elsif (
1553 m/^\s+Time to generate neighborhood:\s+
1554 (\S+\s+\S+\s+\S+)/x
1557 $self->element(
1559 'Name' => 'Statistics_neighbortime',
1560 'Data' => $1
1564 elsif (/processors\s+used:\s+(\d+)/) {
1565 $self->element(
1567 'Name' => 'Statistics_noprocessors',
1568 'Data' => $1
1572 elsif (
1573 m/^\s+(\S+)\s+cpu\s+time:\s+ # cputype
1574 (\S+\s+\S+\s+\S+) # cputime
1575 \s+Elapsed:\s+(\S+)/x
1578 my $cputype = lc($1);
1579 $self->element(
1581 'Name' => "Statistics_$cputype\_cputime",
1582 'Data' => $2
1585 $self->element(
1587 'Name' => "Statistics_$cputype\_actualtime",
1588 'Data' => $3
1592 elsif (/^\s+Start:/) {
1593 my ( $junk, $start, $stime, $end, $etime ) =
1594 split( /\s+(Start|End)\:\s+/, $_ );
1595 chomp($stime);
1596 $self->element(
1598 'Name' => 'Statistics_starttime',
1599 'Data' => $stime
1602 chomp($etime);
1603 $self->element(
1605 'Name' => 'Statistics_endtime',
1606 'Data' => $etime
1610 elsif (/^\s+Database:\s+(.+)$/) {
1611 $self->element(
1613 'Name' => 'Parameters_full_dbpath',
1614 'Data' => $1
1619 elsif (/^\s+Posted:\s+(.+)/) {
1620 my $d = $1;
1621 chomp($d);
1622 $self->element(
1624 'Name' => 'Statistics_posted_date',
1625 'Data' => $d
1630 elsif ( $blast eq 'ncbi' ) {
1632 if (m/^Matrix:\s+(.+)\s*$/oxi) {
1633 $self->element(
1635 'Name' => 'Parameters_matrix',
1636 'Data' => $1
1640 elsif (/^Gapped/) {
1641 $gapped_stats = 1;
1643 elsif (/^Lambda/) {
1644 $_ = $self->_readline;
1645 s/^\s+//;
1646 my ( $lambda, $kappa, $entropy ) = split;
1647 if ($gapped_stats) {
1648 $self->element(
1650 'Name' => "Statistics_gapped_lambda",
1651 'Data' => $lambda
1654 $self->element(
1656 'Name' => "Statistics_gapped_kappa",
1657 'Data' => $kappa
1660 $self->element(
1662 'Name' => "Statistics_gapped_entropy",
1663 'Data' => $entropy
1667 else {
1668 $self->element(
1670 'Name' => "Statistics_lambda",
1671 'Data' => $lambda
1674 $self->element(
1676 'Name' => "Statistics_kappa",
1677 'Data' => $kappa
1680 $self->element(
1682 'Name' => "Statistics_entropy",
1683 'Data' => $entropy
1688 elsif (m/effective\s+search\s+space\s+used:\s+(\d+)/oxi) {
1689 $self->element(
1691 'Name' => 'Statistics_eff-spaceused',
1692 'Data' => $1
1696 elsif (m/effective\s+search\s+space:\s+(\d+)/oxi) {
1697 $self->element(
1699 'Name' => 'Statistics_eff-space',
1700 'Data' => $1
1704 elsif (
1705 m/Gap\s+Penalties:\s+Existence:\s+(\d+)\,
1706 \s+Extension:\s+(\d+)/ox
1709 $self->element(
1711 'Name' => 'Parameters_gap-open',
1712 'Data' => $1
1715 $self->element(
1717 'Name' => 'Parameters_gap-extend',
1718 'Data' => $2
1722 elsif (/effective\s+HSP\s+length:\s+(\d+)/) {
1723 $self->element(
1725 'Name' => 'Statistics_hsp-len',
1726 'Data' => $1
1730 elsif (/Number\s+of\s+HSP's\s+better\s+than\s+(\S+)\s+without\s+gapping:\s+(\d+)/) {
1731 $self->element(
1733 'Name' => 'Statistics_number_of_hsps_better_than_expect_value_cutoff_without_gapping',
1734 'Data' => $2
1738 elsif (/Number\s+of\s+HSP's\s+gapped:\s+(\d+)/) {
1739 $self->element(
1741 'Name' => 'Statistics_number_of_hsps_gapped',
1742 'Data' => $1
1746 elsif (/Number\s+of\s+HSP's\s+successfully\s+gapped:\s+(\d+)/) {
1747 $self->element(
1749 'Name' => 'Statistics_number_of_hsps_successfully_gapped',
1750 'Data' => $1
1754 elsif (/Length\s+adjustment:\s+(\d+)/) {
1755 $self->element(
1757 'Name' => 'Statistics_length_adjustment',
1758 'Data' => $1
1762 elsif (/effective\s+length\s+of\s+query:\s+([\d\,]+)/i) {
1763 my $c = $1;
1764 $c =~ s/\,//g;
1765 $self->element(
1767 'Name' => 'Statistics_query-len',
1768 'Data' => $c
1772 elsif (/effective\s+length\s+of\s+database:\s+([\d\,]+)/i) {
1773 my $c = $1;
1774 $c =~ s/\,//g;
1775 $self->element(
1777 'Name' => 'Statistics_eff-dblen',
1778 'Data' => $c
1782 elsif (
1783 /^(T|A|X1|X2|X3|S1|S2):\s+(\d+(\.\d+)?)\s+(?:\(\s*(\d+\.\d+) bits\))?/
1786 my $v = $2;
1787 chomp($v);
1788 $self->element(
1790 'Name' => "Statistics_$1",
1791 'Data' => $v
1794 if ( defined $4 ) {
1795 $self->element(
1797 'Name' => "Statistics_$1_bits",
1798 'Data' => $4
1803 elsif (
1804 m/frameshift\s+window\,
1805 \s+decay\s+const:\s+(\d+)\,\s+([\.\d]+)/x
1808 $self->element(
1810 'Name' => 'Statistics_framewindow',
1811 'Data' => $1
1814 $self->element(
1816 'Name' => 'Statistics_decay',
1817 'Data' => $2
1821 elsif (m/^Number\s+of\s+Hits\s+to\s+DB:\s+(\S+)/ox) {
1822 $self->element(
1824 'Name' => 'Statistics_hit_to_db',
1825 'Data' => $1
1829 elsif (m/^Number\s+of\s+extensions:\s+(\S+)/ox) {
1830 $self->element(
1832 'Name' => 'Statistics_num_extensions',
1833 'Data' => $1
1837 elsif (
1838 m/^Number\s+of\s+successful\s+extensions:\s+
1839 (\S+)/ox
1842 $self->element(
1844 'Name' => 'Statistics_num_suc_extensions',
1845 'Data' => $1
1849 elsif (
1850 m/^Number\s+of\s+sequences\s+better\s+than\s+
1851 (\S+):\s+(\d+)/ox
1854 $self->element(
1856 'Name' => 'Parameters_expect',
1857 'Data' => $1
1860 $self->element(
1862 'Name' => 'Statistics_seqs_better_than_cutoff',
1863 'Data' => $2
1867 elsif (/^\s+Posted\s+date:\s+(.+)/) {
1868 my $d = $1;
1869 chomp($d);
1870 $self->element(
1872 'Name' => 'Statistics_posted_date',
1873 'Data' => $d
1877 elsif ( !/^\s+$/ ) {
1878 #$self->debug( "unmatched stat $_");
1881 $last = $_;
1883 } elsif ( $self->in_element('hsp') ) {
1884 $self->debug("blast.pm: Processing HSP\n");
1885 # let's read 3 lines at a time;
1886 # bl2seq hackiness... Not sure I like
1887 $self->{'_reporttype'} ||= $DEFAULTREPORTTYPE;
1888 my %data = (
1889 'Query' => '',
1890 'Mid' => '',
1891 'Hit' => ''
1893 my $len;
1894 for ( my $i = 0 ; defined($_) && $i < 3 ; $i++ ) {
1895 # $self->debug("$i: $_") if $v;
1896 if ( ( $i == 0 && /^\s+$/) ||
1897 /^\s*(?:Lambda|Minus|Plus|Score)/i
1899 $self->_pushback($_) if defined $_;
1900 $self->end_element( { 'Name' => 'Hsp' } );
1901 last;
1903 chomp;
1904 if (/^((Query|Sbjct):?\s+(\-?\d+)?\s*)(\S+)\s+(\-?\d+)?/) {
1905 my ( $full, $type, $start, $str, $end ) =
1906 ( $1, $2, $3, $4, $5 );
1908 if ( $str eq '-' ) {
1909 $i = 3 if $type eq 'Sbjct';
1911 else {
1912 $data{$type} = $str;
1914 $len = length($full);
1915 $self->{"\_$type"}->{'begin'} = $start
1916 unless $self->{"_$type"}->{'begin'};
1917 $self->{"\_$type"}->{'end'} = $end;
1918 } elsif (/^((Query|Sbjct):?\s+(\-?0+)\s*)/) {
1919 # Bug from NCBI's BLAST: unaligned output
1920 $_ = $self->_readline() for 0..1;
1921 last;
1922 } else {
1923 $self->throw("no data for midline $_")
1924 unless ( defined $_ && defined $len );
1925 $data{'Mid'} = substr( $_, $len );
1927 $_ = $self->_readline();
1929 $self->characters(
1931 'Name' => 'Hsp_qseq',
1932 'Data' => $data{'Query'}
1935 $self->characters(
1937 'Name' => 'Hsp_hseq',
1938 'Data' => $data{'Sbjct'}
1941 $self->characters(
1943 'Name' => 'Hsp_midline',
1944 'Data' => $data{'Mid'}
1948 else {
1949 #$self->debug("blast.pm: unrecognized line $_");
1953 $self->debug("blast.pm: End of BlastOutput\n");
1954 if ( $self->{'_seentop'} ) {
1955 $self->within_element('hsp')
1956 && $self->end_element( { 'Name' => 'Hsp' } );
1957 $self->within_element('hit')
1958 && $self->end_element( { 'Name' => 'Hit' } );
1959 # cleanup extra hits
1960 $self->_cleanup_hits(\@hit_signifs) if scalar(@hit_signifs);
1961 $self->within_element('iteration')
1962 && $self->end_element( { 'Name' => 'Iteration' } );
1963 if ($bl2seq_fix) {
1964 $self->element(
1966 'Name' => 'BlastOutput_program',
1967 'Data' => $reporttype
1971 $self->end_element( { 'Name' => 'BlastOutput' } );
1973 return $self->end_document();
1976 # Private method for internal use only.
1977 sub _start_blastoutput {
1978 my $self = shift;
1979 $self->start_element( { 'Name' => 'BlastOutput' } );
1980 $self->{'_seentop'} = 1;
1981 $self->{'_result_count'}++;
1982 $self->{'_handler_rc'} = undef;
1985 =head2 _will_handle
1987 Title : _will_handle
1988 Usage : Private method. For internal use only.
1989 if( $self->_will_handle($type) ) { ... }
1990 Function: Provides an optimized way to check whether or not an element of a
1991 given type is to be handled.
1992 Returns : Reference to EventHandler object if the element type is to be handled.
1993 undef if the element type is not to be handled.
1994 Args : string containing type of element.
1996 Optimizations:
1998 =over 2
2000 =item 1
2002 Using the cached pointer to the EventHandler to minimize repeated
2003 lookups.
2005 =item 2
2007 Caching the will_handle status for each type that is encountered so
2008 that it only need be checked by calling
2009 handler-E<gt>will_handle($type) once.
2011 =back
2013 This does not lead to a major savings by itself (only 5-10%). In
2014 combination with other optimizations, or for large parse jobs, the
2015 savings good be significant.
2017 To test against the unoptimized version, remove the parentheses from
2018 around the third term in the ternary " ? : " operator and add two
2019 calls to $self-E<gt>_eventHandler().
2021 =cut
2023 sub _will_handle {
2024 my ( $self, $type ) = @_;
2025 my $handler = $self->{'_handler_cache'};
2026 my $will_handle =
2027 defined( $self->{'_will_handle_cache'}->{$type} )
2028 ? $self->{'_will_handle_cache'}->{$type}
2029 : ( $self->{'_will_handle_cache'}->{$type} =
2030 $handler->will_handle($type) );
2032 return $will_handle ? $handler : undef;
2035 =head2 start_element
2037 Title : start_element
2038 Usage : $eventgenerator->start_element
2039 Function: Handles a start element event
2040 Returns : none
2041 Args : hashref with at least 2 keys 'Data' and 'Name'
2043 =cut
2045 sub start_element {
2046 my ( $self, $data ) = @_;
2048 # we currently don't care about attributes
2049 my $nm = $data->{'Name'};
2050 my $type = $MODEMAP{$nm};
2051 if ($type) {
2052 my $handler = $self->_will_handle($type);
2053 if ($handler) {
2054 my $func = sprintf( "start_%s", lc $type );
2055 $self->{'_handler_rc'} = $handler->$func( $data->{'Attributes'} );
2057 #else {
2058 #$self->debug( # changed 4/29/2006 to play nice with other event handlers
2059 # "Bio::SearchIO::InternalParserError ".
2060 # "\nCan't handle elements of type \'$type.\'"
2063 unshift @{ $self->{'_elements'} }, $type;
2064 if ( $type eq 'result' ) {
2065 $self->{'_values'} = {};
2066 $self->{'_result'} = undef;
2067 } else {
2068 # cleanup some things
2069 if ( defined $self->{'_values'} ) {
2070 foreach my $k (
2071 grep { /^\U$type\-/ }
2072 keys %{ $self->{'_values'} }
2075 delete $self->{'_values'}->{$k};
2082 =head2 end_element
2084 Title : end_element
2085 Usage : $eventgenerator->end_element
2086 Function: Handles an end element event
2087 Returns : hashref with an element's worth of data
2088 Args : hashref with at least 2 keys 'Data' and 'Name'
2091 =cut
2093 sub end_element {
2094 my ( $self, $data ) = @_;
2096 my $nm = $data->{'Name'};
2097 my $type;
2098 my $rc;
2099 # cache these (TODO: we should probably cache all cross-report data)
2100 if ( $nm eq 'BlastOutput_program' ) {
2101 if ( $self->{'_last_data'} =~ /(t?blast[npx])/i ) {
2102 $self->{'_reporttype'} = uc $1;
2104 $self->{'_reporttype'} ||= $DEFAULTREPORTTYPE;
2107 if ( $nm eq 'BlastOutput_version' ) {
2108 $self->{'_reportversion'} = $self->{'_last_data'};
2111 # Hsps are sort of weird, in that they end when another
2112 # object begins so have to detect this in end_element for now
2113 if ( $nm eq 'Hsp' ) {
2114 foreach (qw(Hsp_qseq Hsp_midline Hsp_hseq Hsp_features)) {
2115 $self->element(
2117 'Name' => $_,
2118 'Data' => $self->{'_last_hspdata'}->{$_}
2120 ) if defined $self->{'_last_hspdata'}->{$_};
2122 $self->{'_last_hspdata'} = {};
2123 $self->element(
2125 'Name' => 'Hsp_query-from',
2126 'Data' => $self->{'_Query'}->{'begin'}
2129 $self->element(
2131 'Name' => 'Hsp_query-to',
2132 'Data' => $self->{'_Query'}->{'end'}
2136 $self->element(
2138 'Name' => 'Hsp_hit-from',
2139 'Data' => $self->{'_Sbjct'}->{'begin'}
2142 $self->element(
2144 'Name' => 'Hsp_hit-to',
2145 'Data' => $self->{'_Sbjct'}->{'end'}
2149 # } elsif( $nm eq 'Iteration' ) {
2150 # Nothing special needs to be done here.
2152 if ( $type = $MODEMAP{$nm} ) {
2153 my $handler = $self->_will_handle($type);
2154 if ($handler) {
2155 my $func = sprintf( "end_%s", lc $type );
2156 $rc = $handler->$func( $self->{'_reporttype'}, $self->{'_values'} );
2158 shift @{ $self->{'_elements'} };
2161 elsif ( $MAPPING{$nm} ) {
2162 if ( ref( $MAPPING{$nm} ) =~ /hash/i ) {
2164 # this is where we shove in the data from the
2165 # hashref info about params or statistics
2166 my $key = ( keys %{ $MAPPING{$nm} } )[0];
2167 $self->{'_values'}->{$key}->{ $MAPPING{$nm}->{$key} } =
2168 $self->{'_last_data'};
2170 else {
2171 $self->{'_values'}->{ $MAPPING{$nm} } = $self->{'_last_data'};
2174 else {
2175 #$self->debug("blast.pm: unknown nm $nm, ignoring\n");
2177 $self->{'_last_data'} = ''; # remove read data if we are at
2178 # end of an element
2179 $self->{'_result'} = $rc if ( defined $type && $type eq 'result' );
2180 $self->{'_seen_hsp_features'} = 0;
2181 return $rc;
2184 =head2 element
2186 Title : element
2187 Usage : $eventhandler->element({'Name' => $name, 'Data' => $str});
2188 Function: Convenience method that calls start_element, characters, end_element
2189 Returns : none
2190 Args : Hash ref with the keys 'Name' and 'Data'
2193 =cut
2195 sub element {
2196 my ( $self, $data ) = @_;
2197 # Note that start element isn't needed for character data
2198 # Not too SAX-y, though
2199 #$self->start_element($data);
2200 $self->characters($data);
2201 $self->end_element($data);
2204 =head2 characters
2206 Title : characters
2207 Usage : $eventgenerator->characters($str)
2208 Function: Send a character events
2209 Returns : none
2210 Args : string
2213 =cut
2215 sub characters {
2216 my ( $self, $data ) = @_;
2217 if ( $self->in_element('hsp')
2218 && $data->{'Name'} =~ /^Hsp\_(qseq|hseq|midline)$/ )
2220 $self->{'_last_hspdata'}->{ $data->{'Name'} } .= $data->{'Data'}
2221 if defined $data->{'Data'};
2223 return unless ( defined $data->{'Data'} && $data->{'Data'} !~ /^\s+$/ );
2224 $self->{'_last_data'} = $data->{'Data'};
2227 =head2 within_element
2229 Title : within_element
2230 Usage : if( $eventgenerator->within_element($element) ) {}
2231 Function: Test if we are within a particular element
2232 This is different than 'in' because within can be tested
2233 for a whole block.
2234 Returns : boolean
2235 Args : string element name
2237 See Also: L<in_element>
2239 =cut
2241 sub within_element {
2242 my ( $self, $name ) = @_;
2243 return 0
2244 if ( !defined $name && !defined $self->{'_elements'}
2245 || scalar @{ $self->{'_elements'} } == 0 );
2246 foreach ( @{ $self->{'_elements'} } ) {
2247 if ( $_ eq $name ) {
2248 return 1;
2251 return 0;
2254 =head2 in_element
2256 Title : in_element
2257 Usage : if( $eventgenerator->in_element($element) ) {}
2258 Function: Test if we are in a particular element
2259 This is different than 'within_element' because within
2260 can be tested for a whole block.
2261 Returns : boolean
2262 Args : string element name
2264 See Also: L<within_element>
2266 =cut
2268 sub in_element {
2269 my ( $self, $name ) = @_;
2270 return 0 if !defined $self->{'_elements'}->[0];
2271 return ( $self->{'_elements'}->[0] eq $name );
2274 =head2 start_document
2276 Title : start_document
2277 Usage : $eventgenerator->start_document
2278 Function: Handle a start document event
2279 Returns : none
2280 Args : none
2283 =cut
2285 sub start_document {
2286 my ($self) = @_;
2287 $self->{'_lasttype'} = '';
2288 $self->{'_values'} = {};
2289 $self->{'_result'} = undef;
2290 $self->{'_elements'} = [];
2293 =head2 end_document
2295 Title : end_document
2296 Usage : $eventgenerator->end_document
2297 Function: Handles an end document event
2298 Returns : Bio::Search::Result::ResultI object
2299 Args : none
2302 =cut
2304 sub end_document {
2305 my ( $self, @args ) = @_;
2307 #$self->debug("blast.pm: end_document\n");
2308 return $self->{'_result'};
2311 sub write_result {
2312 my ( $self, $blast, @args ) = @_;
2314 if ( not defined( $self->writer ) ) {
2315 $self->warn("Writer not defined. Using a $DEFAULT_BLAST_WRITER_CLASS");
2316 $self->writer( $DEFAULT_BLAST_WRITER_CLASS->new() );
2318 $self->SUPER::write_result( $blast, @args );
2321 sub result_count {
2322 my $self = shift;
2323 return $self->{'_result_count'};
2326 sub report_count { shift->result_count }
2328 =head2 inclusion_threshold
2330 Title : inclusion_threshold
2331 Usage : my $incl_thresh = $isreb->inclusion_threshold;
2332 : $isreb->inclusion_threshold(1e-5);
2333 Function: Get/Set the e-value threshold for inclusion in the PSI-BLAST
2334 score matrix model (blastpgp) that was used for generating the reports
2335 being parsed.
2336 Returns : number (real)
2337 Default value: $Bio::SearchIO::IteratedSearchResultEventBuilder::DEFAULT_INCLUSION_THRESHOLD
2338 Args : number (real) (e.g., 0.0001 or 1e-4 )
2340 =cut
2342 =head2 max_significance
2344 Usage : $obj->max_significance();
2345 Purpose : Set/Get the P or Expect value used as significance screening cutoff.
2346 This is the value of the -signif parameter supplied to new().
2347 Hits with P or E-value above this are skipped.
2348 Returns : Scientific notation number with this format: 1.0e-05.
2349 Argument : Scientific notation number or float (when setting)
2350 Comments : Screening of significant hits uses the data provided on the
2351 : description line. For NCBI BLAST1 and WU-BLAST, this data
2352 : is P-value. for NCBI BLAST2 it is an Expect value.
2354 =cut
2356 =head2 signif
2358 Synonym for L<max_significance()|max_significance>
2360 =cut
2362 =head2 min_score
2364 Usage : $obj->min_score();
2365 Purpose : Set/Get the Blast score used as screening cutoff.
2366 This is the value of the -score parameter supplied to new().
2367 Hits with scores below this are skipped.
2368 Returns : Integer or scientific notation number.
2369 Argument : Integer or scientific notation number (when setting)
2370 Comments : Screening of significant hits uses the data provided on the
2371 : description line.
2373 =cut
2375 =head2 min_query_length
2377 Usage : $obj->min_query_length();
2378 Purpose : Gets the query sequence length used as screening criteria.
2379 This is the value of the -min_query_len parameter supplied to new().
2380 Hits with sequence length below this are skipped.
2381 Returns : Integer
2382 Argument : n/a
2384 =cut
2386 =head2 best_hit_only
2388 Title : best_hit_only
2389 Usage : print "only getting best hit.\n" if $obj->best_hit_only;
2390 Purpose : Set/Get the indicator for whether or not to process only
2391 : the best BlastHit.
2392 Returns : Boolean (1 | 0)
2393 Argument : Boolean (1 | 0) (when setting)
2395 =cut
2397 =head2 check_all_hits
2399 Title : check_all_hits
2400 Usage : print "checking all hits.\n" if $obj->check_all_hits;
2401 Purpose : Set/Get the indicator for whether or not to process all hits.
2402 : If false, the parser will stop processing hits after the
2403 : the first non-significance hit or the first hit that fails
2404 : any hit filter.
2405 Returns : Boolean (1 | 0)
2406 Argument : Boolean (1 | 0) (when setting)
2408 =cut
2410 # general private method used to make minimal hits from leftover
2411 # data in the hit table
2413 sub _cleanup_hits {
2414 my ($self, $hits) = @_;
2415 while ( my $v = shift @{ $hits }) {
2416 next unless defined $v;
2417 $self->start_element( { 'Name' => 'Hit' } );
2418 my $id = $v->[2];
2419 my $desc = $v->[3];
2420 $self->element(
2422 'Name' => 'Hit_id',
2423 'Data' => $id
2426 my ($gi, $acc, $version ) = $self->_get_seq_identifiers($id);
2427 $self->element(
2429 'Name' => 'Hit_accession',
2430 'Data' => $acc
2433 if ( defined $v ) {
2434 $self->element(
2436 'Name' => 'Hit_signif',
2437 'Data' => $v->[0]
2440 if (exists $self->{'_wublast'}) {
2441 $self->element(
2443 'Name' => 'Hit_score',
2444 'Data' => $v->[1]
2447 } else {
2448 $self->element(
2450 'Name' => 'Hit_bits',
2451 'Data' => $v->[1]
2457 $self->element(
2459 'Name' => 'Hit_def',
2460 'Data' => $desc
2463 $self->end_element( { 'Name' => 'Hit' } );
2470 __END__
2472 Developer Notes
2473 ---------------
2475 The following information is added in hopes of increasing the
2476 maintainability of this code. It runs the risk of becoming obsolete as
2477 the code gets updated. As always, double check against the actual
2478 source. If you find any discrepencies, please correct them.
2479 [ This documentation added on 3 Jun 2003. ]
2481 The logic is the brainchild of Jason Stajich, documented by Steve
2482 Chervitz. Jason: please check it over and modify as you see fit.
2484 Question:
2485 Elmo wants to know: How does this module unmarshall data from the input stream?
2486 (i.e., how does information from a raw input file get added to
2487 the correct Bioperl object?)
2489 Answer:
2491 This answer is specific to SearchIO::blast, but may apply to other
2492 SearchIO.pm subclasses as well. The following description gives the
2493 basic idea. The actual processing is a little more complex for
2494 certain types of data (HSP, Report Parameters).
2496 You can think of blast::next_result() as faking a SAX XML parser,
2497 making a non-XML document behave like its XML. The overhead to do this
2498 is quite substantial (~650 lines of code instead of ~80 in
2499 blastxml.pm).
2501 0. First, add a key => value pair for the datum of interest to %MAPPING
2502 Example:
2503 'Foo_bar' => 'Foo-bar',
2505 1. next_result() collects the datum of interest from the input stream,
2506 and calls element().
2507 Example:
2508 $self->element({ 'Name' => 'Foo_bar',
2509 'Data' => $foobar});
2511 2. The element() method is a convenience method that calls start_element(),
2512 characters(), and end_element().
2514 3. start_element() checks to see if the event handler can handle a start_xxx(),
2515 where xxx = the 'Name' parameter passed into element(), and calls start_xxx()
2516 if so. Otherwise, start_element() does not do anything.
2518 Data that will have such an event handler are defined in %MODEMAP.
2519 Typically, there are only handler methods for the main parts of
2520 the search result (e.g., Result, Iteration, Hit, HSP),
2521 which have corresponding Bioperl modules. So in this example,
2522 there was an earlier call such as $self->element({'Name'=>'Foo'})
2523 and the Foo_bar datum is meant to ultimately go into a Foo object.
2525 The start_foo() method in the handler will typically do any
2526 data initialization necessary to prepare for creating a new Foo object.
2527 Example: SearchResultEventBuilder::start_result()
2529 4. characters() takes the value of the 'Data' key from the hashref argument in
2530 the elements() call and saves it in a local data member:
2531 Example:
2532 $self->{'_last_data'} = $data->{'Data'};
2534 5. end_element() is like start_element() in that it does the check for whether
2535 the event handler can handle end_xxx() and if so, calls it, passing in
2536 the data collected from all of the characters() calls that occurred
2537 since the start_xxx() call.
2539 If there isn't any special handler for the data type specified by 'Name',
2540 end_element() will place the data saved by characters() into another
2541 local data member that saves it in a hash with a key defined by %MAPPING.
2542 Example:
2543 $nm = $data->{'Name'};
2544 $self->{'_values'}->{$MAPPING{$nm}} = $self->{'_last_data'};
2546 In this case, $MAPPING{$nm} is 'Foo-bar'.
2548 end_element() finishes by resetting the local data member used by
2549 characters(). (i.e., $self->{'_last_data'} = '';)
2551 6. When the next_result() method encounters the end of the Foo element in the
2552 input stream. It will invoke $self->end_element({'Name'=>'Foo'}).
2553 end_element() then sends all of the data in the $self->{'_values'} hash.
2554 Note that $self->{'_values'} is cleaned out during start_element(),
2555 keeping it at a resonable size.
2557 In the event handler, the end_foo() method takes the hash from end_element()
2558 and creates a new hash containing the same data, but having keys lacking
2559 the 'Foo' prefix (e.g., 'Foo-bar' becomes '-bar'). The handler's end_foo()
2560 method then creates the Foo object, passing in this new hash as an argument.
2561 Example: SearchResultEventBuilder::end_result()
2563 7. Objects created from the data in the search result are managed by
2564 the event handler which adds them to a ResultI object (using API methods
2565 for that object). The ResultI object gets passed back to
2566 SearchIO::end_element() when it calls end_result().
2568 The ResultI object is then saved in an internal data member of the
2569 SearchIO object, which returns it at the end of next_result()
2570 by calling end_document().
2572 (Technical Note: All objects created by end_xxx() methods in the event
2573 handler are returned to SearchIO::end_element(), but the SearchIO object
2574 only cares about the ResultI objects.)
2576 (Sesame Street aficionados note: This answer was NOT given by Mr. Noodle ;-P)