fix for neg. scores (courtesy of Bruno Vecchi)
[bioperl-run.git] / Bio / Tools / Run / Alignment / Clustalw.pm
blobea200349378c920c7ee88a248741a1a2204479fb
1 # $Id$
3 # BioPerl module for Bio::Tools::Run::Alignment::Clustalw
5 # You may distribute this module under the same terms as perl itself
6 # POD documentation - main docs before the code
8 =head1 NAME
10 Bio::Tools::Run::Alignment::Clustalw - Object for the calculation of a
11 multiple sequence alignment from a set of unaligned sequences or
12 alignments using the Clustalw program
14 =head1 SYNOPSIS
16 # Build a clustalw alignment factory
17 @params = ('ktuple' => 2, 'matrix' => 'BLOSUM');
18 $factory = Bio::Tools::Run::Alignment::Clustalw->new(@params);
20 # Pass the factory a list of sequences to be aligned.
21 $inputfilename = 't/data/cysprot.fa';
22 $aln = $factory->align($inputfilename); # $aln is a SimpleAlign object.
23 # or
24 $seq_array_ref = \@seq_array;
25 # where @seq_array is an array of Bio::Seq objects
26 $aln = $factory->align($seq_array_ref);
28 # Or one can pass the factory a pair of (sub)alignments
29 #to be aligned against each other, e.g.:
30 $aln = $factory->profile_align($aln1,$aln2);
31 # where $aln1 and $aln2 are Bio::SimpleAlign objects.
33 # Or one can pass the factory an alignment and one or more unaligned
34 # sequences to be added to the alignment. For example:
35 $aln = $factory->profile_align($aln1,$seq); # $seq is a Bio::Seq object.
37 # Get a tree of the sequences
38 $tree = $factory->tree(\@seq_array);
40 # Get both an alignment and a tree
41 ($aln, $tree) = $factory->run(\@seq_array);
43 # Do a footprinting analysis on the supplied sequences, getting back the
44 # most conserved sub-alignments
45 my @results = $factory->footprint(\@seq_array);
46 foreach my $result (@results) {
47 print $result->consensus_string, "\n";
50 # There are various additional options and input formats available.
51 # See the DESCRIPTION section that follows for additional details.
53 =head1 DESCRIPTION
55 Note: this DESCRIPTION only documents the Bioperl interface to
56 Clustalw. Clustalw, itself, is a large & complex program - for more
57 information regarding clustalw, please see the clustalw documentation
58 which accompanies the clustalw distribution. Clustalw is available
59 from (among others) ftp://ftp.ebi.ac.uk/pub/software/. Clustalw.pm has
60 only been tested using version 1.8 of clustalw. Compatibility with
61 earlier versions of the clustalw program is currently unknown. Before
62 running Clustalw successfully it will be necessary: to install clustalw
63 on your system, and to ensure that users have execute privileges for
64 the clustalw program.
66 =head2 Helping the module find your executable
68 You will need to enable Clustalw to find the clustalw program. This
69 can be done in (at least) three ways:
71 1. Make sure the clustalw executable is in your path so that
72 which clustalw
73 returns a clustalw executable on your system.
75 2. Define an environmental variable CLUSTALDIR which is a
76 directory which contains the 'clustalw' application:
77 In bash:
79 export CLUSTALDIR=/home/username/clustalw1.8
81 In csh/tcsh:
83 setenv CLUSTALDIR /home/username/clustalw1.8
85 3. Include a definition of an environmental variable CLUSTALDIR in
86 every script that will use this Clustalw wrapper module, e.g.:
88 BEGIN { $ENV{CLUSTALDIR} = '/home/username/clustalw1.8/' }
89 use Bio::Tools::Run::Alignment::Clustalw;
91 If you are running an application on a webserver make sure the
92 webserver environment has the proper PATH set or use the options 2 or
93 3 to set the variables.
95 =head2 How it works
97 Bio::Tools::Run::Alignment::Clustalw is an object for performing a
98 multiple sequence alignment from a set of unaligned sequences and/or
99 sub-alignments by means of the clustalw program.
101 Initially, a clustalw "factory object" is created. Optionally, the
102 factory may be passed most of the parameters or switches of the
103 clustalw program, e.g.:
105 @params = ('ktuple' => 2, 'matrix' => 'BLOSUM');
106 $factory = Bio::Tools::Run::Alignment::Clustalw->new(@params);
108 Any parameters not explicitly set will remain as the defaults of the
109 clustalw program. Additional parameters and switches (not available
110 in clustalw) may also be set. Currently, the only such parameter is
111 "quiet", which when set to a non-zero value, suppresses clustalw
112 terminal output. Not all clustalw parameters are supported at this
113 stage.
115 By default, Clustalw output is returned solely in a the form of a
116 Bio::SimpleAlign object which can then be printed and/or saved
117 in multiple formats using the AlignIO.pm module. Optionally the raw
118 clustalw output file can be saved if the calling script specifies an
119 output file (with the clustalw parameter OUTFILE). Currently only the
120 GCG-MSF output file formats is supported.
122 Not all parameters and features have been implemented yet in Perl format.
124 Alignment parameters can be changed and/or examined at any time after
125 the factory has been created. The program checks that any
126 parameter/switch being set/read is valid. However, currently no
127 additional checks are included to check that parameters are of the
128 proper type (eg string or numeric) or that their values are within the
129 proper range. As an example, to change the value of the clustalw
130 parameter ktuple to 3 and subsequently to check its value one would
131 write:
133 $ktuple = 3;
134 $factory->ktuple($ktuple);
135 $get_ktuple = $factory->ktuple();
137 Once the factory has been created and the appropriate parameters set,
138 one can call the method align() to align a set of unaligned sequences,
139 or call profile_align() to add one or more sequences or a second
140 alignment to an initial alignment.
142 Input to align() may consist of a set of unaligned sequences in the
143 form of the name of file containing the sequences. For example,
145 $inputfilename = 't/data/cysprot.fa';
146 $aln = $factory-E<gt>align($inputfilename);
148 Alternately one can create an array of Bio::Seq objects somehow
150 $str = Bio::SeqIO->new(-file=> 't/data/cysprot.fa', -format => 'Fasta');
151 @seq_array =();
152 while ( my $seq = $str->next_seq() ) {push (@seq_array, $seq) ;}
154 and pass the factory a reference to that array
156 $seq_array_ref = \@seq_array;
157 $aln = $factory->align($seq_array_ref);
159 In either case, align() returns a reference to a SimpleAlign object
160 which can then used (see L<Bio::SimpleAlign>).
162 Once an initial alignment exists, one can pass the factory additional
163 sequence(s) to be added (ie aligned) to the original alignment. The
164 alignment can be passed as either an alignment file or a
165 Bio:SimpleAlign object. The unaligned sequence(s) can be passed as a
166 filename or as an array of BioPerl sequence objects or as a single
167 BioPerl Seq object. For example (to add a single sequence to an
168 alignment),
170 $str = Bio::AlignIO->new(-file=> 't/data/cysprot1a.msf');
171 $aln = $str->next_aln();
172 $str1 = Bio::SeqIO->new(-file=> 't/data/cysprot1b.fa');
173 $seq = $str1->next_seq();
174 $aln = $factory->profile_align($aln,$seq);
176 In either case, profile_align() returns a reference to a SimpleAlign
177 object containing a new SimpleAlign object of the alignment with the
178 additional sequence(s) added in.
180 Finally one can pass the factory a pair of (sub)alignments to be
181 aligned against each other. The alignments can be passed in the form
182 of either a pair of alignment files or a pair of Bio:SimpleAlign
183 objects. For example,
185 $profile1 = 't/data/cysprot1a.msf';
186 $profile2 = 't/data/cysprot1b.msf';
187 $aln = $factory->profile_align($profile1,$profile2);
191 $str1 = Bio::AlignIO->new(-file=> 't/data/cysprot1a.msf');
192 $aln1 = $str1->next_aln();
193 $str2 = Bio::AlignIO->new(-file=> 't/data/cysprot1b.msf');
194 $aln2 = $str2->next_aln();
195 $aln = $factory->profile_align($aln1,$aln2);
197 In either case, profile_align() returns a reference to a SimpleAlign
198 object containing an (super)alignment of the two input alignments.
200 For more examples of syntax and use of Clustalw, the user is
201 encouraged to look at the script Clustalw.t in the t/ directory.
203 Note: Clustalw is still under development. Various features of the
204 clustalw program have not yet been implemented. If you would like
205 that a specific clustalw feature be added to this perl contact
206 bioperl-l@bioperl.org.
208 These can be specified as paramters when instantiating a new Clustalw
209 object, or through get/set methods of the same name (lowercase).
211 =head1 PARAMETER FOR ALIGNMENT COMPUTATION
213 =head2 KTUPLE
215 Title : KTUPLE
216 Description : (optional) set the word size to be used in the alignment
217 This is the size of exactly matching fragment that is used.
218 INCREASE for speed (max= 2 for proteins; 4 for DNA),
219 DECREASE for sensitivity.
220 For longer sequences (e.g. >1000 residues) you may
221 need to increase the default
223 =head2 TOPDIAGS
225 Title : TOPDIAGS
226 Description : (optional) number of best diagonals to use
227 The number of k-tuple matches on each diagonal
228 (in an imaginary dot-matrix plot) is calculated.
229 Only the best ones (with most matches) are used in
230 the alignment. This parameter specifies how many.
231 Decrease for speed; increase for sensitivity.
233 =head2 WINDOW
235 Title : WINDOW
236 Description : (optional) window size
237 This is the number of diagonals around each of the 'best'
238 diagonals that will be used. Decrease for speed;
239 increase for sensitivity.
241 =head2 PAIRGAP
243 Title : PAIRGAP
244 Description : (optional) gap penalty for pairwise alignments
245 This is a penalty for each gap in the fast alignments.
246 It has little affect on the speed or sensitivity except
247 for extreme values.
249 =head2 FIXEDGAP
251 Title : FIXEDGAP
252 Description : (optional) fixed length gap penalty
254 =head2 FLOATGAP
256 Title : FLOATGAP
257 Description : (optional) variable length gap penalty
259 =head2 MATRIX
261 Title : MATRIX
262 Default : PAM100 for DNA - PAM250 for protein alignment
263 Description : (optional) substitution matrix used in the multiple
264 alignments. Depends on the version of clustalw as to
265 what default matrix will be used
267 PROTEIN WEIGHT MATRIX leads to a new menu where you are
268 offered a choice of weight matrices. The default for
269 proteins in version 1.8 is the PAM series derived by
270 Gonnet and colleagues. Note, a series is used! The
271 actual matrix that is used depends on how similar the
272 sequences to be aligned at this alignment step
273 are. Different matrices work differently at each
274 evolutionary distance.
276 DNA WEIGHT MATRIX leads to a new menu where a single
277 matrix (not a series) can be selected. The default is
278 the matrix used by BESTFIT for comparison of nucleic
279 acid sequences.
281 =head2 TYPE
283 Title : TYPE
284 Description : (optional) sequence type: protein or DNA. This allows
285 you to explicitly overide the programs attempt at
286 guessing the type of the sequence. It is only useful
287 if you are using sequences with a VERY strange
288 composition.
290 =head2 OUTPUT
292 Title : OUTPUT
293 Description : (optional) clustalw supports GCG or PHYLIP or PIR or
294 Clustal format. See the Bio::AlignIO modules for
295 which formats are supported by bioperl.
297 =head2 OUTFILE
299 Title : OUTFILE
300 Description : (optional) Name of clustalw output file. If not set
301 module will erase output file. In any case alignment will
302 be returned in the form of SimpleAlign objects
304 =head2 TRANSMIT
306 Title : TRANSMIT
307 Description : (optional) transitions not weighted. The default is to
308 weight transitions as more favourable than other
309 mismatches in DNA alignments. This switch makes all
310 nucleotide mismatches equally weighted.
312 =head1 FEEDBACK
314 =head2 Mailing Lists
316 User feedback is an integral part of the evolution of this and other
317 Bioperl modules. Send your comments and suggestions preferably to one
318 of the Bioperl mailing lists. Your participation is much appreciated.
320 bioperl-l@bioperl.org - General discussion
321 http://bioperl.org/wiki/Mailing_lists - About the mailing lists
323 =head2 Reporting Bugs
325 Report bugs to the Bioperl bug tracking system to help us keep track
326 the bugs and their resolution. Bug reports can be submitted via the
327 web:
329 http://bugzilla.open-bio.org/
331 =head1 AUTHOR - Peter Schattner
333 Email schattner@alum.mit.edu
335 =head1 CONTRIBUTORS
337 Jason Stajich jason-AT-bioperl_DOT_org
338 Sendu Bala bix@sendu.me.uk
340 =head1 APPENDIX
342 The rest of the documentation details each of the object
343 methods. Internal methods are usually preceded with a _
345 =cut
349 package Bio::Tools::Run::Alignment::Clustalw;
351 use strict;
352 use Bio::Seq;
353 use Bio::SeqIO;
354 use Bio::SimpleAlign;
355 use Bio::AlignIO;
356 use Bio::TreeIO;
357 use Bio::Root::IO;
359 use base qw(Bio::Root::Root Bio::Tools::Run::WrapperBase);
361 our @CLUSTALW_PARAMS = qw(output ktuple topdiags window pairgap fixedgap
362 floatgap matrix type transit dnamatrix outfile
363 gapopen gapext maxdiv gapdist hgapresidues pwmatrix
364 pwdnamatrix pwgapopen pwgapext score transweight
365 seed helixgap outorder strandgap loopgap terminalgap
366 helixendin helixendout strandendin strandendout program
367 reps outputtree seed bootlabels bootstrap);
369 our @CLUSTALW_SWITCHES = qw(help check options negative noweights endgaps
370 nopgap nohgap novgap kimura tossgaps
371 kimura tossgaps njtree);
372 our @OTHER_SWITCHES = qw(quiet);
373 our $PROGRAM_NAME = 'clustalw';
374 our $PROGRAM_DIR = $ENV{'CLUSTALDIR'} || $ENV{'CLUSTALWDIR'};
377 =head2 program_name
379 Title : program_name
380 Usage : $factory>program_name()
381 Function: holds the program name
382 Returns: string
383 Args : None
385 =cut
387 sub program_name {
388 return $PROGRAM_NAME;
391 =head2 program_dir
393 Title : program_dir
394 Usage : $factory->program_dir(@params)
395 Function: returns the program directory, obtained from ENV variable.
396 Returns: string
397 Args :
399 =cut
401 sub program_dir {
402 return $PROGRAM_DIR;
405 sub new {
406 my ($class,@args) = @_;
407 my $self = $class->SUPER::new(@args);
409 $self->_set_from_args(\@args, -methods => [@CLUSTALW_PARAMS,
410 @CLUSTALW_SWITCHES,
411 @OTHER_SWITCHES],
412 -create => 1);
414 return $self;
417 =head2 version
419 Title : version
420 Usage : exit if $prog->version() < 1.8
421 Function: Determine the version number of the program
422 Example :
423 Returns : float or undef
424 Args : none
426 =cut
428 sub version {
429 my ($self) = @_;
431 return undef unless $self->executable;
432 my $prog = $self->executable;
433 my $string = `$prog --` ;
434 $string =~ /\(?([\d.]+)\)?/xms;
435 return $1 || undef;
438 =head2 run
440 Title : run
441 Usage : ($aln, $tree) = $factory->run($inputfilename);
442 ($aln, $tree) = $factory->run($seq_array_ref);
443 Function: Perform a multiple sequence alignment, generating a tree at the same
444 time. (Like align() and tree() combined.)
445 Returns : A SimpleAlign object containing the sequence alignment and a
446 Bio::Tree::Tree object with the tree relating the sequences.
447 Args : Name of a file containing a set of unaligned fasta sequences
448 or else an array of references to Bio::Seq objects.
450 =cut
452 sub run {
453 my ($self,$input) = @_;
454 my ($temp,$infilename, $seq);
455 my ($attr, $value, $switch);
457 $self->io->_io_cleanup();
458 # Create input file pointer
459 $infilename = $self->_setinput($input);
460 $self->throw("Bad input data (sequences need an id) or less than 2 sequences in $input!") unless $infilename;
462 # Create parameter string to pass to clustalw program
463 my $param_string = $self->_setparams();
465 # run clustalw
466 return $self->_run('both', $infilename, $param_string);
469 =head2 align
471 Title : align
472 Usage : $inputfilename = 't/data/cysprot.fa';
473 $aln = $factory->align($inputfilename);
475 $seq_array_ref = \@seq_array; # @seq_array is array of Seq objs
476 $aln = $factory->align($seq_array_ref);
477 Function: Perform a multiple sequence alignment
478 Returns : Reference to a SimpleAlign object containing the
479 sequence alignment.
480 Args : Name of a file containing a set of unaligned fasta sequences
481 or else an array of references to Bio::Seq objects.
483 Throws an exception if argument is not either a string (eg a
484 filename) or a reference to an array of Bio::Seq objects. If
485 argument is string, throws exception if file corresponding to string
486 name can not be found. If argument is Bio::Seq array, throws
487 exception if less than two sequence objects are in array.
489 =cut
491 sub align {
492 my ($self,$input) = @_;
494 $self->io->_io_cleanup();
496 # Create input file pointer
497 my $infilename = $self->_setinput($input);
498 $self->throw("Bad input data (sequences need an id ) or less than 2 sequences in $input !") unless $infilename;
500 # Create parameter string to pass to clustalw program
501 my $param_string = $self->_setparams();
503 # run clustalw
504 my $aln = $self->_run('align', $infilename, $param_string);
508 =head2 profile_align
510 Title : profile_align
511 Usage : $aln = $factory->profile_align(@simple_aligns);
513 $aln = $factory->profile_align(@subalignment_filenames);
514 Function: Perform an alignment of 2 (sub)alignments
515 Returns : Reference to a SimpleAlign object containing the (super)alignment.
516 Args : Names of 2 files containing the subalignments
517 or references to 2 Bio::SimpleAlign objects.
519 Throws an exception if arguments are not either strings (eg filenames)
520 or references to SimpleAlign objects.
522 =cut
524 sub profile_align {
525 my ($self,$input1,$input2) = @_;
527 $self->io->_io_cleanup();
529 # Create input file pointer
530 my $infilename1 = $self->_setinput($input1, 1);
531 my $infilename2 = $self->_setinput($input2, 2);
532 if (!$infilename1 || !$infilename2) {$self->throw("Bad input data: $input1 or $input2 !");}
533 unless ( -e $infilename1 and -e $infilename2) {$self->throw("Bad input file: $input1 or $input2 !");}
535 # Create parameter string to pass to clustalw program
536 my $param_string = $self->_setparams();
538 # run clustalw
539 my $aln = $self->_run('profile-aln', $infilename1, $infilename2, $param_string);
542 =head2 add_sequences
544 Title : add_sequences
545 Usage :
546 Function: Align and add sequences into an alignment
547 Example :
548 Returns : Reference to a SimpleAlign object containing the (super)alignment.
549 Args : Names of 2 files, the first one containing an alignment and the second one containing sequences to be added
550 or references to 2 Bio::SimpleAlign objects.
552 Throws an exception if arguments are not either strings (eg filenames)
553 or references to SimpleAlign objects.
556 =cut
558 sub add_sequences {
560 my ($self,$input1,$input2) = @_;
561 my ($temp,$infilename1,$infilename2,$input,$seq);
563 $self->io->_io_cleanup();
564 # Create input file pointer
565 $infilename1 = $self->_setinput($input1,1);
566 $infilename2 = $self->_setinput($input2,2);
567 if (!$infilename1 || !$infilename2) {$self->throw("Bad input data: $input1 or $input2 !");}
568 unless ( -e $infilename1 and -e $infilename2) {$self->throw("Bad input file: $input1 or $input2 !");}
571 # Create parameter string to pass to clustalw program
572 my $param_string = $self->_setparams();
573 # run clustalw
574 my $aln = $self->_run('add_sequences', $infilename1,
575 $infilename2, $param_string);
579 =head2 tree
581 Title : tree
582 Usage : @params = ('bootstrap' => 1000,
583 'tossgaps' => 1,
584 'kimura' => 1,
585 'seed' => 121,
586 'bootlabels'=> 'nodes',
587 'quiet' => 1);
588 $factory = Bio::Tools::Run::Alignment::Clustalw->new(@params);
589 $tree_obj = $factory->tree($aln_obj);
591 $tree_obj = $factory->tree($treefilename);
592 Function: Retrieve a tree corresponding to the input
593 Returns : Bio::TreeIO object
594 Args : Bio::SimpleAlign or filename of a tree
596 =cut
598 sub tree {
599 my ($self,$input) = @_;
601 $self->io->_io_cleanup();
603 # Create input file pointer
604 my $infilename = $self->_setinput($input);
606 if (!$infilename) {$self->throw("Bad input data (sequences need an id ) or less than 2 sequences in $input !");}
608 # Create parameter string to pass to clustalw program
609 my $param_string = $self->_setparams();
611 # run clustalw
612 my $tree = $self->_run('tree', $infilename, $param_string);
615 =head2 footprint
617 Title : footprint
618 Usage : @alns = $factory->footprint($treefilename, $window_size, $diff);
619 @alns = $factory->footprint($seqs_array_ref);
620 Function: Aligns all the supplied sequences and slices out of the alignment
621 those regions along a sliding window who's tree length differs
622 significantly from the total average tree length.
623 Returns : list of Bio::SimpleAlign objects
624 Args : first argument as per run(), optional second argument to specify
625 the size of the sliding window (default 5 bp) and optional third
626 argument to specify the % difference from the total tree length
627 needed for a window to be considered a footprint (default 33%).
629 =cut
631 sub footprint {
632 my ($self, $in, $slice_size, $deviate) = @_;
634 my ($simplealn, $tree) = $self->run($in);
636 # total tree length?
637 my $total_length = $tree->total_branch_length;
639 # tree length along sliding window, picking regions that significantly
640 # deviate from the average tree length
641 $slice_size ||= 5;
642 $deviate ||= 33;
643 my $threshold = $total_length - (($total_length / 100) * $deviate);
644 my $length = $simplealn->length;
645 my $below = 0;
646 my $found_minima = 0;
647 my $minima = [$threshold, ''];
648 my @results;
649 for my $i (1..($length - $slice_size + 1)) {
650 my $slice = $simplealn->slice($i, ($i + $slice_size - 1), 1);
651 my $tree = $self->tree($slice);
652 my $slice_length = $tree->total_branch_length;
654 $slice_length <= $threshold ? ($below = 1) : ($below = 0);
655 if ($below) {
656 unless ($found_minima) {
657 if ($slice_length < ${$minima}[0]) {
658 $minima = [$slice_length, $slice];
660 else {
661 push(@results, ${$minima}[1]);
662 $minima = [$threshold, ''];
663 $found_minima = 1;
667 else {
668 $found_minima = 0;
672 return @results;
675 =head2 _run
677 Title : _run
678 Usage : Internal function, not to be called directly
679 Function: makes actual system call to clustalw program
680 Returns : nothing; clustalw output is written to a
681 temporary file
682 Args : Name of a file containing a set of unaligned fasta sequences
683 and hash of parameters to be passed to clustalw
685 =cut
687 sub _run {
688 my ($self, $command, $infile1, $infile2, $param_string) = @_;
690 my ($instring, $tree);
691 my $quiet = $self->quiet() || $self->verbose() < 0;
693 if ($command =~ /align|both/) {
694 if ($^O eq 'dec_osf') {
695 $instring = $infile1;
696 $command = '';
698 else {
699 $instring = " -infile=$infile1";
701 $param_string .= " $infile2";
704 if ($command =~ /profile/) {
705 $instring = "-profile1=$infile1 -profile2=$infile2";
706 chmod 0777, $infile1, $infile2;
707 $command = '-profile';
710 if ($command =~ /add_sequences/) {
711 $instring = "-profile1=$infile1 -profile2=$infile2";
712 chmod 0777, $infile1,$infile2;
713 $command = '-sequences';
716 if ($command =~ /tree/) {
717 if( $^O eq 'dec_osf' ) {
718 $instring = $infile1;
719 $command = '';
721 else {
722 $instring = " $infile1";
724 $param_string .= " $infile2";
726 $self->debug( "Program ".$self->executable."\n");
727 my $commandstring = $self->executable."$instring"."$param_string";
728 $commandstring .= ' 1>/dev/null' if $quiet;
729 $self->debug( "clustal command = $commandstring");
731 my $status = system($commandstring);
732 unless( $status == 0 ) {
733 $self->warn( "Clustalw call ($commandstring) crashed: $? \n");
734 return undef;
737 return $self->_get_tree($infile1, $param_string);
740 my $output = $self->output || 'gcg';
741 $self->debug( "Program ".$self->executable."\n");
742 my $commandstring = $self->executable." $command"." $instring"." -output=$output". " $param_string";
743 $self->debug( "clustal command = $commandstring");
745 open(my $pipe, "$commandstring |") || $self->throw("ClustalW call ($commandstring) failed to start: $? | $!");
746 my $score;
747 while (<$pipe>) {
748 print unless $quiet;
749 # Kevin Brown suggested the following regex, though it matches multiple
750 # times: we pick up the last one
751 $score = $1 if ($_ =~ /Score:(\d+)/);
752 # This one is printed at the end and seems the most appropriate to pick
753 # up; we include the above regex incase 'Alignment Score' isn't given
754 $score = $1 if ($_ =~ /Alignment Score (-?\d+)/);
756 close($pipe) || ($self->throw("ClustalW call ($commandstring) crashed: $?"));
758 my $outfile = $self->outfile();
760 # retrieve alignment (Note: MSF format for AlignIO = GCG format of clustalw)
761 my $format = $output =~ /gcg/i ? 'msf' : $output;
762 if ($format =~ /clustal/i) {
763 $format = 'clustalw'; # force clustalw incase 'clustal' is requested
765 my $in = Bio::AlignIO->new(-file => $outfile, -format=> $format);
766 my $aln = $in->next_aln();
767 $in->close;
768 $aln->score($score);
770 if ($command eq 'both') {
771 $tree = $self->_get_tree($infile1, $param_string);
774 # Clean up the temporary files created along the way...
775 # Replace file suffix with dnd to find name of dendrogram file(s) to delete
776 unless ( $self->save_tempfiles ) {
777 foreach my $f ($infile1, $infile2) {
778 $f =~ s/\.[^\.]*$// ;
779 unlink $f .'.dnd' if ($f ne '');
783 if ($command eq 'both') {
784 return ($aln, $tree);
786 return $aln;
789 sub _get_tree {
790 my ($self, $treefile, $param_string) = @_;
792 if ($param_string =~ /-bootstrap/) {
793 $treefile .= '.phb';
795 elsif ($param_string =~ /-tree/) {
796 $treefile .= '.ph';
798 else {
799 $treefile .= '.dnd';
802 my $in = Bio::TreeIO->new('-file' => $treefile,
803 '-format'=> 'newick');
805 my $tree = $in->next_tree;
806 unless ( $self->save_tempfiles ) {
807 foreach my $f ( $treefile ) {
808 $f =~ s/\.[^\.]*$// ;
809 unlink $f if( $f ne '' );
813 return $tree;
816 =head2 _setinput()
818 Title : _setinput
819 Usage : Internal function, not to be called directly
820 Function: Create input file for clustalw program
821 Returns : name of file containing clustalw data input
822 Args : Seq or Align object reference or input file name
824 =cut
826 sub _setinput {
827 my ($self, $input, $suffix) = @_;
828 my ($infilename, $seq, $temp, $tfh);
830 # suffix is used to distinguish alignment files If $input is not a
831 # reference it better be the name of a file with the sequence/
833 # alignment data...
834 unless (ref $input) {
835 # check that file exists or throw
836 $infilename = $input;
837 return unless -e $input;
838 return $infilename;
841 # $input may be an array of BioSeq objects...
842 if (ref($input) eq "ARRAY") {
843 # Open temporary file for both reading & writing of BioSeq array
844 ($tfh,$infilename) = $self->io->tempfile(-dir=>$self->tempdir);
845 $temp = Bio::SeqIO->new('-fh'=>$tfh, '-format' =>'Fasta');
847 # Need at least 2 seqs for alignment
848 return unless (scalar(@$input) > 1);
850 foreach $seq (@$input) {
851 return unless (defined $seq && $seq->isa("Bio::PrimarySeqI") and $seq->id());
852 $temp->write_seq($seq);
854 $temp->close();
855 close($tfh);
856 undef $tfh;
857 return $infilename;
860 # $input may be a SimpleAlign object.
861 elsif (ref($input) eq "Bio::SimpleAlign") {
862 # Open temporary file for both reading & writing of SimpleAlign object
863 ($tfh,$infilename) = $self->io->tempfile(-dir=>$self->tempdir);
864 $temp = Bio::AlignIO->new('-fh'=> $tfh, '-format' => 'fasta');
865 $temp->write_aln($input);
866 close($tfh);
867 undef $tfh;
868 return $infilename;
871 # or $input may be a single BioSeq object (to be added to a previous alignment)
872 elsif (ref($input) && $input->isa("Bio::PrimarySeqI") && $suffix==2) {
873 # Open temporary file for both reading & writing of BioSeq object
874 ($tfh,$infilename) = $self->io->tempfile();
875 $temp = Bio::SeqIO->new(-fh=> $tfh, '-format' =>'Fasta');
876 $temp->write_seq($input);
877 close($tfh);
878 undef $tfh;
879 return $infilename;
882 return;
885 =head2 _setparams()
887 Title : _setparams
888 Usage : Internal function, not to be called directly
889 Function: Create parameter inputs for clustalw program
890 Returns : parameter string to be passed to clustalw
891 during align or profile_align
892 Args : name of calling object
894 =cut
896 sub _setparams {
897 my $self = shift;
899 my $param_string = $self->SUPER::_setparams(-params => \@CLUSTALW_PARAMS,
900 -switches => \@CLUSTALW_SWITCHES,
901 -dash => 1,
902 -lc => 1,
903 -join => '=');
905 # Set default output file if no explicit output file selected
906 unless ($param_string =~ /outfile/) {
907 my ($tfh, $outfile) = $self->io->tempfile(-dir => $self->tempdir());
908 close($tfh);
909 undef $tfh;
910 $self->outfile($outfile);
911 $param_string .= " -outfile=$outfile" ;
914 $param_string .= ' 2>&1';
916 return $param_string;