speelink fixes, patch courtesy Charles Plessy, fixes #3256
[bioperl-run.git] / lib / Bio / Tools / Run / Alignment / Clustalw.pm
blobe9afe69c73247e0188d8f699515ac8fb1ed5539b
1 # $Id$
3 # BioPerl module for Bio::Tools::Run::Alignment::Clustalw
5 # You may distribute this module under the same terms as perl itself
6 # POD documentation - main docs before the code
8 =head1 NAME
10 Bio::Tools::Run::Alignment::Clustalw - Object for the calculation of a
11 multiple sequence alignment from a set of unaligned sequences or
12 alignments using the Clustalw program
14 =head1 SYNOPSIS
16 # Build a clustalw alignment factory
17 @params = ('ktuple' => 2, 'matrix' => 'BLOSUM');
18 $factory = Bio::Tools::Run::Alignment::Clustalw->new(@params);
20 # Pass the factory a list of sequences to be aligned.
21 $inputfilename = 't/data/cysprot.fa';
22 $aln = $factory->align($inputfilename); # $aln is a SimpleAlign object.
23 # or
24 $seq_array_ref = \@seq_array;
25 # where @seq_array is an array of Bio::Seq objects
26 $aln = $factory->align($seq_array_ref);
28 # Or one can pass the factory a pair of (sub)alignments
29 #to be aligned against each other, e.g.:
30 $aln = $factory->profile_align($aln1,$aln2);
31 # where $aln1 and $aln2 are Bio::SimpleAlign objects.
33 # Or one can pass the factory an alignment and one or more unaligned
34 # sequences to be added to the alignment. For example:
35 $aln = $factory->profile_align($aln1,$seq); # $seq is a Bio::Seq object.
37 # Get a tree of the sequences
38 $tree = $factory->tree(\@seq_array);
40 # Get both an alignment and a tree
41 ($aln, $tree) = $factory->run(\@seq_array);
43 # Do a footprinting analysis on the supplied sequences, getting back the
44 # most conserved sub-alignments
45 my @results = $factory->footprint(\@seq_array);
46 foreach my $result (@results) {
47 print $result->consensus_string, "\n";
50 # There are various additional options and input formats available.
51 # See the DESCRIPTION section that follows for additional details.
53 =head1 DESCRIPTION
55 Note: this DESCRIPTION only documents the Bioperl interface to
56 Clustalw. Clustalw, itself, is a large & complex program - for more
57 information regarding clustalw, please see the clustalw documentation
58 which accompanies the clustalw distribution. Clustalw is available
59 from (among others) ftp://ftp.ebi.ac.uk/pub/software/. Clustalw.pm has
60 only been tested using version 1.8 of clustalw. Compatibility with
61 earlier versions of the clustalw program is currently unknown. Before
62 running Clustalw successfully it will be necessary: to install clustalw
63 on your system, and to ensure that users have execute privileges for
64 the clustalw program.
66 =head2 Helping the module find your executable
68 You will need to enable Clustalw to find the clustalw program. This
69 can be done in (at least) three ways:
71 1. Make sure the clustalw executable is in your path so that
72 which clustalw
73 returns a clustalw executable on your system.
75 2. Define an environmental variable CLUSTALDIR which is a
76 directory which contains the 'clustalw' application:
77 In bash:
79 export CLUSTALDIR=/home/username/clustalw1.8
81 In csh/tcsh:
83 setenv CLUSTALDIR /home/username/clustalw1.8
85 3. Include a definition of an environmental variable CLUSTALDIR in
86 every script that will use this Clustalw wrapper module, e.g.:
88 BEGIN { $ENV{CLUSTALDIR} = '/home/username/clustalw1.8/' }
89 use Bio::Tools::Run::Alignment::Clustalw;
91 If you are running an application on a webserver make sure the
92 webserver environment has the proper PATH set or use the options 2 or
93 3 to set the variables.
95 =head2 How it works
97 Bio::Tools::Run::Alignment::Clustalw is an object for performing a
98 multiple sequence alignment from a set of unaligned sequences and/or
99 sub-alignments by means of the clustalw program.
101 Initially, a clustalw "factory object" is created. Optionally, the
102 factory may be passed most of the parameters or switches of the
103 clustalw program, e.g.:
105 @params = ('ktuple' => 2, 'matrix' => 'BLOSUM');
106 $factory = Bio::Tools::Run::Alignment::Clustalw->new(@params);
108 Any parameters not explicitly set will remain as the defaults of the
109 clustalw program. Additional parameters and switches (not available
110 in clustalw) may also be set. Currently, the only such parameter is
111 "quiet", which when set to a non-zero value, suppresses clustalw
112 terminal output. Not all clustalw parameters are supported at this
113 stage.
115 By default, Clustalw output is returned solely in a the form of a
116 Bio::SimpleAlign object which can then be printed and/or saved
117 in multiple formats using the AlignIO.pm module. Optionally the raw
118 clustalw output file can be saved if the calling script specifies an
119 output file (with the clustalw parameter OUTFILE). Currently only the
120 GCG-MSF output file formats is supported.
122 Not all parameters and features have been implemented yet in Perl format.
124 Alignment parameters can be changed and/or examined at any time after
125 the factory has been created. The program checks that any
126 parameter/switch being set/read is valid. However, currently no
127 additional checks are included to check that parameters are of the
128 proper type (eg string or numeric) or that their values are within the
129 proper range. As an example, to change the value of the clustalw
130 parameter ktuple to 3 and subsequently to check its value one would
131 write:
133 $ktuple = 3;
134 $factory->ktuple($ktuple);
135 $get_ktuple = $factory->ktuple();
137 Once the factory has been created and the appropriate parameters set,
138 one can call the method align() to align a set of unaligned sequences,
139 or call profile_align() to add one or more sequences or a second
140 alignment to an initial alignment.
142 Input to align() may consist of a set of unaligned sequences in the
143 form of the name of file containing the sequences. For example,
145 $inputfilename = 't/data/cysprot.fa';
146 $aln = $factory-E<gt>align($inputfilename);
148 Alternately one can create an array of Bio::Seq objects somehow
150 $str = Bio::SeqIO->new(-file=> 't/data/cysprot.fa', -format => 'Fasta');
151 @seq_array =();
152 while ( my $seq = $str->next_seq() ) {push (@seq_array, $seq) ;}
154 and pass the factory a reference to that array
156 $seq_array_ref = \@seq_array;
157 $aln = $factory->align($seq_array_ref);
159 In either case, align() returns a reference to a SimpleAlign object
160 which can then used (see L<Bio::SimpleAlign>).
162 Once an initial alignment exists, one can pass the factory additional
163 sequence(s) to be added (ie aligned) to the original alignment. The
164 alignment can be passed as either an alignment file or a
165 Bio:SimpleAlign object. The unaligned sequence(s) can be passed as a
166 filename or as an array of BioPerl sequence objects or as a single
167 BioPerl Seq object. For example (to add a single sequence to an
168 alignment),
170 $str = Bio::AlignIO->new(-file=> 't/data/cysprot1a.msf');
171 $aln = $str->next_aln();
172 $str1 = Bio::SeqIO->new(-file=> 't/data/cysprot1b.fa');
173 $seq = $str1->next_seq();
174 $aln = $factory->profile_align($aln,$seq);
176 In either case, profile_align() returns a reference to a SimpleAlign
177 object containing a new SimpleAlign object of the alignment with the
178 additional sequence(s) added in.
180 Finally one can pass the factory a pair of (sub)alignments to be
181 aligned against each other. The alignments can be passed in the form
182 of either a pair of alignment files or a pair of Bio:SimpleAlign
183 objects. For example,
185 $profile1 = 't/data/cysprot1a.msf';
186 $profile2 = 't/data/cysprot1b.msf';
187 $aln = $factory->profile_align($profile1,$profile2);
191 $str1 = Bio::AlignIO->new(-file=> 't/data/cysprot1a.msf');
192 $aln1 = $str1->next_aln();
193 $str2 = Bio::AlignIO->new(-file=> 't/data/cysprot1b.msf');
194 $aln2 = $str2->next_aln();
195 $aln = $factory->profile_align($aln1,$aln2);
197 In either case, profile_align() returns a reference to a SimpleAlign
198 object containing an (super)alignment of the two input alignments.
200 For more examples of syntax and use of Clustalw, the user is
201 encouraged to look at the script Clustalw.t in the t/ directory.
203 Note: Clustalw is still under development. Various features of the
204 clustalw program have not yet been implemented. If you would like
205 that a specific clustalw feature be added to this perl contact
206 bioperl-l@bioperl.org.
208 These can be specified as parameters when instantiating a new Clustalw
209 object, or through get/set methods of the same name (lowercase).
211 =head1 PARAMETER FOR ALIGNMENT COMPUTATION
213 =head2 KTUPLE
215 Title : KTUPLE
216 Description : (optional) set the word size to be used in the alignment
217 This is the size of exactly matching fragment that is used.
218 INCREASE for speed (max= 2 for proteins; 4 for DNA),
219 DECREASE for sensitivity.
220 For longer sequences (e.g. >1000 residues) you may
221 need to increase the default
223 =head2 TOPDIAGS
225 Title : TOPDIAGS
226 Description : (optional) number of best diagonals to use
227 The number of k-tuple matches on each diagonal
228 (in an imaginary dot-matrix plot) is calculated.
229 Only the best ones (with most matches) are used in
230 the alignment. This parameter specifies how many.
231 Decrease for speed; increase for sensitivity.
233 =head2 WINDOW
235 Title : WINDOW
236 Description : (optional) window size
237 This is the number of diagonals around each of the 'best'
238 diagonals that will be used. Decrease for speed;
239 increase for sensitivity.
241 =head2 PAIRGAP
243 Title : PAIRGAP
244 Description : (optional) gap penalty for pairwise alignments
245 This is a penalty for each gap in the fast alignments.
246 It has little affect on the speed or sensitivity except
247 for extreme values.
249 =head2 FIXEDGAP
251 Title : FIXEDGAP
252 Description : (optional) fixed length gap penalty
254 =head2 FLOATGAP
256 Title : FLOATGAP
257 Description : (optional) variable length gap penalty
259 =head2 MATRIX
261 Title : MATRIX
262 Default : PAM100 for DNA - PAM250 for protein alignment
263 Description : (optional) substitution matrix used in the multiple
264 alignments. Depends on the version of clustalw as to
265 what default matrix will be used
267 PROTEIN WEIGHT MATRIX leads to a new menu where you are
268 offered a choice of weight matrices. The default for
269 proteins in version 1.8 is the PAM series derived by
270 Gonnet and colleagues. Note, a series is used! The
271 actual matrix that is used depends on how similar the
272 sequences to be aligned at this alignment step
273 are. Different matrices work differently at each
274 evolutionary distance.
276 DNA WEIGHT MATRIX leads to a new menu where a single
277 matrix (not a series) can be selected. The default is
278 the matrix used by BESTFIT for comparison of nucleic
279 acid sequences.
281 =head2 TYPE
283 Title : TYPE
284 Description : (optional) sequence type: protein or DNA. This allows
285 you to explicitly overide the programs attempt at
286 guessing the type of the sequence. It is only useful
287 if you are using sequences with a VERY strange
288 composition.
290 =head2 OUTPUT
292 Title : OUTPUT
293 Description : (optional) clustalw supports GCG or PHYLIP or PIR or
294 Clustal format. See the Bio::AlignIO modules for
295 which formats are supported by bioperl.
297 =head2 OUTFILE
299 Title : OUTFILE
300 Description : (optional) Name of clustalw output file. If not set
301 module will erase output file. In any case alignment will
302 be returned in the form of SimpleAlign objects
304 =head2 TRANSMIT
306 Title : TRANSMIT
307 Description : (optional) transitions not weighted. The default is to
308 weight transitions as more favourable than other
309 mismatches in DNA alignments. This switch makes all
310 nucleotide mismatches equally weighted.
312 =head1 FEEDBACK
314 =head2 Mailing Lists
316 User feedback is an integral part of the evolution of this and other
317 Bioperl modules. Send your comments and suggestions preferably to one
318 of the Bioperl mailing lists. Your participation is much appreciated.
320 bioperl-l@bioperl.org - General discussion
321 http://bioperl.org/wiki/Mailing_lists - About the mailing lists
323 =head2 Support
325 Please direct usage questions or support issues to the mailing list:
327 I<bioperl-l@bioperl.org>
329 rather than to the module maintainer directly. Many experienced and
330 reponsive experts will be able look at the problem and quickly
331 address it. Please include a thorough description of the problem
332 with code and data examples if at all possible.
334 =head2 Reporting Bugs
336 Report bugs to the Bioperl bug tracking system to help us keep track
337 the bugs and their resolution. Bug reports can be submitted via the
338 web:
340 http://redmine.open-bio.org/projects/bioperl/
342 =head1 AUTHOR - Peter Schattner
344 Email schattner@alum.mit.edu
346 =head1 CONTRIBUTORS
348 Jason Stajich jason-AT-bioperl_DOT_org
349 Sendu Bala bix@sendu.me.uk
351 =head1 APPENDIX
353 The rest of the documentation details each of the object
354 methods. Internal methods are usually preceded with a _
356 =cut
360 package Bio::Tools::Run::Alignment::Clustalw;
362 use strict;
363 use Bio::Seq;
364 use Bio::SeqIO;
365 use Bio::SimpleAlign;
366 use Bio::AlignIO;
367 use Bio::TreeIO;
368 use Bio::Root::IO;
370 use base qw(Bio::Root::Root Bio::Tools::Run::WrapperBase);
372 our @CLUSTALW_PARAMS = qw(output ktuple topdiags window pairgap fixedgap
373 floatgap matrix type transit dnamatrix outfile
374 gapopen gapext maxdiv gapdist hgapresidues pwmatrix
375 pwdnamatrix pwgapopen pwgapext score transweight
376 seed helixgap outorder strandgap loopgap terminalgap
377 helixendin helixendout strandendin strandendout program
378 reps outputtree seed bootlabels bootstrap);
380 our @CLUSTALW_SWITCHES = qw(help check options negative noweights endgaps
381 nopgap nohgap novgap kimura tossgaps
382 kimura tossgaps njtree);
383 our @OTHER_SWITCHES = qw(quiet);
384 our $PROGRAM_NAME = 'clustalw';
385 our $PROGRAM_DIR = $ENV{'CLUSTALDIR'} || $ENV{'CLUSTALWDIR'};
388 =head2 program_name
390 Title : program_name
391 Usage : $factory>program_name()
392 Function: holds the program name
393 Returns: string
394 Args : None
396 =cut
398 sub program_name {
399 return $PROGRAM_NAME;
402 =head2 program_dir
404 Title : program_dir
405 Usage : $factory->program_dir(@params)
406 Function: returns the program directory, obtained from ENV variable.
407 Returns: string
408 Args :
410 =cut
412 sub program_dir {
413 return $PROGRAM_DIR;
416 sub new {
417 my ($class,@args) = @_;
418 my $self = $class->SUPER::new(@args);
420 $self->_set_from_args(\@args, -methods => [@CLUSTALW_PARAMS,
421 @CLUSTALW_SWITCHES,
422 @OTHER_SWITCHES],
423 -create => 1);
425 return $self;
428 =head2 version
430 Title : version
431 Usage : exit if $prog->version() < 1.8
432 Function: Determine the version number of the program
433 Example :
434 Returns : float or undef
435 Args : none
437 =cut
439 sub version {
440 my ($self) = @_;
442 return undef unless $self->executable;
443 my $prog = $self->executable;
444 my $string = `$prog -- 2>&1` ;
445 $string =~ /\(?([\d.]+)\)?/xms;
446 return $1 || undef;
449 =head2 run
451 Title : run
452 Usage : ($aln, $tree) = $factory->run($inputfilename);
453 ($aln, $tree) = $factory->run($seq_array_ref);
454 Function: Perform a multiple sequence alignment, generating a tree at the same
455 time. (Like align() and tree() combined.)
456 Returns : A SimpleAlign object containing the sequence alignment and a
457 Bio::Tree::Tree object with the tree relating the sequences.
458 Args : Name of a file containing a set of unaligned fasta sequences
459 or else an array of references to Bio::Seq objects.
461 =cut
463 sub run {
464 my ($self,$input) = @_;
465 my ($temp,$infilename, $seq);
466 my ($attr, $value, $switch);
468 $self->io->_io_cleanup();
469 # Create input file pointer
470 $infilename = $self->_setinput($input);
471 $self->throw("Bad input data (sequences need an id) or less than 2 sequences in $input!") unless $infilename;
473 # Create parameter string to pass to clustalw program
474 my $param_string = $self->_setparams();
476 # run clustalw
477 return $self->_run('both', $infilename, $param_string);
480 =head2 align
482 Title : align
483 Usage : $inputfilename = 't/data/cysprot.fa';
484 $aln = $factory->align($inputfilename);
486 $seq_array_ref = \@seq_array; # @seq_array is array of Seq objs
487 $aln = $factory->align($seq_array_ref);
488 Function: Perform a multiple sequence alignment
489 Returns : Reference to a SimpleAlign object containing the
490 sequence alignment.
491 Args : Name of a file containing a set of unaligned fasta sequences
492 or else an array of references to Bio::Seq objects.
494 Throws an exception if argument is not either a string (eg a
495 filename) or a reference to an array of Bio::Seq objects. If
496 argument is string, throws exception if file corresponding to string
497 name can not be found. If argument is Bio::Seq array, throws
498 exception if less than two sequence objects are in array.
500 =cut
502 sub align {
503 my ($self,$input) = @_;
505 $self->io->_io_cleanup();
507 # Create input file pointer
508 my $infilename = $self->_setinput($input);
509 $self->throw("Bad input data (sequences need an id ) or less than 2 sequences in $input !") unless $infilename;
511 # Create parameter string to pass to clustalw program
512 my $param_string = $self->_setparams();
514 # run clustalw
515 my $aln = $self->_run('align', $infilename, $param_string);
519 =head2 profile_align
521 Title : profile_align
522 Usage : $aln = $factory->profile_align(@simple_aligns);
524 $aln = $factory->profile_align(@subalignment_filenames);
525 Function: Perform an alignment of 2 (sub)alignments
526 Returns : Reference to a SimpleAlign object containing the (super)alignment.
527 Args : Names of 2 files containing the subalignments
528 or references to 2 Bio::SimpleAlign objects.
530 Throws an exception if arguments are not either strings (eg filenames)
531 or references to SimpleAlign objects.
533 =cut
535 sub profile_align {
536 my ($self,$input1,$input2) = @_;
538 $self->io->_io_cleanup();
540 # Create input file pointer
541 my $infilename1 = $self->_setinput($input1, 1);
542 my $infilename2 = $self->_setinput($input2, 2);
543 if (!$infilename1 || !$infilename2) {$self->throw("Bad input data: $input1 or $input2 !");}
544 unless ( -e $infilename1 and -e $infilename2) {$self->throw("Bad input file: $input1 or $input2 !");}
546 # Create parameter string to pass to clustalw program
547 my $param_string = $self->_setparams();
549 # run clustalw
550 my $aln = $self->_run('profile-aln', $infilename1, $infilename2, $param_string);
553 =head2 add_sequences
555 Title : add_sequences
556 Usage :
557 Function: Align and add sequences into an alignment
558 Example :
559 Returns : Reference to a SimpleAlign object containing the (super)alignment.
560 Args : Names of 2 files, the first one containing an alignment and the second one containing sequences to be added
561 or references to 2 Bio::SimpleAlign objects.
563 Throws an exception if arguments are not either strings (eg filenames)
564 or references to SimpleAlign objects.
567 =cut
569 sub add_sequences {
571 my ($self,$input1,$input2) = @_;
572 my ($temp,$infilename1,$infilename2,$input,$seq);
574 $self->io->_io_cleanup();
575 # Create input file pointer
576 $infilename1 = $self->_setinput($input1,1);
577 $infilename2 = $self->_setinput($input2,2);
578 if (!$infilename1 || !$infilename2) {$self->throw("Bad input data: $input1 or $input2 !");}
579 unless ( -e $infilename1 and -e $infilename2) {$self->throw("Bad input file: $input1 or $input2 !");}
582 # Create parameter string to pass to clustalw program
583 my $param_string = $self->_setparams();
584 # run clustalw
585 my $aln = $self->_run('add_sequences', $infilename1,
586 $infilename2, $param_string);
590 =head2 tree
592 Title : tree
593 Usage : @params = ('bootstrap' => 1000,
594 'tossgaps' => 1,
595 'kimura' => 1,
596 'seed' => 121,
597 'bootlabels'=> 'nodes',
598 'quiet' => 1);
599 $factory = Bio::Tools::Run::Alignment::Clustalw->new(@params);
600 $tree_obj = $factory->tree($aln_obj);
602 $tree_obj = $factory->tree($treefilename);
603 Function: Retrieve a tree corresponding to the input
604 Returns : Bio::TreeIO object
605 Args : Bio::SimpleAlign or filename of a tree
607 =cut
609 sub tree {
610 my ($self,$input) = @_;
612 $self->io->_io_cleanup();
614 # Create input file pointer
615 my $infilename = $self->_setinput($input);
617 if (!$infilename) {$self->throw("Bad input data (sequences need an id ) or less than 2 sequences in $input !");}
619 # Create parameter string to pass to clustalw program
620 my $param_string = $self->_setparams();
622 # run clustalw
623 my $tree = $self->_run('tree', $infilename, $param_string);
626 =head2 footprint
628 Title : footprint
629 Usage : @alns = $factory->footprint($treefilename, $window_size, $diff);
630 @alns = $factory->footprint($seqs_array_ref);
631 Function: Aligns all the supplied sequences and slices out of the alignment
632 those regions along a sliding window who's tree length differs
633 significantly from the total average tree length.
634 Returns : list of Bio::SimpleAlign objects
635 Args : first argument as per run(), optional second argument to specify
636 the size of the sliding window (default 5 bp) and optional third
637 argument to specify the % difference from the total tree length
638 needed for a window to be considered a footprint (default 33%).
640 =cut
642 sub footprint {
643 my ($self, $in, $slice_size, $deviate) = @_;
645 my ($simplealn, $tree) = $self->run($in);
647 # total tree length?
648 my $total_length = $tree->total_branch_length;
650 # tree length along sliding window, picking regions that significantly
651 # deviate from the average tree length
652 $slice_size ||= 5;
653 $deviate ||= 33;
654 my $threshold = $total_length - (($total_length / 100) * $deviate);
655 my $length = $simplealn->length;
656 my $below = 0;
657 my $found_minima = 0;
658 my $minima = [$threshold, ''];
659 my @results;
660 for my $i (1..($length - $slice_size + 1)) {
661 my $slice = $simplealn->slice($i, ($i + $slice_size - 1), 1);
662 my $tree = $self->tree($slice);
663 $self->throw("No tree returned") unless defined $tree;
664 my $slice_length = $tree->total_branch_length;
666 $slice_length <= $threshold ? ($below = 1) : ($below = 0);
667 if ($below) {
668 unless ($found_minima) {
669 if ($slice_length < ${$minima}[0]) {
670 $minima = [$slice_length, $slice];
672 else {
673 push(@results, ${$minima}[1]);
674 $minima = [$threshold, ''];
675 $found_minima = 1;
679 else {
680 $found_minima = 0;
684 return @results;
687 =head2 _run
689 Title : _run
690 Usage : Internal function, not to be called directly
691 Function: makes actual system call to clustalw program
692 Returns : nothing; clustalw output is written to a
693 temporary file
694 Args : Name of a file containing a set of unaligned fasta sequences
695 and hash of parameters to be passed to clustalw
697 =cut
699 sub _run {
700 my ($self, $command, $infile1, $infile2, $param_string) = @_;
702 my ($instring, $tree);
703 my $quiet = $self->quiet() || $self->verbose() < 0;
705 if ($command =~ /align|both/) {
706 if ($^O eq 'dec_osf') {
707 $instring = $infile1;
708 $command = '';
710 else {
711 $instring = " -infile=". '"' . $infile1 . '"';
713 $param_string .= " $infile2";
716 if ($command =~ /profile/) {
717 $instring = "-profile1=$infile1 -profile2=$infile2";
718 chmod 0777, $infile1, $infile2;
719 $command = '-profile';
722 if ($command =~ /add_sequences/) {
723 $instring = "-profile1=$infile1 -profile2=$infile2";
724 chmod 0777, $infile1,$infile2;
725 $command = '-sequences';
728 if ($command =~ /tree/) {
729 if( $^O eq 'dec_osf' ) {
730 $instring = $infile1;
731 $command = '';
733 else {
734 $instring = " -infile=". '"' . $infile1 . '"';
736 $param_string .= " $infile2";
738 $self->debug( "Program ".$self->executable."\n");
739 my $commandstring = $self->executable."$instring"."$param_string";
740 $commandstring .= ' 1>/dev/null' if $quiet;
741 $self->debug( "clustal command = $commandstring");
743 my $status = system($commandstring);
744 unless( $status == 0 ) {
745 $self->warn( "Clustalw call ($commandstring) crashed: $? \n");
746 return undef;
749 return $self->_get_tree($infile1, $param_string);
752 my $output = $self->output || 'gcg';
753 $self->debug( "Program ".$self->executable."\n");
754 my $commandstring = $self->executable." $command"." $instring"." -output=$output". " $param_string";
755 $self->debug( "clustal command = $commandstring\n");
757 open(my $pipe, "$commandstring |") || $self->throw("ClustalW call ($commandstring) failed to start: $? | $!");
758 my $score;
759 while (<$pipe>) {
760 print unless $quiet;
761 # Kevin Brown suggested the following regex, though it matches multiple
762 # times: we pick up the last one
763 $score = $1 if ($_ =~ /Score:(\d+)/);
764 # This one is printed at the end and seems the most appropriate to pick
765 # up; we include the above regex incase 'Alignment Score' isn't given
766 $score = $1 if ($_ =~ /Alignment Score (-?\d+)/);
768 close($pipe) || ($self->throw("ClustalW call ($commandstring) crashed: $?"));
770 my $outfile = $self->outfile();
772 # retrieve alignment (Note: MSF format for AlignIO = GCG format of clustalw)
773 my $format = $output =~ /gcg/i ? 'msf' : $output;
774 if ($format =~ /clustal/i) {
775 $format = 'clustalw'; # force clustalw incase 'clustal' is requested
777 my $in = Bio::AlignIO->new(-file => $outfile, -format=> $format);
778 my $aln = $in->next_aln();
779 $in->close;
780 $aln->score($score);
782 if ($command eq 'both') {
783 $tree = $self->_get_tree($infile1, $param_string);
786 # Clean up the temporary files created along the way...
787 # Replace file suffix with dnd to find name of dendrogram file(s) to delete
788 unless ( $self->save_tempfiles ) {
789 foreach my $f ($infile1, $infile2) {
790 $f =~ s/\.[^\.]*$// ;
791 unlink $f .'.dnd' if ($f ne '');
795 if ($command eq 'both') {
796 return ($aln, $tree);
798 return $aln;
801 sub _get_tree {
802 my ($self, $treefile, $param_string) = @_;
804 $treefile =~ s/\.[^\.]*$// ;
806 if ($param_string =~ /-bootstrap/) {
807 $treefile .= '.phb';
809 elsif ($param_string =~ /-tree/) {
810 $treefile .= '.ph';
812 else {
813 $treefile .= '.dnd';
816 my $in = Bio::TreeIO->new('-file' => $treefile,
817 '-format'=> 'newick');
819 my $tree = $in->next_tree;
820 unless ( $self->save_tempfiles ) {
821 foreach my $f ( $treefile ) {
822 unlink $f if( $f ne '' );
826 return $tree;
829 =head2 _setinput()
831 Title : _setinput
832 Usage : Internal function, not to be called directly
833 Function: Create input file for clustalw program
834 Returns : name of file containing clustalw data input
835 Args : Seq or Align object reference or input file name
837 =cut
839 sub _setinput {
840 my ($self, $input, $suffix) = @_;
841 my ($infilename, $seq, $temp, $tfh);
843 # suffix is used to distinguish alignment files If $input is not a
844 # reference it better be the name of a file with the sequence/
846 # alignment data...
847 unless (ref $input) {
848 # check that file exists or throw
849 $infilename = $input;
850 return unless -e $input;
851 return $infilename;
854 # $input may be an array of BioSeq objects...
855 if (ref($input) eq "ARRAY") {
856 # Open temporary file for both reading & writing of BioSeq array
857 ($tfh,$infilename) = $self->io->tempfile(-dir=>$self->tempdir);
858 $temp = Bio::SeqIO->new('-fh'=>$tfh, '-format' =>'Fasta');
860 # Need at least 2 seqs for alignment
861 return unless (scalar(@$input) > 1);
863 foreach $seq (@$input) {
864 return unless (defined $seq && $seq->isa("Bio::PrimarySeqI") and $seq->id());
865 $temp->write_seq($seq);
867 $temp->close();
868 close($tfh);
869 undef $tfh;
870 return $infilename;
873 # $input may be a SimpleAlign object.
874 elsif (ref($input) eq "Bio::SimpleAlign") {
875 # Open temporary file for both reading & writing of SimpleAlign object
876 ($tfh,$infilename) = $self->io->tempfile(-dir=>$self->tempdir);
877 $temp = Bio::AlignIO->new('-fh'=> $tfh, '-format' => 'fasta');
878 $temp->write_aln($input);
879 close($tfh);
880 undef $tfh;
881 return $infilename;
884 # or $input may be a single BioSeq object (to be added to a previous alignment)
885 elsif (ref($input) && $input->isa("Bio::PrimarySeqI") && $suffix==2) {
886 # Open temporary file for both reading & writing of BioSeq object
887 ($tfh,$infilename) = $self->io->tempfile();
888 $temp = Bio::SeqIO->new(-fh=> $tfh, '-format' =>'Fasta');
889 $temp->write_seq($input);
890 close($tfh);
891 undef $tfh;
892 return $infilename;
895 return;
898 =head2 _setparams()
900 Title : _setparams
901 Usage : Internal function, not to be called directly
902 Function: Create parameter inputs for clustalw program
903 Returns : parameter string to be passed to clustalw
904 during align or profile_align
905 Args : name of calling object
907 =cut
909 sub _setparams {
910 my $self = shift;
912 my $param_string = $self->SUPER::_setparams(-params => \@CLUSTALW_PARAMS,
913 -switches => \@CLUSTALW_SWITCHES,
914 -dash => 1,
915 -lc => 1,
916 -join => '=');
918 # Set default output file if no explicit output file selected
919 unless ($param_string =~ /outfile/) {
920 my ($tfh, $outfile) = $self->io->tempfile(-dir => $self->tempdir());
921 close($tfh);
922 undef $tfh;
923 $self->outfile($outfile);
924 $param_string .= " -outfile=\"$outfile\"" ;
927 $param_string .= ' 2>&1';
929 return $param_string;