perltidy; fixes per Peter Rice, re: duplicated accessions in the index and unmunged...
[bioperl-live.git] / Bio / AlignIO.pm
blob33dfe448bed707f8676f919158a5fc3d8414a468
2 # BioPerl module for Bio::AlignIO
4 # based on the Bio::SeqIO module
5 # by Ewan Birney <birney@ebi.ac.uk>
6 # and Lincoln Stein <lstein@cshl.org>
8 # Copyright Peter Schattner
10 # You may distribute this module under the same terms as perl itself
12 # History
13 # September, 2000 AlignIO written by Peter Schattner
15 # POD documentation - main docs before the code
17 =head1 NAME
19 Bio::AlignIO - Handler for AlignIO Formats
21 =head1 SYNOPSIS
23 use Bio::AlignIO;
25 $inputfilename = "testaln.fasta";
26 $in = Bio::AlignIO->new(-file => $inputfilename ,
27 -format => 'fasta');
28 $out = Bio::AlignIO->new(-file => ">out.aln.pfam" ,
29 -format => 'pfam');
31 while ( my $aln = $in->next_aln() ) {
32 $out->write_aln($aln);
35 # OR
37 use Bio::AlignIO;
39 open MYIN,"testaln.fasta";
40 $in = Bio::AlignIO->newFh(-fh => \*MYIN,
41 -format => 'fasta');
42 open my $MYOUT, '>', 'testaln.pfam';
43 $out = Bio::AlignIO->newFh(-fh => $MYOUT,
44 -format => 'pfam');
46 # World's smallest Fasta<->pfam format converter:
47 print $out $_ while <$in>;
49 =head1 DESCRIPTION
51 L<Bio::AlignIO> is a handler module for the formats in the AlignIO set,
52 for example, L<Bio::AlignIO::fasta>. It is the officially sanctioned way
53 of getting at the alignment objects. The resulting alignment is a
54 L<Bio::Align::AlignI>-compliant object.
56 The idea is that you request an object for a particular format.
57 All the objects have a notion of an internal file that is read
58 from or written to. A particular AlignIO object instance is configured
59 for either input or output, you can think of it as a stream object.
61 Each object has functions:
63 $stream->next_aln();
65 And:
67 $stream->write_aln($aln);
69 Also:
71 $stream->type() # returns 'INPUT' or 'OUTPUT'
73 As an added bonus, you can recover a filehandle that is tied to the
74 AlignIO object, allowing you to use the standard E<lt>E<gt> and print
75 operations to read and write alignment objects:
77 use Bio::AlignIO;
79 # read from standard input
80 $stream = Bio::AlignIO->newFh(-format => 'Fasta');
82 while ( $aln = <$stream> ) {
83 # do something with $aln
86 And:
88 print $stream $aln; # when stream is in output mode
90 L<Bio::AlignIO> is patterned on the L<Bio::SeqIO> module and shares
91 most of its features. One significant difference is that
92 L<Bio::AlignIO> usually handles IO for only a single alignment at a time,
93 whereas L<Bio::SeqIO> handles IO for multiple sequences in a single stream.
94 The principal reason for this is that whereas simultaneously handling
95 multiple sequences is a common requirement, simultaneous handling of
96 multiple alignments is not. The only current exception is format
97 C<bl2seq> which parses results of the BLAST C<bl2seq> program and which
98 may produce several alignment pairs. This set of alignment pairs can
99 be read using multiple calls to L<next_aln>.
101 =head1 CONSTRUCTORS
103 =head2 Bio::AlignIO-E<gt>new()
105 $seqIO = Bio::AlignIO->new(-file => 'filename', -format=>$format);
106 $seqIO = Bio::AlignIO->new(-fh => \*FILEHANDLE, -format=>$format);
107 $seqIO = Bio::AlignIO->new(-format => $format);
108 $seqIO = Bio::AlignIO->new(-fh => \*STDOUT, -format => $format);
110 The L<new> class method constructs a new L<Bio::AlignIO> object.
111 The returned object can be used to retrieve or print alignment
112 objects. L<new> accepts the following parameters:
114 =over 4
116 =item -file
118 A file path to be opened for reading or writing. The usual Perl
119 conventions apply:
121 'file' # open file for reading
122 '>file' # open file for writing
123 '>>file' # open file for appending
124 '+<file' # open file read/write
125 'command |' # open a pipe from the command
126 '| command' # open a pipe to the command
128 =item -fh
130 You may provide new() with a previously-opened filehandle. For
131 example, to read from STDIN:
133 $seqIO = Bio::AlignIO->new(-fh => \*STDIN);
135 Note that you must pass filehandles as references to globs.
137 If neither a filehandle nor a filename is specified, then the module
138 will read from the @ARGV array or STDIN, using the familiar E<lt>E<gt>
139 semantics.
141 =item -format
143 Specify the format of the file. Supported formats include:
145 bl2seq Bl2seq Blast output
146 clustalw clustalw (.aln) format
147 emboss EMBOSS water and needle format
148 fasta FASTA format
149 maf Multiple Alignment Format
150 mase mase (seaview) format
151 mega MEGA format
152 meme MEME format
153 msf msf (GCG) format
154 nexus Swofford et al NEXUS format
155 pfam Pfam sequence alignment format
156 phylip Felsenstein PHYLIP format
157 prodom prodom (protein domain) format
158 psi PSI-BLAST format
159 selex selex (hmmer) format
160 stockholm stockholm format
162 Currently only those formats which were implemented in L<Bio::SimpleAlign>
163 have been incorporated into L<Bio::AlignIO>. Specifically, C<mase>, C<stockholm>
164 and C<prodom> have only been implemented for input. See the specific module
165 (e.g. L<Bio::AlignIO::prodom>) for notes on supported versions.
167 If no format is specified and a filename is given, then the module
168 will attempt to deduce it from the filename suffix. If this is unsuccessful,
169 C<fasta> format is assumed.
171 The format name is case insensitive; C<FASTA>, C<Fasta> and C<fasta> are
172 all treated equivalently.
174 =back
176 =head2 Bio::AlignIO-E<gt>newFh()
178 $fh = Bio::AlignIO->newFh(-fh => \*FILEHANDLE, -format=>$format);
179 # read from STDIN or use @ARGV:
180 $fh = Bio::AlignIO->newFh(-format => $format);
182 This constructor behaves like L<new>, but returns a tied filehandle
183 rather than a L<Bio::AlignIO> object. You can read sequences from this
184 object using the familiar E<lt>E<gt> operator, and write to it using
185 L<print>. The usual array and $_ semantics work. For example, you can
186 read all sequence objects into an array like this:
188 @sequences = <$fh>;
190 Other operations, such as read(), sysread(), write(), close(), and printf()
191 are not supported.
193 =over 1
195 =item -flush
197 By default, all files (or filehandles) opened for writing alignments
198 will be flushed after each write_aln() making the file immediately
199 usable. If you do not need this facility and would like to marginally
200 improve the efficiency of writing multiple sequences to the same file
201 (or filehandle), pass the -flush option '0' or any other value that
202 evaluates as defined but false:
204 my $clustal = Bio::AlignIO->new( -file => "<prot.aln",
205 -format => "clustalw" );
206 my $msf = Bio::AlignIO->new(-file => ">prot.msf",
207 -format => "msf",
208 -flush => 0 ); # go as fast as we can!
209 while($seq = $clustal->next_aln) { $msf->write_aln($seq) }
211 =back
213 =head1 OBJECT METHODS
215 See below for more detailed summaries. The main methods are:
217 =head2 $alignment = $AlignIO-E<gt>next_aln()
219 Fetch an alignment from a formatted file.
221 =head2 $AlignIO-E<gt>write_aln($aln)
223 Write the specified alignment to a file..
225 =head2 TIEHANDLE(), READLINE(), PRINT()
227 These provide the tie interface. See L<perltie> for more details.
229 =head1 FEEDBACK
231 =head2 Mailing Lists
233 User feedback is an integral part of the evolution of this and other
234 Bioperl modules. Send your comments and suggestions preferably to one
235 of the Bioperl mailing lists. Your participation is much appreciated.
237 bioperl-l@bioperl.org - General discussion
238 http://bioperl.org/wiki/Mailing_lists - About the mailing lists
240 =head2 Support
242 Please direct usage questions or support issues to the mailing list:
244 I<bioperl-l@bioperl.org>
246 rather than to the module maintainer directly. Many experienced and
247 reponsive experts will be able look at the problem and quickly
248 address it. Please include a thorough description of the problem
249 with code and data examples if at all possible.
251 =head2 Reporting Bugs
253 Report bugs to the Bioperl bug tracking system to help us keep track
254 the bugs and their resolution. Bug reports can be submitted via the
255 web:
257 https://redmine.open-bio.org/projects/bioperl/
259 =head1 AUTHOR - Peter Schattner
261 Email: schattner@alum.mit.edu
263 =head1 CONTRIBUTORS
265 Jason Stajich, jason@bioperl.org
267 =head1 APPENDIX
269 The rest of the documentation details each of the object
270 methods. Internal methods are usually preceded with a _
272 =cut
274 # 'Let the code begin...
276 package Bio::AlignIO;
278 use strict;
280 use Bio::Seq;
281 use Bio::LocatableSeq;
282 use Bio::SimpleAlign;
283 use Bio::Tools::GuessSeqFormat;
284 use base qw(Bio::Root::Root Bio::Root::IO);
286 =head2 new
288 Title : new
289 Usage : $stream = Bio::AlignIO->new(-file => $filename,
290 -format => 'Format')
291 Function: Returns a new seqstream
292 Returns : A Bio::AlignIO::Handler initialised with
293 the appropriate format
294 Args : -file => $filename
295 -format => format
296 -fh => filehandle to attach to
297 -displayname_flat => 1 [optional]
298 to force the displayname to not show start/end
299 information
301 =cut
303 sub new {
304 my ($caller,@args) = @_;
305 my $class = ref($caller) || $caller;
307 # or do we want to call SUPER on an object if $caller is an
308 # object?
309 if( $class =~ /Bio::AlignIO::(\S+)/ ) {
310 my ($self) = $class->SUPER::new(@args);
311 $self->_initialize(@args);
312 return $self;
313 } else {
315 my %param = @args;
316 @param{ map { lc $_ } keys %param } = values %param; # lowercase keys
317 my $format = $param{'-format'} ||
318 $class->_guess_format( $param{-file} || $ARGV[0] );
319 unless ($format) {
320 if ($param{-file}) {
321 $format = Bio::Tools::GuessSeqFormat->new(-file => $param{-file}||$ARGV[0] )->guess;
323 elsif ($param{-fh}) {
324 $format = Bio::Tools::GuessSeqFormat->new(-fh => $param{-fh}||$ARGV[0] )->guess;
327 $format = "\L$format"; # normalize capitalization to lower case
328 $class->throw("Unknown format given or could not determine it [$format]")
329 unless $format;
331 return unless( $class->_load_format_module($format) );
332 return "Bio::AlignIO::$format"->new(@args);
337 =head2 newFh
339 Title : newFh
340 Usage : $fh = Bio::AlignIO->newFh(-file=>$filename,-format=>'Format')
341 Function: does a new() followed by an fh()
342 Example : $fh = Bio::AlignIO->newFh(-file=>$filename,-format=>'Format')
343 $sequence = <$fh>; # read a sequence object
344 print $fh $sequence; # write a sequence object
345 Returns : filehandle tied to the Bio::AlignIO::Fh class
346 Args :
348 =cut
350 sub newFh {
351 my $class = shift;
352 return unless my $self = $class->new(@_);
353 return $self->fh;
356 =head2 fh
358 Title : fh
359 Usage : $obj->fh
360 Function:
361 Example : $fh = $obj->fh; # make a tied filehandle
362 $sequence = <$fh>; # read a sequence object
363 print $fh $sequence; # write a sequence object
364 Returns : filehandle tied to the Bio::AlignIO::Fh class
365 Args :
367 =cut
370 sub fh {
371 my $self = shift;
372 my $class = ref($self) || $self;
373 my $s = Symbol::gensym;
374 tie $$s,$class,$self;
375 return $s;
379 =head2 format
381 Title : format
382 Usage : $format = $stream->format()
383 Function: Get the alignment format
384 Returns : alignment format
385 Args : none
387 =cut
389 # format() method inherited from Bio::Root::IO
392 # _initialize is where the heavy stuff will happen when new is called
394 sub _initialize {
395 my($self,@args) = @_;
396 my ($flat,$alphabet) = $self->_rearrange([qw(DISPLAYNAME_FLAT ALPHABET)],
397 @args);
398 $self->force_displayname_flat($flat) if defined $flat;
399 $self->alphabet($alphabet);
400 $self->_initialize_io(@args);
404 =head2 _load_format_module
406 Title : _load_format_module
407 Usage : *INTERNAL AlignIO stuff*
408 Function: Loads up (like use) a module at run time on demand
409 Example :
410 Returns :
411 Args :
413 =cut
415 sub _load_format_module {
416 my ($self,$format) = @_;
417 my $module = "Bio::AlignIO::" . $format;
418 my $ok;
420 eval {
421 $ok = $self->_load_module($module);
423 if ( $@ ) {
424 print STDERR <<END;
425 $self: $format cannot be found
426 Exception $@
427 For more information about the AlignIO system please see the AlignIO docs.
428 This includes ways of checking for formats at compile time, not run time
431 return;
433 return 1;
436 =head2 next_aln
438 Title : next_aln
439 Usage : $aln = stream->next_aln
440 Function: reads the next $aln object from the stream
441 Returns : a Bio::Align::AlignI compliant object
442 Args :
444 =cut
446 sub next_aln {
447 my ($self,$aln) = @_;
448 $self->throw("Sorry, you cannot read from a generic Bio::AlignIO object.");
451 =head2 write_aln
453 Title : write_aln
454 Usage : $stream->write_aln($aln)
455 Function: writes the $aln object into the stream
456 Returns : 1 for success and 0 for error
457 Args : Bio::Seq object
459 =cut
461 sub write_aln {
462 my ($self,$aln) = @_;
463 $self->throw("Sorry, you cannot write to a generic Bio::AlignIO object.");
466 =head2 _guess_format
468 Title : _guess_format
469 Usage : $obj->_guess_format($filename)
470 Function:
471 Example :
472 Returns : guessed format of filename (lower case)
473 Args :
475 =cut
477 sub _guess_format {
478 my $class = shift;
479 return unless $_ = shift;
480 return 'clustalw' if /\.aln$/i;
481 return 'emboss' if /\.(water|needle)$/i;
482 return 'metafasta' if /\.metafasta$/;
483 return 'fasta' if /\.(fasta|fast|seq|fa|fsa|nt|aa)$/i;
484 return 'maf' if /\.maf/i;
485 return 'mega' if /\.(meg|mega)$/i;
486 return 'meme' if /\.meme$/i;
487 return 'msf' if /\.(msf|pileup|gcg)$/i;
488 return 'nexus' if /\.(nexus|nex)$/i;
489 return 'pfam' if /\.(pfam|pfm)$/i;
490 return 'phylip' if /\.(phylip|phlp|phyl|phy|ph)$/i;
491 return 'psi' if /\.psi$/i;
492 return 'stockholm' if /\.stk$/i;
493 return 'selex' if /\.(selex|slx|selx|slex|sx)$/i;
494 return 'xmfa' if /\.xmfa$/i;
497 sub DESTROY {
498 my $self = shift;
499 $self->close();
502 sub TIEHANDLE {
503 my $class = shift;
504 return bless {'alignio' => shift},$class;
507 sub READLINE {
508 my $self = shift;
509 return $self->{'alignio'}->next_aln() unless wantarray;
510 my (@list,$obj);
511 push @list,$obj while $obj = $self->{'alignio'}->next_aln();
512 return @list;
515 sub PRINT {
516 my $self = shift;
517 $self->{'alignio'}->write_aln(@_);
521 =head2 force_displayname_flat
523 Title : force_displayname_flat
524 Usage : $obj->force_displayname_flat($newval)
525 Function:
526 Example :
527 Returns : value of force_displayname_flat (a scalar)
528 Args : on set, new value (a scalar or undef, optional)
531 =cut
533 sub force_displayname_flat{
534 my $self = shift;
535 return $self->{'_force_displayname_flat'} = shift if @_;
536 return $self->{'_force_displayname_flat'} || 0;
539 =head2 alphabet
541 Title : alphabet
542 Usage : $obj->alphabet($newval)
543 Function: Get/Set alphabet for purpose of passing to Bio::LocatableSeq creation
544 Example : $obj->alphabet('dna');
545 Returns : value of alphabet (a scalar)
546 Args : on set, new value (a scalar or undef, optional)
549 =cut
551 sub alphabet {
552 my $self = shift;
553 my $value = shift;
554 if ( defined $value ) {
555 $self->throw("Invalid alphabet $value") unless $value eq 'rna' || $value eq 'protein' || $value eq 'dna';
556 $self->{'_alphabet'} = $value;
558 return $self->{'_alphabet'};