* sync with trunk
[bioperl-live.git] / Bio / SeqIO.pm
blob5623283119fff786138f26942e30ac04ece5d4b7
1 # $Id$
3 # BioPerl module for Bio::SeqIO
5 # Cared for by Ewan Birney <birney@ebi.ac.uk>
6 # and Lincoln Stein <lstein@cshl.org>
8 # Copyright Ewan Birney
10 # You may distribute this module under the same terms as perl itself
12 # _history
13 # October 18, 1999 Largely rewritten by Lincoln Stein
15 # POD documentation - main docs before the code
17 =head1 NAME
19 Bio::SeqIO - Handler for SeqIO Formats
21 =head1 SYNOPSIS
23 use Bio::SeqIO;
25 $in = Bio::SeqIO->new(-file => "inputfilename" ,
26 -format => 'Fasta');
27 $out = Bio::SeqIO->new(-file => ">outputfilename" ,
28 -format => 'EMBL');
30 while ( my $seq = $in->next_seq() ) {
31 $out->write_seq($seq);
34 # Now, to actually get at the sequence object, use the standard Bio::Seq
35 # methods (look at Bio::Seq if you don't know what they are)
37 use Bio::SeqIO;
39 $in = Bio::SeqIO->new(-file => "inputfilename" ,
40 -format => 'genbank');
42 while ( my $seq = $in->next_seq() ) {
43 print "Sequence ",$seq->id, " first 10 bases ",
44 $seq->subseq(1,10), "\n";
48 # The SeqIO system does have a filehandle binding. Most people find this
49 # a little confusing, but it does mean you can write the world's
50 # smallest reformatter
52 use Bio::SeqIO;
54 $in = Bio::SeqIO->newFh(-file => "inputfilename" ,
55 -format => 'Fasta');
56 $out = Bio::SeqIO->newFh(-format => 'EMBL');
58 # World's shortest Fasta<->EMBL format converter:
59 print $out $_ while <$in>;
62 =head1 DESCRIPTION
64 Bio::SeqIO is a handler module for the formats in the SeqIO set (eg,
65 Bio::SeqIO::fasta). It is the officially sanctioned way of getting at
66 the format objects, which most people should use.
68 The Bio::SeqIO system can be thought of like biological file handles.
69 They are attached to filehandles with smart formatting rules (eg,
70 genbank format, or EMBL format, or binary trace file format) and
71 can either read or write sequence objects (Bio::Seq objects, or
72 more correctly, Bio::SeqI implementing objects, of which Bio::Seq is
73 one such object). If you want to know what to do with a Bio::Seq
74 object, read L<Bio::Seq>.
76 The idea is that you request a stream object for a particular format.
77 All the stream objects have a notion of an internal file that is read
78 from or written to. A particular SeqIO object instance is configured
79 for either input or output. A specific example of a stream object is
80 the Bio::SeqIO::fasta object.
82 Each stream object has functions
84 $stream->next_seq();
86 and
88 $stream->write_seq($seq);
90 As an added bonus, you can recover a filehandle that is tied to the
91 SeqIO object, allowing you to use the standard E<lt>E<gt> and print
92 operations to read and write sequence objects:
94 use Bio::SeqIO;
96 $stream = Bio::SeqIO->newFh(-format => 'Fasta',
97 -fh => \*ARGV);
98 # read from standard input or the input filenames
100 while ( $seq = <$stream> ) {
101 # do something with $seq
106 print $stream $seq; # when stream is in output mode
108 This makes the simplest ever reformatter
110 #!/usr/bin/perl
112 $format1 = shift;
113 $format2 = shift || die
114 "Usage: reformat format1 format2 < input > output";
116 use Bio::SeqIO;
118 $in = Bio::SeqIO->newFh(-format => $format1, -fh => \*ARGV );
119 $out = Bio::SeqIO->newFh(-format => $format2 );
120 # Note: you might want to quote -format to keep older
121 # perl's from complaining.
123 print $out $_ while <$in>;
126 =head1 CONSTRUCTORS
128 =head2 Bio::SeqIO-E<gt>new()
130 $seqIO = Bio::SeqIO->new(-file => 'filename', -format=>$format);
131 $seqIO = Bio::SeqIO->new(-fh => \*FILEHANDLE, -format=>$format);
132 $seqIO = Bio::SeqIO->new(-format => $format);
134 The new() class method constructs a new Bio::SeqIO object. The
135 returned object can be used to retrieve or print Seq objects. new()
136 accepts the following parameters:
138 =over 5
140 =item -file
142 A file path to be opened for reading or writing. The usual Perl
143 conventions apply:
145 'file' # open file for reading
146 '>file' # open file for writing
147 '>>file' # open file for appending
148 '+<file' # open file read/write
149 'command |' # open a pipe from the command
150 '| command' # open a pipe to the command
152 =item -fh
154 You may provide new() with a previously-opened filehandle. For
155 example, to read from STDIN:
157 $seqIO = Bio::SeqIO->new(-fh => \*STDIN);
159 Note that you must pass filehandles as references to globs.
161 If neither a filehandle nor a filename is specified, then the module
162 will read from the @ARGV array or STDIN, using the familiar E<lt>E<gt>
163 semantics.
165 A string filehandle is handy if you want to modify the output in the
166 memory, before printing it out. The following program reads in EMBL
167 formatted entries from a file and prints them out in fasta format with
168 some HTML tags:
170 use Bio::SeqIO;
171 use IO::String;
172 my $in = Bio::SeqIO->new(-file => "emblfile",
173 -format => 'EMBL');
174 while ( my $seq = $in->next_seq() ) {
175 # the output handle is reset for every file
176 my $stringio = IO::String->new($string);
177 my $out = Bio::SeqIO->new(-fh => $stringio,
178 -format => 'fasta');
179 # output goes into $string
180 $out->write_seq($seq);
181 # modify $string
182 $string =~ s|(>)(\w+)|$1<font color="Red">$2</font>|g;
183 # print into STDOUT
184 print $string;
187 =item -format
189 Specify the format of the file. Supported formats include fasta,
190 genbank, embl, swiss (SwissProt), Entrez Gene and tracefile formats
191 such as abi (ABI) and scf. There are many more, for a complete listing
192 see the SeqIO HOWTO (L<http://bioperl.open-bio.org/wiki/HOWTO:SeqIO>).
194 If no format is specified and a filename is given then the module will
195 attempt to deduce the format from the filename suffix. If there is no
196 suffix that Bioperl understands then it will attempt to guess the
197 format based on file content. If this is unsuccessful then Fasta
198 format is assumed.
200 The format name is case-insensitive: 'FASTA', 'Fasta' and 'fasta' are
201 all valid.
203 Currently, the tracefile formats (except for SCF) require installation
204 of the external Staden "io_lib" package, as well as the
205 Bio::SeqIO::staden::read package available from the bioperl-ext
206 repository.
208 =item -alphabet
210 Sets the alphabet ('dna', 'rna', or 'protein'). When the alphabet is
211 set then Bioperl will not attempt to guess what the alphabet is. This
212 may be important because Bioperl does not always guess correctly.
214 =item -flush
216 By default, all files (or filehandles) opened for writing sequences
217 will be flushed after each write_seq() (making the file immediately
218 usable). If you do not need this facility and would like to marginally
219 improve the efficiency of writing multiple sequences to the same file
220 (or filehandle), pass the -flush option '0' or any other value that
221 evaluates as defined but false:
223 my $gb = Bio::SeqIO->new(-file => "<gball.gbk",
224 -format => "gb");
225 my $fa = Bio::SeqIO->new(-file => ">gball.fa",
226 -format => "fasta",
227 -flush => 0); # go as fast as we can!
228 while($seq = $gb->next_seq) { $fa->write_seq($seq) }
231 =back
233 =head2 Bio::SeqIO-E<gt>newFh()
235 $fh = Bio::SeqIO->newFh(-fh => \*FILEHANDLE, -format=>$format);
236 $fh = Bio::SeqIO->newFh(-format => $format);
237 # etc.
239 This constructor behaves like new(), but returns a tied filehandle
240 rather than a Bio::SeqIO object. You can read sequences from this
241 object using the familiar E<lt>E<gt> operator, and write to it using
242 print(). The usual array and $_ semantics work. For example, you can
243 read all sequence objects into an array like this:
245 @sequences = <$fh>;
247 Other operations, such as read(), sysread(), write(), close(), and
248 printf() are not supported.
250 =head1 OBJECT METHODS
252 See below for more detailed summaries. The main methods are:
254 =head2 $sequence = $seqIO-E<gt>next_seq()
256 Fetch the next sequence from the stream.
258 =head2 $seqIO-E<gt>write_seq($sequence [,$another_sequence,...])
260 Write the specified sequence(s) to the stream.
262 =head2 TIEHANDLE(), READLINE(), PRINT()
264 These provide the tie interface. See L<perltie> for more details.
266 =head1 FEEDBACK
268 =head2 Mailing Lists
270 User feedback is an integral part of the evolution of this and other
271 Bioperl modules. Send your comments and suggestions preferably to one
272 of the Bioperl mailing lists.
274 Your participation is much appreciated.
276 bioperl-l@bioperl.org - General discussion
277 http://bioperl.org/wiki/Mailing_lists - About the mailing lists
279 =head2 Reporting Bugs
281 Report bugs to the Bioperl bug tracking system to help us keep track
282 the bugs and their resolution. Bug reports can be submitted via the
283 web:
285 http://bugzilla.open-bio.org/
287 =head1 AUTHOR - Ewan Birney, Lincoln Stein
289 Email birney@ebi.ac.uk
290 lstein@cshl.org
292 =head1 APPENDIX
294 The rest of the documentation details each of the object
295 methods. Internal methods are usually preceded with a _
297 =cut
299 #' Let the code begin...
301 package Bio::SeqIO;
303 use strict;
305 use Bio::Factory::FTLocationFactory;
306 use Bio::Seq::SeqBuilder;
307 use Bio::Tools::GuessSeqFormat;
308 use Symbol;
310 use base qw(Bio::Root::Root Bio::Root::IO Bio::Factory::SequenceStreamI);
312 my %valid_alphabet_cache;
314 =head2 new
316 Title : new
317 Usage : $stream = Bio::SeqIO->new(-file => $filename,
318 -format => 'Format')
319 Function: Returns a new sequence stream
320 Returns : A Bio::SeqIO stream initialised with the appropriate format
321 Args : Named parameters:
322 -file => $filename
323 -fh => filehandle to attach to
324 -format => format
326 Additional arguments may be used to set factories and
327 builders involved in the sequence object creation. None of
328 these must be provided, they all have reasonable defaults.
329 -seqfactory the Bio::Factory::SequenceFactoryI object
330 -locfactory the Bio::Factory::LocationFactoryI object
331 -objbuilder the Bio::Factory::ObjectBuilderI object
333 See L<Bio::SeqIO::Handler>
335 =cut
337 my $entry = 0;
339 sub new {
340 my ($caller,@args) = @_;
341 my $class = ref($caller) || $caller;
343 # or do we want to call SUPER on an object if $caller is an
344 # object?
345 if( $class =~ /Bio::SeqIO::(\S+)/ ) {
346 my ($self) = $class->SUPER::new(@args);
347 $self->_initialize(@args);
348 return $self;
349 } else {
351 my %param = @args;
352 @param{ map { lc $_ } keys %param } = values %param; # lowercase keys
354 if (!defined($param{-file}) && !defined($param{-fh})) {
355 $class->throw("file argument provided, but with an undefined value") if exists($param{'-file'});
356 $class->throw("fh argument provided, but with an undefined value") if (exists($param{'-fh'}));
359 my $format = $param{'-format'} ||
360 $class->_guess_format( $param{-file} || $ARGV[0] );
362 if( ! $format ) {
363 if ($param{-file}) {
364 $format = Bio::Tools::GuessSeqFormat->new(-file => $param{-file}||$ARGV[0] )->guess;
365 } elsif ($param{-fh}) {
366 $format = Bio::Tools::GuessSeqFormat->new(-fh => $param{-fh}||$ARGV[0] )->guess;
369 $format = "\L$format"; # normalize capitalization to lower case
370 $class->throw("Unknown format given or could not determine it [$format]")
371 unless $format;
372 return unless( $class->_load_format_module($format) );
373 return "Bio::SeqIO::$format"->new(@args);
377 =head2 newFh
379 Title : newFh
380 Usage : $fh = Bio::SeqIO->newFh(-file=>$filename,-format=>'Format')
381 Function: does a new() followed by an fh()
382 Example : $fh = Bio::SeqIO->newFh(-file=>$filename,-format=>'Format')
383 $sequence = <$fh>; # read a sequence object
384 print $fh $sequence; # write a sequence object
385 Returns : filehandle tied to the Bio::SeqIO::Fh class
386 Args :
388 See L<Bio::SeqIO::Fh>
390 =cut
392 sub newFh {
393 my $class = shift;
394 return unless my $self = $class->new(@_);
395 return $self->fh;
398 =head2 fh
400 Title : fh
401 Usage : $obj->fh
402 Function:
403 Example : $fh = $obj->fh; # make a tied filehandle
404 $sequence = <$fh>; # read a sequence object
405 print $fh $sequence; # write a sequence object
406 Returns : filehandle tied to Bio::SeqIO class
407 Args : none
409 =cut
412 sub fh {
413 my $self = shift;
414 my $class = ref($self) || $self;
415 my $s = Symbol::gensym;
416 tie $$s,$class,$self;
417 return $s;
420 # _initialize is chained for all SeqIO classes
422 sub _initialize {
423 my($self, @args) = @_;
425 # flush is initialized by the Root::IO init
427 my ($seqfact,$locfact,$objbuilder, $alphabet) =
428 $self->_rearrange([qw(SEQFACTORY
429 LOCFACTORY
430 OBJBUILDER
431 ALPHABET)
432 ], @args);
434 $locfact = Bio::Factory::FTLocationFactory->new(-verbose => $self->verbose)
435 if ! $locfact;
436 $objbuilder = Bio::Seq::SeqBuilder->new(-verbose => $self->verbose)
437 unless $objbuilder;
438 $self->sequence_builder($objbuilder);
439 $self->location_factory($locfact);
441 # note that this should come last because it propagates the sequence
442 # factory to the sequence builder
443 $seqfact && $self->sequence_factory($seqfact);
445 #bug 2160
446 $alphabet && $self->alphabet($alphabet);
449 # initialize the IO part
450 $self->_initialize_io(@args);
453 =head2 next_seq
455 Title : next_seq
456 Usage : $seq = stream->next_seq
457 Function: Reads the next sequence object from the stream and returns it.
459 Certain driver modules may encounter entries in the stream
460 that are either misformatted or that use syntax not yet
461 understood by the driver. If such an incident is
462 recoverable, e.g., by dismissing a feature of a feature
463 table or some other non-mandatory part of an entry, the
464 driver will issue a warning. In the case of a
465 non-recoverable situation an exception will be thrown. Do
466 not assume that you can resume parsing the same stream
467 after catching the exception. Note that you can always turn
468 recoverable errors into exceptions by calling
469 $stream->verbose(2).
471 Returns : a Bio::Seq sequence object
472 Args : none
474 See L<Bio::Root::RootI>, L<Bio::Factory::SeqStreamI>, L<Bio::Seq>
476 =cut
478 sub next_seq {
479 my ($self, $seq) = @_;
480 $self->throw("Sorry, you cannot read from a generic Bio::SeqIO object.");
483 =head2 write_seq
485 Title : write_seq
486 Usage : $stream->write_seq($seq)
487 Function: writes the $seq object into the stream
488 Returns : 1 for success and 0 for error
489 Args : Bio::Seq object
491 =cut
493 sub write_seq {
494 my ($self, $seq) = @_;
495 $self->throw("Sorry, you cannot write to a generic Bio::SeqIO object.");
499 =head2 alphabet
501 Title : alphabet
502 Usage : $self->alphabet($newval)
503 Function: Set/get the molecule type for the Seq objects to be created.
504 Example : $seqio->alphabet('protein')
505 Returns : value of alphabet: 'dna', 'rna', or 'protein'
506 Args : newvalue (optional)
507 Throws : Exception if the argument is not one of 'dna', 'rna', or 'protein'
509 =cut
511 sub alphabet {
512 my ($self, $value) = @_;
514 if ( defined $value) {
515 $value = lc $value;
516 unless ($valid_alphabet_cache{$value}) {
517 # instead of hard-coding the allowed values once more, we check by
518 # creating a dummy sequence object
519 eval {
520 require Bio::PrimarySeq;
521 my $seq = Bio::PrimarySeq->new('-verbose' => $self->verbose,
522 '-alphabet' => $value);
524 if ($@) {
525 $self->throw("Invalid alphabet: $value\n. See Bio::PrimarySeq for allowed values.");
527 $valid_alphabet_cache{$value} = 1;
529 $self->{'alphabet'} = $value;
531 return $self->{'alphabet'};
534 =head2 _load_format_module
536 Title : _load_format_module
537 Usage : *INTERNAL SeqIO stuff*
538 Function: Loads up (like use) a module at run time on demand
539 Example :
540 Returns :
541 Args :
543 =cut
545 sub _load_format_module {
546 my ($self, $format) = @_;
547 my $module = "Bio::SeqIO::" . $format;
548 my $ok;
550 eval {
551 $ok = $self->_load_module($module);
553 if ( $@ ) {
554 print STDERR <<END;
555 $self: $format cannot be found
556 Exception $@
557 For more information about the SeqIO system please see the SeqIO docs.
558 This includes ways of checking for formats at compile time, not run time
562 return $ok;
565 =head2 _concatenate_lines
567 Title : _concatenate_lines
568 Usage : $s = _concatenate_lines($line, $continuation_line)
569 Function: Private. Concatenates two strings assuming that the second stems
570 from a continuation line of the first. Adds a space between both
571 unless the first ends with a dash.
573 Takes care of either arg being empty.
574 Example :
575 Returns : A string.
576 Args :
578 =cut
580 sub _concatenate_lines {
581 my ($self, $s1, $s2) = @_;
583 $s1 .= " " if($s1 && ($s1 !~ /-$/) && $s2);
584 return ($s1 ? $s1 : "") . ($s2 ? $s2 : "");
587 =head2 _filehandle
589 Title : _filehandle
590 Usage : $obj->_filehandle($newval)
591 Function: This method is deprecated. Call _fh() instead.
592 Example :
593 Returns : value of _filehandle
594 Args : newvalue (optional)
596 =cut
598 sub _filehandle {
599 my ($self,@args) = @_;
600 return $self->_fh(@args);
603 =head2 _guess_format
605 Title : _guess_format
606 Usage : $obj->_guess_format($filename)
607 Function: guess format based on file suffix
608 Example :
609 Returns : guessed format of filename (lower case)
610 Args :
611 Notes : formats that _filehandle() will guess include fasta,
612 genbank, scf, pir, embl, raw, gcg, ace, bsml, swissprot,
613 fastq and phd/phred
615 =cut
617 sub _guess_format {
618 my $class = shift;
619 return unless $_ = shift;
620 return 'abi' if /\.ab[i1]$/i;
621 return 'ace' if /\.ace$/i;
622 return 'alf' if /\.alf$/i;
623 return 'bsml' if /\.(bsm|bsml)$/i;
624 return 'ctf' if /\.ctf$/i;
625 return 'embl' if /\.(embl|ebl|emb|dat)$/i;
626 return 'entrezgene' if /\.asn$/i;
627 return 'exp' if /\.exp$/i;
628 return 'fasta' if /\.(fasta|fast|fas|seq|fa|fsa|nt|aa|fna|faa)$/i;
629 return 'fastq' if /\.fastq$/i;
630 return 'gcg' if /\.gcg$/i;
631 return 'genbank' if /\.(gb|gbank|genbank|gbk|gbs)$/i;
632 return 'phd' if /\.(phd|phred)$/i;
633 return 'pir' if /\.pir$/i;
634 return 'pln' if /\.pln$/i;
635 return 'raw' if /\.(txt)$/i;
636 return 'scf' if /\.scf$/i;
637 return 'swiss' if /\.(swiss|sp)$/i;
639 # from Strider 1.4 Release Notes: The file name extensions used by
640 # Strider 1.4 are ".xdna", ".xdgn", ".xrna" and ".xprt" for DNA,
641 # DNA Degenerate, RNA and Protein Sequence Files, respectively
642 return 'strider' if /\.(xdna|xdgn|xrna|xprt)$/i;
644 return 'ztr' if /\.ztr$/i;
647 sub DESTROY {
648 my $self = shift;
649 $self->close();
652 sub TIEHANDLE {
653 my ($class,$val) = @_;
654 return bless {'seqio' => $val}, $class;
657 sub READLINE {
658 my $self = shift;
659 return $self->{'seqio'}->next_seq() unless wantarray;
660 my (@list, $obj);
661 push @list, $obj while $obj = $self->{'seqio'}->next_seq();
662 return @list;
665 sub PRINT {
666 my $self = shift;
667 $self->{'seqio'}->write_seq(@_);
670 =head2 sequence_factory
672 Title : sequence_factory
673 Usage : $seqio->sequence_factory($seqfactory)
674 Function: Get/Set the Bio::Factory::SequenceFactoryI
675 Returns : Bio::Factory::SequenceFactoryI
676 Args : [optional] Bio::Factory::SequenceFactoryI
678 =cut
680 sub sequence_factory{
681 my ($self,$obj) = @_;
682 if( defined $obj ) {
683 if( ! ref($obj) || ! $obj->isa('Bio::Factory::SequenceFactoryI') ) {
684 $self->throw("Must provide a valid Bio::Factory::SequenceFactoryI object to ".ref($self)."::sequence_factory()");
686 $self->{'_seqio_seqfactory'} = $obj;
687 my $builder = $self->sequence_builder();
688 if($builder && $builder->can('sequence_factory') &&
689 (! $builder->sequence_factory())) {
690 $builder->sequence_factory($obj);
693 $self->{'_seqio_seqfactory'};
696 =head2 object_factory
698 Title : object_factory
699 Usage : $obj->object_factory($newval)
700 Function: This is an alias to sequence_factory with a more generic name.
701 Example :
702 Returns : value of object_factory (a scalar)
703 Args : on set, new value (a scalar or undef, optional)
705 =cut
707 sub object_factory{
708 return shift->sequence_factory(@_);
711 =head2 sequence_builder
713 Title : sequence_builder
714 Usage : $seqio->sequence_builder($seqfactory)
715 Function: Get/Set the Bio::Factory::ObjectBuilderI used to build sequence
716 objects.
718 If you do not set the sequence object builder yourself, it
719 will in fact be an instance of L<Bio::Seq::SeqBuilder>, and
720 you may use all methods documented there to configure it.
722 Returns : a Bio::Factory::ObjectBuilderI compliant object
723 Args : [optional] a Bio::Factory::ObjectBuilderI compliant object
725 =cut
727 sub sequence_builder{
728 my ($self,$obj) = @_;
729 if( defined $obj ) {
730 if( ! ref($obj) || ! $obj->isa('Bio::Factory::ObjectBuilderI') ) {
731 $self->throw("Must provide a valid Bio::Factory::ObjectBuilderI object to ".ref($self)."::sequence_builder()");
733 $self->{'_object_builder'} = $obj;
735 $self->{'_object_builder'};
738 =head2 location_factory
740 Title : location_factory
741 Usage : $seqio->location_factory($locfactory)
742 Function: Get/Set the Bio::Factory::LocationFactoryI object to be used for
743 location string parsing
744 Returns : a Bio::Factory::LocationFactoryI implementing object
745 Args : [optional] on set, a Bio::Factory::LocationFactoryI implementing
746 object.
748 =cut
750 sub location_factory{
751 my ($self,$obj) = @_;
752 if( defined $obj ) {
753 if( ! ref($obj) || ! $obj->isa('Bio::Factory::LocationFactoryI') ) {
754 $self->throw("Must provide a valid Bio::Factory::LocationFactoryI" .
755 " object to ".ref($self)."->location_factory()");
757 $self->{'_seqio_locfactory'} = $obj;
759 $self->{'_seqio_locfactory'};