sync w/ main trunk
[bioperl-live.git] / Bio / SeqIO.pm
bloba8262d9bb6c0aafbf08d6237a8f7d6ae040f1de0
1 # $Id$
3 # BioPerl module for Bio::SeqIO
5 # Please direct questions and support issues to <bioperl-l@bioperl.org>
7 # Cared for by Ewan Birney <birney@ebi.ac.uk>
8 # and Lincoln Stein <lstein@cshl.org>
10 # Copyright Ewan Birney
12 # You may distribute this module under the same terms as perl itself
14 # _history
15 # October 18, 1999 Largely rewritten by Lincoln Stein
17 # POD documentation - main docs before the code
19 =head1 NAME
21 Bio::SeqIO - Handler for SeqIO Formats
23 =head1 SYNOPSIS
25 use Bio::SeqIO;
27 $in = Bio::SeqIO->new(-file => "inputfilename" ,
28 -format => 'Fasta');
29 $out = Bio::SeqIO->new(-file => ">outputfilename" ,
30 -format => 'EMBL');
32 while ( my $seq = $in->next_seq() ) {
33 $out->write_seq($seq);
36 # Now, to actually get at the sequence object, use the standard Bio::Seq
37 # methods (look at Bio::Seq if you don't know what they are)
39 use Bio::SeqIO;
41 $in = Bio::SeqIO->new(-file => "inputfilename" ,
42 -format => 'genbank');
44 while ( my $seq = $in->next_seq() ) {
45 print "Sequence ",$seq->id, " first 10 bases ",
46 $seq->subseq(1,10), "\n";
50 # The SeqIO system does have a filehandle binding. Most people find this
51 # a little confusing, but it does mean you can write the world's
52 # smallest reformatter
54 use Bio::SeqIO;
56 $in = Bio::SeqIO->newFh(-file => "inputfilename" ,
57 -format => 'Fasta');
58 $out = Bio::SeqIO->newFh(-format => 'EMBL');
60 # World's shortest Fasta<->EMBL format converter:
61 print $out $_ while <$in>;
64 =head1 DESCRIPTION
66 Bio::SeqIO is a handler module for the formats in the SeqIO set (eg,
67 Bio::SeqIO::fasta). It is the officially sanctioned way of getting at
68 the format objects, which most people should use.
70 The Bio::SeqIO system can be thought of like biological file handles.
71 They are attached to filehandles with smart formatting rules (eg,
72 genbank format, or EMBL format, or binary trace file format) and
73 can either read or write sequence objects (Bio::Seq objects, or
74 more correctly, Bio::SeqI implementing objects, of which Bio::Seq is
75 one such object). If you want to know what to do with a Bio::Seq
76 object, read L<Bio::Seq>.
78 The idea is that you request a stream object for a particular format.
79 All the stream objects have a notion of an internal file that is read
80 from or written to. A particular SeqIO object instance is configured
81 for either input or output. A specific example of a stream object is
82 the Bio::SeqIO::fasta object.
84 Each stream object has functions
86 $stream->next_seq();
88 and
90 $stream->write_seq($seq);
92 As an added bonus, you can recover a filehandle that is tied to the
93 SeqIO object, allowing you to use the standard E<lt>E<gt> and print
94 operations to read and write sequence objects:
96 use Bio::SeqIO;
98 $stream = Bio::SeqIO->newFh(-format => 'Fasta',
99 -fh => \*ARGV);
100 # read from standard input or the input filenames
102 while ( $seq = <$stream> ) {
103 # do something with $seq
108 print $stream $seq; # when stream is in output mode
110 This makes the simplest ever reformatter
112 #!/usr/bin/perl
114 $format1 = shift;
115 $format2 = shift || die
116 "Usage: reformat format1 format2 < input > output";
118 use Bio::SeqIO;
120 $in = Bio::SeqIO->newFh(-format => $format1, -fh => \*ARGV );
121 $out = Bio::SeqIO->newFh(-format => $format2 );
122 # Note: you might want to quote -format to keep older
123 # perl's from complaining.
125 print $out $_ while <$in>;
128 =head1 CONSTRUCTORS
130 =head2 Bio::SeqIO-E<gt>new()
132 $seqIO = Bio::SeqIO->new(-file => 'filename', -format=>$format);
133 $seqIO = Bio::SeqIO->new(-fh => \*FILEHANDLE, -format=>$format);
134 $seqIO = Bio::SeqIO->new(-format => $format);
136 The new() class method constructs a new Bio::SeqIO object. The
137 returned object can be used to retrieve or print Seq objects. new()
138 accepts the following parameters:
140 =over 5
142 =item -file
144 A file path to be opened for reading or writing. The usual Perl
145 conventions apply:
147 'file' # open file for reading
148 '>file' # open file for writing
149 '>>file' # open file for appending
150 '+<file' # open file read/write
151 'command |' # open a pipe from the command
152 '| command' # open a pipe to the command
154 =item -fh
156 You may provide new() with a previously-opened filehandle. For
157 example, to read from STDIN:
159 $seqIO = Bio::SeqIO->new(-fh => \*STDIN);
161 Note that you must pass filehandles as references to globs.
163 If neither a filehandle nor a filename is specified, then the module
164 will read from the @ARGV array or STDIN, using the familiar E<lt>E<gt>
165 semantics.
167 A string filehandle is handy if you want to modify the output in the
168 memory, before printing it out. The following program reads in EMBL
169 formatted entries from a file and prints them out in fasta format with
170 some HTML tags:
172 use Bio::SeqIO;
173 use IO::String;
174 my $in = Bio::SeqIO->new(-file => "emblfile",
175 -format => 'EMBL');
176 while ( my $seq = $in->next_seq() ) {
177 # the output handle is reset for every file
178 my $stringio = IO::String->new($string);
179 my $out = Bio::SeqIO->new(-fh => $stringio,
180 -format => 'fasta');
181 # output goes into $string
182 $out->write_seq($seq);
183 # modify $string
184 $string =~ s|(>)(\w+)|$1<font color="Red">$2</font>|g;
185 # print into STDOUT
186 print $string;
189 =item -format
191 Specify the format of the file. Supported formats include fasta,
192 genbank, embl, swiss (SwissProt), Entrez Gene and tracefile formats
193 such as abi (ABI) and scf. There are many more, for a complete listing
194 see the SeqIO HOWTO (L<http://bioperl.open-bio.org/wiki/HOWTO:SeqIO>).
196 If no format is specified and a filename is given then the module will
197 attempt to deduce the format from the filename suffix. If there is no
198 suffix that Bioperl understands then it will attempt to guess the
199 format based on file content. If this is unsuccessful then Fasta
200 format is assumed.
202 The format name is case-insensitive: 'FASTA', 'Fasta' and 'fasta' are
203 all valid.
205 Currently, the tracefile formats (except for SCF) require installation
206 of the external Staden "io_lib" package, as well as the
207 Bio::SeqIO::staden::read package available from the bioperl-ext
208 repository.
210 =item -alphabet
212 Sets the alphabet ('dna', 'rna', or 'protein'). When the alphabet is
213 set then Bioperl will not attempt to guess what the alphabet is. This
214 may be important because Bioperl does not always guess correctly.
216 =item -flush
218 By default, all files (or filehandles) opened for writing sequences
219 will be flushed after each write_seq() (making the file immediately
220 usable). If you do not need this facility and would like to marginally
221 improve the efficiency of writing multiple sequences to the same file
222 (or filehandle), pass the -flush option '0' or any other value that
223 evaluates as defined but false:
225 my $gb = Bio::SeqIO->new(-file => "<gball.gbk",
226 -format => "gb");
227 my $fa = Bio::SeqIO->new(-file => ">gball.fa",
228 -format => "fasta",
229 -flush => 0); # go as fast as we can!
230 while($seq = $gb->next_seq) { $fa->write_seq($seq) }
233 =back
235 =head2 Bio::SeqIO-E<gt>newFh()
237 $fh = Bio::SeqIO->newFh(-fh => \*FILEHANDLE, -format=>$format);
238 $fh = Bio::SeqIO->newFh(-format => $format);
239 # etc.
241 This constructor behaves like new(), but returns a tied filehandle
242 rather than a Bio::SeqIO object. You can read sequences from this
243 object using the familiar E<lt>E<gt> operator, and write to it using
244 print(). The usual array and $_ semantics work. For example, you can
245 read all sequence objects into an array like this:
247 @sequences = <$fh>;
249 Other operations, such as read(), sysread(), write(), close(), and
250 printf() are not supported.
252 =head1 OBJECT METHODS
254 See below for more detailed summaries. The main methods are:
256 =head2 $sequence = $seqIO-E<gt>next_seq()
258 Fetch the next sequence from the stream.
260 =head2 $seqIO-E<gt>write_seq($sequence [,$another_sequence,...])
262 Write the specified sequence(s) to the stream.
264 =head2 TIEHANDLE(), READLINE(), PRINT()
266 These provide the tie interface. See L<perltie> for more details.
268 =head1 FEEDBACK
270 =head2 Mailing Lists
272 User feedback is an integral part of the evolution of this and other
273 Bioperl modules. Send your comments and suggestions preferably to one
274 of the Bioperl mailing lists.
276 Your participation is much appreciated.
278 bioperl-l@bioperl.org - General discussion
279 http://bioperl.org/wiki/Mailing_lists - About the mailing lists
281 =head2 Support
283 Please direct usage questions or support issues to the mailing list:
285 L<bioperl-l@bioperl.org>
287 rather than to the module maintainer directly. Many experienced and
288 reponsive experts will be able look at the problem and quickly
289 address it. Please include a thorough description of the problem
290 with code and data examples if at all possible.
292 =head2 Reporting Bugs
294 Report bugs to the Bioperl bug tracking system to help us keep track
295 the bugs and their resolution. Bug reports can be submitted via the
296 web:
298 http://bugzilla.open-bio.org/
300 =head1 AUTHOR - Ewan Birney, Lincoln Stein
302 Email birney@ebi.ac.uk
303 lstein@cshl.org
305 =head1 APPENDIX
307 The rest of the documentation details each of the object
308 methods. Internal methods are usually preceded with a _
310 =cut
312 #' Let the code begin...
314 package Bio::SeqIO;
316 use strict;
318 use Bio::Factory::FTLocationFactory;
319 use Bio::Seq::SeqBuilder;
320 use Bio::Tools::GuessSeqFormat;
321 use Symbol;
323 use base qw(Bio::Root::Root Bio::Root::IO Bio::Factory::SequenceStreamI);
325 my %valid_alphabet_cache;
327 =head2 new
329 Title : new
330 Usage : $stream = Bio::SeqIO->new(-file => $filename,
331 -format => 'Format')
332 Function: Returns a new sequence stream
333 Returns : A Bio::SeqIO stream initialised with the appropriate format
334 Args : Named parameters:
335 -file => $filename
336 -fh => filehandle to attach to
337 -format => format
339 Additional arguments may be used to set factories and
340 builders involved in the sequence object creation. None of
341 these must be provided, they all have reasonable defaults.
342 -seqfactory the Bio::Factory::SequenceFactoryI object
343 -locfactory the Bio::Factory::LocationFactoryI object
344 -objbuilder the Bio::Factory::ObjectBuilderI object
346 See L<Bio::SeqIO::Handler>
348 =cut
350 my $entry = 0;
352 sub new {
353 my ($caller,@args) = @_;
354 my $class = ref($caller) || $caller;
356 # or do we want to call SUPER on an object if $caller is an
357 # object?
358 if( $class =~ /Bio::SeqIO::(\S+)/ ) {
359 my ($self) = $class->SUPER::new(@args);
360 $self->_initialize(@args);
361 return $self;
362 } else {
364 my %param = @args;
365 @param{ map { lc $_ } keys %param } = values %param; # lowercase keys
367 if (!defined($param{-file}) && !defined($param{-fh})) {
368 $class->throw("file argument provided, but with an undefined value") if exists($param{'-file'});
369 $class->throw("fh argument provided, but with an undefined value") if (exists($param{'-fh'}));
372 my $format = $param{'-format'} ||
373 $class->_guess_format( $param{-file} || $ARGV[0] );
375 if( ! $format ) {
376 if ($param{-file}) {
377 $format = Bio::Tools::GuessSeqFormat->new(-file => $param{-file}||$ARGV[0] )->guess;
378 } elsif ($param{-fh}) {
379 $format = Bio::Tools::GuessSeqFormat->new(-fh => $param{-fh}||$ARGV[0] )->guess;
382 $format = "\L$format"; # normalize capitalization to lower case
383 $class->throw("Unknown format given or could not determine it [$format]")
384 unless $format;
385 return unless( $class->_load_format_module($format) );
386 return "Bio::SeqIO::$format"->new(@args);
390 =head2 newFh
392 Title : newFh
393 Usage : $fh = Bio::SeqIO->newFh(-file=>$filename,-format=>'Format')
394 Function: does a new() followed by an fh()
395 Example : $fh = Bio::SeqIO->newFh(-file=>$filename,-format=>'Format')
396 $sequence = <$fh>; # read a sequence object
397 print $fh $sequence; # write a sequence object
398 Returns : filehandle tied to the Bio::SeqIO::Fh class
399 Args :
401 See L<Bio::SeqIO::Fh>
403 =cut
405 sub newFh {
406 my $class = shift;
407 return unless my $self = $class->new(@_);
408 return $self->fh;
411 =head2 fh
413 Title : fh
414 Usage : $obj->fh
415 Function:
416 Example : $fh = $obj->fh; # make a tied filehandle
417 $sequence = <$fh>; # read a sequence object
418 print $fh $sequence; # write a sequence object
419 Returns : filehandle tied to Bio::SeqIO class
420 Args : none
422 =cut
425 sub fh {
426 my $self = shift;
427 my $class = ref($self) || $self;
428 my $s = Symbol::gensym;
429 tie $$s,$class,$self;
430 return $s;
433 # _initialize is chained for all SeqIO classes
435 sub _initialize {
436 my($self, @args) = @_;
438 # flush is initialized by the Root::IO init
440 my ($seqfact,$locfact,$objbuilder, $alphabet) =
441 $self->_rearrange([qw(SEQFACTORY
442 LOCFACTORY
443 OBJBUILDER
444 ALPHABET)
445 ], @args);
447 $locfact = Bio::Factory::FTLocationFactory->new(-verbose => $self->verbose)
448 if ! $locfact;
449 $objbuilder = Bio::Seq::SeqBuilder->new(-verbose => $self->verbose)
450 unless $objbuilder;
451 $self->sequence_builder($objbuilder);
452 $self->location_factory($locfact);
454 # note that this should come last because it propagates the sequence
455 # factory to the sequence builder
456 $seqfact && $self->sequence_factory($seqfact);
458 #bug 2160
459 $alphabet && $self->alphabet($alphabet);
462 # initialize the IO part
463 $self->_initialize_io(@args);
466 =head2 next_seq
468 Title : next_seq
469 Usage : $seq = stream->next_seq
470 Function: Reads the next sequence object from the stream and returns it.
472 Certain driver modules may encounter entries in the stream
473 that are either misformatted or that use syntax not yet
474 understood by the driver. If such an incident is
475 recoverable, e.g., by dismissing a feature of a feature
476 table or some other non-mandatory part of an entry, the
477 driver will issue a warning. In the case of a
478 non-recoverable situation an exception will be thrown. Do
479 not assume that you can resume parsing the same stream
480 after catching the exception. Note that you can always turn
481 recoverable errors into exceptions by calling
482 $stream->verbose(2).
484 Returns : a Bio::Seq sequence object
485 Args : none
487 See L<Bio::Root::RootI>, L<Bio::Factory::SeqStreamI>, L<Bio::Seq>
489 =cut
491 sub next_seq {
492 my ($self, $seq) = @_;
493 $self->throw("Sorry, you cannot read from a generic Bio::SeqIO object.");
496 =head2 write_seq
498 Title : write_seq
499 Usage : $stream->write_seq($seq)
500 Function: writes the $seq object into the stream
501 Returns : 1 for success and 0 for error
502 Args : Bio::Seq object
504 =cut
506 sub write_seq {
507 my ($self, $seq) = @_;
508 $self->throw("Sorry, you cannot write to a generic Bio::SeqIO object.");
512 =head2 alphabet
514 Title : alphabet
515 Usage : $self->alphabet($newval)
516 Function: Set/get the molecule type for the Seq objects to be created.
517 Example : $seqio->alphabet('protein')
518 Returns : value of alphabet: 'dna', 'rna', or 'protein'
519 Args : newvalue (optional)
520 Throws : Exception if the argument is not one of 'dna', 'rna', or 'protein'
522 =cut
524 sub alphabet {
525 my ($self, $value) = @_;
527 if ( defined $value) {
528 $value = lc $value;
529 unless ($valid_alphabet_cache{$value}) {
530 # instead of hard-coding the allowed values once more, we check by
531 # creating a dummy sequence object
532 eval {
533 require Bio::PrimarySeq;
534 my $seq = Bio::PrimarySeq->new('-verbose' => $self->verbose,
535 '-alphabet' => $value);
537 if ($@) {
538 $self->throw("Invalid alphabet: $value\n. See Bio::PrimarySeq for allowed values.");
540 $valid_alphabet_cache{$value} = 1;
542 $self->{'alphabet'} = $value;
544 return $self->{'alphabet'};
547 =head2 _load_format_module
549 Title : _load_format_module
550 Usage : *INTERNAL SeqIO stuff*
551 Function: Loads up (like use) a module at run time on demand
552 Example :
553 Returns :
554 Args :
556 =cut
558 sub _load_format_module {
559 my ($self, $format) = @_;
560 my $module = "Bio::SeqIO::" . $format;
561 my $ok;
563 eval {
564 $ok = $self->_load_module($module);
566 if ( $@ ) {
567 print STDERR <<END;
568 $self: $format cannot be found
569 Exception $@
570 For more information about the SeqIO system please see the SeqIO docs.
571 This includes ways of checking for formats at compile time, not run time
575 return $ok;
578 =head2 _concatenate_lines
580 Title : _concatenate_lines
581 Usage : $s = _concatenate_lines($line, $continuation_line)
582 Function: Private. Concatenates two strings assuming that the second stems
583 from a continuation line of the first. Adds a space between both
584 unless the first ends with a dash.
586 Takes care of either arg being empty.
587 Example :
588 Returns : A string.
589 Args :
591 =cut
593 sub _concatenate_lines {
594 my ($self, $s1, $s2) = @_;
596 $s1 .= " " if($s1 && ($s1 !~ /-$/) && $s2);
597 return ($s1 ? $s1 : "") . ($s2 ? $s2 : "");
600 =head2 _filehandle
602 Title : _filehandle
603 Usage : $obj->_filehandle($newval)
604 Function: This method is deprecated. Call _fh() instead.
605 Example :
606 Returns : value of _filehandle
607 Args : newvalue (optional)
609 =cut
611 sub _filehandle {
612 my ($self,@args) = @_;
613 return $self->_fh(@args);
616 =head2 _guess_format
618 Title : _guess_format
619 Usage : $obj->_guess_format($filename)
620 Function: guess format based on file suffix
621 Example :
622 Returns : guessed format of filename (lower case)
623 Args :
624 Notes : formats that _filehandle() will guess include fasta,
625 genbank, scf, pir, embl, raw, gcg, ace, bsml, swissprot,
626 fastq and phd/phred
628 =cut
630 sub _guess_format {
631 my $class = shift;
632 return unless $_ = shift;
633 return 'abi' if /\.ab[i1]$/i;
634 return 'ace' if /\.ace$/i;
635 return 'alf' if /\.alf$/i;
636 return 'bsml' if /\.(bsm|bsml)$/i;
637 return 'ctf' if /\.ctf$/i;
638 return 'embl' if /\.(embl|ebl|emb|dat)$/i;
639 return 'entrezgene' if /\.asn$/i;
640 return 'exp' if /\.exp$/i;
641 return 'fasta' if /\.(fasta|fast|fas|seq|fa|fsa|nt|aa|fna|faa)$/i;
642 return 'fastq' if /\.fastq$/i;
643 return 'gcg' if /\.gcg$/i;
644 return 'genbank' if /\.(gb|gbank|genbank|gbk|gbs)$/i;
645 return 'phd' if /\.(phd|phred)$/i;
646 return 'pir' if /\.pir$/i;
647 return 'pln' if /\.pln$/i;
648 return 'raw' if /\.(txt)$/i;
649 return 'scf' if /\.scf$/i;
650 return 'swiss' if /\.(swiss|sp)$/i;
652 # from Strider 1.4 Release Notes: The file name extensions used by
653 # Strider 1.4 are ".xdna", ".xdgn", ".xrna" and ".xprt" for DNA,
654 # DNA Degenerate, RNA and Protein Sequence Files, respectively
655 return 'strider' if /\.(xdna|xdgn|xrna|xprt)$/i;
657 return 'ztr' if /\.ztr$/i;
660 sub DESTROY {
661 my $self = shift;
662 $self->close();
665 sub TIEHANDLE {
666 my ($class,$val) = @_;
667 return bless {'seqio' => $val}, $class;
670 sub READLINE {
671 my $self = shift;
672 return $self->{'seqio'}->next_seq() unless wantarray;
673 my (@list, $obj);
674 push @list, $obj while $obj = $self->{'seqio'}->next_seq();
675 return @list;
678 sub PRINT {
679 my $self = shift;
680 $self->{'seqio'}->write_seq(@_);
683 =head2 sequence_factory
685 Title : sequence_factory
686 Usage : $seqio->sequence_factory($seqfactory)
687 Function: Get/Set the Bio::Factory::SequenceFactoryI
688 Returns : Bio::Factory::SequenceFactoryI
689 Args : [optional] Bio::Factory::SequenceFactoryI
691 =cut
693 sub sequence_factory{
694 my ($self,$obj) = @_;
695 if( defined $obj ) {
696 if( ! ref($obj) || ! $obj->isa('Bio::Factory::SequenceFactoryI') ) {
697 $self->throw("Must provide a valid Bio::Factory::SequenceFactoryI object to ".ref($self)."::sequence_factory()");
699 $self->{'_seqio_seqfactory'} = $obj;
700 my $builder = $self->sequence_builder();
701 if($builder && $builder->can('sequence_factory') &&
702 (! $builder->sequence_factory())) {
703 $builder->sequence_factory($obj);
706 $self->{'_seqio_seqfactory'};
709 =head2 object_factory
711 Title : object_factory
712 Usage : $obj->object_factory($newval)
713 Function: This is an alias to sequence_factory with a more generic name.
714 Example :
715 Returns : value of object_factory (a scalar)
716 Args : on set, new value (a scalar or undef, optional)
718 =cut
720 sub object_factory{
721 return shift->sequence_factory(@_);
724 =head2 sequence_builder
726 Title : sequence_builder
727 Usage : $seqio->sequence_builder($seqfactory)
728 Function: Get/Set the Bio::Factory::ObjectBuilderI used to build sequence
729 objects.
731 If you do not set the sequence object builder yourself, it
732 will in fact be an instance of L<Bio::Seq::SeqBuilder>, and
733 you may use all methods documented there to configure it.
735 Returns : a Bio::Factory::ObjectBuilderI compliant object
736 Args : [optional] a Bio::Factory::ObjectBuilderI compliant object
738 =cut
740 sub sequence_builder{
741 my ($self,$obj) = @_;
742 if( defined $obj ) {
743 if( ! ref($obj) || ! $obj->isa('Bio::Factory::ObjectBuilderI') ) {
744 $self->throw("Must provide a valid Bio::Factory::ObjectBuilderI object to ".ref($self)."::sequence_builder()");
746 $self->{'_object_builder'} = $obj;
748 $self->{'_object_builder'};
751 =head2 location_factory
753 Title : location_factory
754 Usage : $seqio->location_factory($locfactory)
755 Function: Get/Set the Bio::Factory::LocationFactoryI object to be used for
756 location string parsing
757 Returns : a Bio::Factory::LocationFactoryI implementing object
758 Args : [optional] on set, a Bio::Factory::LocationFactoryI implementing
759 object.
761 =cut
763 sub location_factory{
764 my ($self,$obj) = @_;
765 if( defined $obj ) {
766 if( ! ref($obj) || ! $obj->isa('Bio::Factory::LocationFactoryI') ) {
767 $self->throw("Must provide a valid Bio::Factory::LocationFactoryI" .
768 " object to ".ref($self)."->location_factory()");
770 $self->{'_seqio_locfactory'} = $obj;
772 $self->{'_seqio_locfactory'};