3 # BioPerl module for Bio::SeqIO
5 # Cared for by Ewan Birney <birney@ebi.ac.uk>
6 # and Lincoln Stein <lstein@cshl.org>
8 # Copyright Ewan Birney
10 # You may distribute this module under the same terms as perl itself
13 # October 18, 1999 Largely rewritten by Lincoln Stein
15 # POD documentation - main docs before the code
19 Bio::SeqIO - Handler for SeqIO Formats
25 $in = Bio::SeqIO->new(-file => "inputfilename" ,
27 $out = Bio::SeqIO->new(-file => ">outputfilename" ,
30 while ( my $seq = $in->next_seq() ) {
31 $out->write_seq($seq);
34 # Now, to actually get at the sequence object, use the standard Bio::Seq
35 # methods (look at Bio::Seq if you don't know what they are)
39 $in = Bio::SeqIO->new(-file => "inputfilename" ,
40 -format => 'genbank');
42 while ( my $seq = $in->next_seq() ) {
43 print "Sequence ",$seq->id, " first 10 bases ",
44 $seq->subseq(1,10), "\n";
48 # The SeqIO system does have a filehandle binding. Most people find this
49 # a little confusing, but it does mean you can write the world's
50 # smallest reformatter
54 $in = Bio::SeqIO->newFh(-file => "inputfilename" ,
56 $out = Bio::SeqIO->newFh(-format => 'EMBL');
58 # World's shortest Fasta<->EMBL format converter:
59 print $out $_ while <$in>;
64 Bio::SeqIO is a handler module for the formats in the SeqIO set (eg,
65 Bio::SeqIO::fasta). It is the officially sanctioned way of getting at
66 the format objects, which most people should use.
68 The Bio::SeqIO system can be thought of like biological file handles.
69 They are attached to filehandles with smart formatting rules (eg,
70 genbank format, or EMBL format, or binary trace file format) and
71 can either read or write sequence objects (Bio::Seq objects, or
72 more correctly, Bio::SeqI implementing objects, of which Bio::Seq is
73 one such object). If you want to know what to do with a Bio::Seq
74 object, read L<Bio::Seq>.
76 The idea is that you request a stream object for a particular format.
77 All the stream objects have a notion of an internal file that is read
78 from or written to. A particular SeqIO object instance is configured
79 for either input or output. A specific example of a stream object is
80 the Bio::SeqIO::fasta object.
82 Each stream object has functions
88 $stream->write_seq($seq);
90 As an added bonus, you can recover a filehandle that is tied to the
91 SeqIO object, allowing you to use the standard E<lt>E<gt> and print
92 operations to read and write sequence objects:
96 $stream = Bio::SeqIO->newFh(-format => 'Fasta',
98 # read from standard input or the input filenames
100 while ( $seq = <$stream> ) {
101 # do something with $seq
106 print $stream $seq; # when stream is in output mode
108 This makes the simplest ever reformatter
113 $format2 = shift || die
114 "Usage: reformat format1 format2 < input > output";
118 $in = Bio::SeqIO->newFh(-format => $format1, -fh => \*ARGV );
119 $out = Bio::SeqIO->newFh(-format => $format2 );
120 # Note: you might want to quote -format to keep older
121 # perl's from complaining.
123 print $out $_ while <$in>;
128 =head2 Bio::SeqIO-E<gt>new()
130 $seqIO = Bio::SeqIO->new(-file => 'filename', -format=>$format);
131 $seqIO = Bio::SeqIO->new(-fh => \*FILEHANDLE, -format=>$format);
132 $seqIO = Bio::SeqIO->new(-format => $format);
134 The new() class method constructs a new Bio::SeqIO object. The
135 returned object can be used to retrieve or print Seq objects. new()
136 accepts the following parameters:
142 A file path to be opened for reading or writing. The usual Perl
145 'file' # open file for reading
146 '>file' # open file for writing
147 '>>file' # open file for appending
148 '+<file' # open file read/write
149 'command |' # open a pipe from the command
150 '| command' # open a pipe to the command
154 You may provide new() with a previously-opened filehandle. For
155 example, to read from STDIN:
157 $seqIO = Bio::SeqIO->new(-fh => \*STDIN);
159 Note that you must pass filehandles as references to globs.
161 If neither a filehandle nor a filename is specified, then the module
162 will read from the @ARGV array or STDIN, using the familiar E<lt>E<gt>
165 A string filehandle is handy if you want to modify the output in the
166 memory, before printing it out. The following program reads in EMBL
167 formatted entries from a file and prints them out in fasta format with
172 my $in = Bio::SeqIO->new(-file => "emblfile",
174 while ( my $seq = $in->next_seq() ) {
175 # the output handle is reset for every file
176 my $stringio = IO::String->new($string);
177 my $out = Bio::SeqIO->new(-fh => $stringio,
179 # output goes into $string
180 $out->write_seq($seq);
182 $string =~ s|(>)(\w+)|$1<font color="Red">$2</font>|g;
189 Specify the format of the file. Supported formats include fasta,
190 genbank, embl, swiss (SwissProt), Entrez Gene and tracefile formats
191 such as abi (ABI) and scf. There are many more, for a complete listing
192 see the SeqIO HOWTO (L<http://bioperl.open-bio.org/wiki/HOWTO:SeqIO>).
194 If no format is specified and a filename is given then the module will
195 attempt to deduce the format from the filename suffix. If there is no
196 suffix that Bioperl understands then it will attempt to guess the
197 format based on file content. If this is unsuccessful then Fasta
200 The format name is case-insensitive: 'FASTA', 'Fasta' and 'fasta' are
203 Currently, the tracefile formats (except for SCF) require installation
204 of the external Staden "io_lib" package, as well as the
205 Bio::SeqIO::staden::read package available from the bioperl-ext
210 Sets the alphabet ('dna', 'rna', or 'protein'). When the alphabet is
211 set then Bioperl will not attempt to guess what the alphabet is. This
212 may be important because Bioperl does not always guess correctly.
216 By default, all files (or filehandles) opened for writing sequences
217 will be flushed after each write_seq() (making the file immediately
218 usable). If you do not need this facility and would like to marginally
219 improve the efficiency of writing multiple sequences to the same file
220 (or filehandle), pass the -flush option '0' or any other value that
221 evaluates as defined but false:
223 my $gb = Bio::SeqIO->new(-file => "<gball.gbk",
225 my $fa = Bio::SeqIO->new(-file => ">gball.fa",
227 -flush => 0); # go as fast as we can!
228 while($seq = $gb->next_seq) { $fa->write_seq($seq) }
233 =head2 Bio::SeqIO-E<gt>newFh()
235 $fh = Bio::SeqIO->newFh(-fh => \*FILEHANDLE, -format=>$format);
236 $fh = Bio::SeqIO->newFh(-format => $format);
239 This constructor behaves like new(), but returns a tied filehandle
240 rather than a Bio::SeqIO object. You can read sequences from this
241 object using the familiar E<lt>E<gt> operator, and write to it using
242 print(). The usual array and $_ semantics work. For example, you can
243 read all sequence objects into an array like this:
247 Other operations, such as read(), sysread(), write(), close(), and
248 printf() are not supported.
250 =head1 OBJECT METHODS
252 See below for more detailed summaries. The main methods are:
254 =head2 $sequence = $seqIO-E<gt>next_seq()
256 Fetch the next sequence from the stream.
258 =head2 $seqIO-E<gt>write_seq($sequence [,$another_sequence,...])
260 Write the specified sequence(s) to the stream.
262 =head2 TIEHANDLE(), READLINE(), PRINT()
264 These provide the tie interface. See L<perltie> for more details.
270 User feedback is an integral part of the evolution of this and other
271 Bioperl modules. Send your comments and suggestions preferably to one
272 of the Bioperl mailing lists.
274 Your participation is much appreciated.
276 bioperl-l@bioperl.org - General discussion
277 http://bioperl.org/wiki/Mailing_lists - About the mailing lists
279 =head2 Reporting Bugs
281 Report bugs to the Bioperl bug tracking system to help us keep track
282 the bugs and their resolution. Bug reports can be submitted via the
285 http://bugzilla.open-bio.org/
287 =head1 AUTHOR - Ewan Birney, Lincoln Stein
289 Email birney@ebi.ac.uk
294 The rest of the documentation details each of the object
295 methods. Internal methods are usually preceded with a _
299 #' Let the code begin...
305 use Bio
::Factory
::FTLocationFactory
;
306 use Bio
::Seq
::SeqBuilder
;
307 use Bio
::Tools
::GuessSeqFormat
;
310 use base
qw(Bio::Root::Root Bio::Root::IO Bio::Factory::SequenceStreamI);
312 my %valid_alphabet_cache;
317 Usage : $stream = Bio::SeqIO->new(-file => $filename,
319 Function: Returns a new sequence stream
320 Returns : A Bio::SeqIO stream initialised with the appropriate format
321 Args : Named parameters:
323 -fh => filehandle to attach to
326 Additional arguments may be used to set factories and
327 builders involved in the sequence object creation. None of
328 these must be provided, they all have reasonable defaults.
329 -seqfactory the Bio::Factory::SequenceFactoryI object
330 -locfactory the Bio::Factory::LocationFactoryI object
331 -objbuilder the Bio::Factory::ObjectBuilderI object
333 See L<Bio::SeqIO::Handler>
340 my ($caller,@args) = @_;
341 my $class = ref($caller) || $caller;
343 # or do we want to call SUPER on an object if $caller is an
345 if( $class =~ /Bio::SeqIO::(\S+)/ ) {
346 my ($self) = $class->SUPER::new
(@args);
347 $self->_initialize(@args);
352 @param{ map { lc $_ } keys %param } = values %param; # lowercase keys
354 if (!defined($param{-file
}) && !defined($param{-fh
})) {
355 $class->throw("file argument provided, but with an undefined value") if exists($param{'-file'});
356 $class->throw("fh argument provided, but with an undefined value") if (exists($param{'-fh'}));
359 my $format = $param{'-format'} ||
360 $class->_guess_format( $param{-file
} || $ARGV[0] );
364 $format = Bio
::Tools
::GuessSeqFormat
->new(-file
=> $param{-file
}||$ARGV[0] )->guess;
365 } elsif ($param{-fh
}) {
366 $format = Bio
::Tools
::GuessSeqFormat
->new(-fh
=> $param{-fh
}||$ARGV[0] )->guess;
369 $format = "\L$format"; # normalize capitalization to lower case
370 $class->throw("Unknown format given or could not determine it [$format]")
372 return unless( $class->_load_format_module($format) );
373 return "Bio::SeqIO::$format"->new(@args);
380 Usage : $fh = Bio::SeqIO->newFh(-file=>$filename,-format=>'Format')
381 Function: does a new() followed by an fh()
382 Example : $fh = Bio::SeqIO->newFh(-file=>$filename,-format=>'Format')
383 $sequence = <$fh>; # read a sequence object
384 print $fh $sequence; # write a sequence object
385 Returns : filehandle tied to the Bio::SeqIO::Fh class
388 See L<Bio::SeqIO::Fh>
394 return unless my $self = $class->new(@_);
403 Example : $fh = $obj->fh; # make a tied filehandle
404 $sequence = <$fh>; # read a sequence object
405 print $fh $sequence; # write a sequence object
406 Returns : filehandle tied to Bio::SeqIO class
414 my $class = ref($self) || $self;
415 my $s = Symbol
::gensym
;
416 tie
$$s,$class,$self;
420 # _initialize is chained for all SeqIO classes
423 my($self, @args) = @_;
425 # flush is initialized by the Root::IO init
427 my ($seqfact,$locfact,$objbuilder, $alphabet) =
428 $self->_rearrange([qw(SEQFACTORY
434 $locfact = Bio
::Factory
::FTLocationFactory
->new(-verbose
=> $self->verbose)
436 $objbuilder = Bio
::Seq
::SeqBuilder
->new(-verbose
=> $self->verbose)
438 $self->sequence_builder($objbuilder);
439 $self->location_factory($locfact);
441 # note that this should come last because it propagates the sequence
442 # factory to the sequence builder
443 $seqfact && $self->sequence_factory($seqfact);
446 $alphabet && $self->alphabet($alphabet);
449 # initialize the IO part
450 $self->_initialize_io(@args);
456 Usage : $seq = stream->next_seq
457 Function: Reads the next sequence object from the stream and returns it.
459 Certain driver modules may encounter entries in the stream
460 that are either misformatted or that use syntax not yet
461 understood by the driver. If such an incident is
462 recoverable, e.g., by dismissing a feature of a feature
463 table or some other non-mandatory part of an entry, the
464 driver will issue a warning. In the case of a
465 non-recoverable situation an exception will be thrown. Do
466 not assume that you can resume parsing the same stream
467 after catching the exception. Note that you can always turn
468 recoverable errors into exceptions by calling
471 Returns : a Bio::Seq sequence object
474 See L<Bio::Root::RootI>, L<Bio::Factory::SeqStreamI>, L<Bio::Seq>
479 my ($self, $seq) = @_;
480 $self->throw("Sorry, you cannot read from a generic Bio::SeqIO object.");
486 Usage : $stream->write_seq($seq)
487 Function: writes the $seq object into the stream
488 Returns : 1 for success and 0 for error
489 Args : Bio::Seq object
494 my ($self, $seq) = @_;
495 $self->throw("Sorry, you cannot write to a generic Bio::SeqIO object.");
502 Usage : $self->alphabet($newval)
503 Function: Set/get the molecule type for the Seq objects to be created.
504 Example : $seqio->alphabet('protein')
505 Returns : value of alphabet: 'dna', 'rna', or 'protein'
506 Args : newvalue (optional)
507 Throws : Exception if the argument is not one of 'dna', 'rna', or 'protein'
512 my ($self, $value) = @_;
514 if ( defined $value) {
516 unless ($valid_alphabet_cache{$value}) {
517 # instead of hard-coding the allowed values once more, we check by
518 # creating a dummy sequence object
520 require Bio
::PrimarySeq
;
521 my $seq = Bio
::PrimarySeq
->new('-verbose' => $self->verbose,
522 '-alphabet' => $value);
525 $self->throw("Invalid alphabet: $value\n. See Bio::PrimarySeq for allowed values.");
527 $valid_alphabet_cache{$value} = 1;
529 $self->{'alphabet'} = $value;
531 return $self->{'alphabet'};
534 =head2 _load_format_module
536 Title : _load_format_module
537 Usage : *INTERNAL SeqIO stuff*
538 Function: Loads up (like use) a module at run time on demand
545 sub _load_format_module
{
546 my ($self, $format) = @_;
547 my $module = "Bio::SeqIO::" . $format;
551 $ok = $self->_load_module($module);
555 $self: $format cannot be found
557 For more information about the SeqIO system please see the SeqIO docs.
558 This includes ways of checking for formats at compile time, not run time
565 =head2 _concatenate_lines
567 Title : _concatenate_lines
568 Usage : $s = _concatenate_lines($line, $continuation_line)
569 Function: Private. Concatenates two strings assuming that the second stems
570 from a continuation line of the first. Adds a space between both
571 unless the first ends with a dash.
573 Takes care of either arg being empty.
580 sub _concatenate_lines
{
581 my ($self, $s1, $s2) = @_;
583 $s1 .= " " if($s1 && ($s1 !~ /-$/) && $s2);
584 return ($s1 ?
$s1 : "") . ($s2 ?
$s2 : "");
590 Usage : $obj->_filehandle($newval)
591 Function: This method is deprecated. Call _fh() instead.
593 Returns : value of _filehandle
594 Args : newvalue (optional)
599 my ($self,@args) = @_;
600 return $self->_fh(@args);
605 Title : _guess_format
606 Usage : $obj->_guess_format($filename)
607 Function: guess format based on file suffix
609 Returns : guessed format of filename (lower case)
611 Notes : formats that _filehandle() will guess include fasta,
612 genbank, scf, pir, embl, raw, gcg, ace, bsml, swissprot,
619 return unless $_ = shift;
620 return 'abi' if /\.ab[i1]$/i;
621 return 'ace' if /\.ace$/i;
622 return 'alf' if /\.alf$/i;
623 return 'bsml' if /\.(bsm|bsml)$/i;
624 return 'ctf' if /\.ctf$/i;
625 return 'embl' if /\.(embl|ebl|emb|dat)$/i;
626 return 'entrezgene' if /\.asn$/i;
627 return 'exp' if /\.exp$/i;
628 return 'fasta' if /\.(fasta|fast|fas|seq|fa|fsa|nt|aa|fna|faa)$/i;
629 return 'fastq' if /\.fastq$/i;
630 return 'gcg' if /\.gcg$/i;
631 return 'genbank' if /\.(gb|gbank|genbank|gbk|gbs)$/i;
632 return 'phd' if /\.(phd|phred)$/i;
633 return 'pir' if /\.pir$/i;
634 return 'pln' if /\.pln$/i;
635 return 'raw' if /\.(txt)$/i;
636 return 'scf' if /\.scf$/i;
637 return 'swiss' if /\.(swiss|sp)$/i;
639 # from Strider 1.4 Release Notes: The file name extensions used by
640 # Strider 1.4 are ".xdna", ".xdgn", ".xrna" and ".xprt" for DNA,
641 # DNA Degenerate, RNA and Protein Sequence Files, respectively
642 return 'strider' if /\.(xdna|xdgn|xrna|xprt)$/i;
644 return 'ztr' if /\.ztr$/i;
653 my ($class,$val) = @_;
654 return bless {'seqio' => $val}, $class;
659 return $self->{'seqio'}->next_seq() unless wantarray;
661 push @list, $obj while $obj = $self->{'seqio'}->next_seq();
667 $self->{'seqio'}->write_seq(@_);
670 =head2 sequence_factory
672 Title : sequence_factory
673 Usage : $seqio->sequence_factory($seqfactory)
674 Function: Get/Set the Bio::Factory::SequenceFactoryI
675 Returns : Bio::Factory::SequenceFactoryI
676 Args : [optional] Bio::Factory::SequenceFactoryI
680 sub sequence_factory
{
681 my ($self,$obj) = @_;
683 if( ! ref($obj) || ! $obj->isa('Bio::Factory::SequenceFactoryI') ) {
684 $self->throw("Must provide a valid Bio::Factory::SequenceFactoryI object to ".ref($self)."::sequence_factory()");
686 $self->{'_seqio_seqfactory'} = $obj;
687 my $builder = $self->sequence_builder();
688 if($builder && $builder->can('sequence_factory') &&
689 (! $builder->sequence_factory())) {
690 $builder->sequence_factory($obj);
693 $self->{'_seqio_seqfactory'};
696 =head2 object_factory
698 Title : object_factory
699 Usage : $obj->object_factory($newval)
700 Function: This is an alias to sequence_factory with a more generic name.
702 Returns : value of object_factory (a scalar)
703 Args : on set, new value (a scalar or undef, optional)
708 return shift->sequence_factory(@_);
711 =head2 sequence_builder
713 Title : sequence_builder
714 Usage : $seqio->sequence_builder($seqfactory)
715 Function: Get/Set the Bio::Factory::ObjectBuilderI used to build sequence
718 If you do not set the sequence object builder yourself, it
719 will in fact be an instance of L<Bio::Seq::SeqBuilder>, and
720 you may use all methods documented there to configure it.
722 Returns : a Bio::Factory::ObjectBuilderI compliant object
723 Args : [optional] a Bio::Factory::ObjectBuilderI compliant object
727 sub sequence_builder
{
728 my ($self,$obj) = @_;
730 if( ! ref($obj) || ! $obj->isa('Bio::Factory::ObjectBuilderI') ) {
731 $self->throw("Must provide a valid Bio::Factory::ObjectBuilderI object to ".ref($self)."::sequence_builder()");
733 $self->{'_object_builder'} = $obj;
735 $self->{'_object_builder'};
738 =head2 location_factory
740 Title : location_factory
741 Usage : $seqio->location_factory($locfactory)
742 Function: Get/Set the Bio::Factory::LocationFactoryI object to be used for
743 location string parsing
744 Returns : a Bio::Factory::LocationFactoryI implementing object
745 Args : [optional] on set, a Bio::Factory::LocationFactoryI implementing
750 sub location_factory
{
751 my ($self,$obj) = @_;
753 if( ! ref($obj) || ! $obj->isa('Bio::Factory::LocationFactoryI') ) {
754 $self->throw("Must provide a valid Bio::Factory::LocationFactoryI" .
755 " object to ".ref($self)."->location_factory()");
757 $self->{'_seqio_locfactory'} = $obj;
759 $self->{'_seqio_locfactory'};