1 # BioPerl module for Bio::SeqIO
3 # Please direct questions and support issues to <bioperl-l@bioperl.org>
5 # Cared for by Ewan Birney <birney@ebi.ac.uk>
6 # and Lincoln Stein <lstein@cshl.org>
8 # Copyright Ewan Birney
10 # You may distribute this module under the same terms as perl itself
13 # October 18, 1999 Largely rewritten by Lincoln Stein
15 # POD documentation - main docs before the code
19 Bio::SeqIO - Handler for SeqIO Formats
25 $in = Bio::SeqIO->new(-file => "inputfilename" ,
27 $out = Bio::SeqIO->new(-file => ">outputfilename" ,
30 while ( my $seq = $in->next_seq() ) {
31 $out->write_seq($seq);
34 # Now, to actually get at the sequence object, use the standard Bio::Seq
35 # methods (look at Bio::Seq if you don't know what they are)
39 $in = Bio::SeqIO->new(-file => "inputfilename" ,
40 -format => 'genbank');
42 while ( my $seq = $in->next_seq() ) {
43 print "Sequence ",$seq->id, " first 10 bases ",
44 $seq->subseq(1,10), "\n";
48 # The SeqIO system does have a filehandle binding. Most people find this
49 # a little confusing, but it does mean you can write the world's
50 # smallest reformatter
54 $in = Bio::SeqIO->newFh(-file => "inputfilename" ,
56 $out = Bio::SeqIO->newFh(-format => 'EMBL');
58 # World's shortest Fasta<->EMBL format converter:
59 print $out $_ while <$in>;
64 Bio::SeqIO is a handler module for the formats in the SeqIO set (eg,
65 Bio::SeqIO::fasta). It is the officially sanctioned way of getting at
66 the format objects, which most people should use.
68 The Bio::SeqIO system can be thought of like biological file handles.
69 They are attached to filehandles with smart formatting rules (eg,
70 genbank format, or EMBL format, or binary trace file format) and
71 can either read or write sequence objects (Bio::Seq objects, or
72 more correctly, Bio::SeqI implementing objects, of which Bio::Seq is
73 one such object). If you want to know what to do with a Bio::Seq
74 object, read L<Bio::Seq>.
76 The idea is that you request a stream object for a particular format.
77 All the stream objects have a notion of an internal file that is read
78 from or written to. A particular SeqIO object instance is configured
79 for either input or output. A specific example of a stream object is
80 the Bio::SeqIO::fasta object.
82 Each stream object has functions
88 $stream->write_seq($seq);
90 As an added bonus, you can recover a filehandle that is tied to the
91 SeqIO object, allowing you to use the standard E<lt>E<gt> and print
92 operations to read and write sequence objects:
96 $stream = Bio::SeqIO->newFh(-format => 'Fasta',
98 # read from standard input or the input filenames
100 while ( $seq = <$stream> ) {
101 # do something with $seq
106 print $stream $seq; # when stream is in output mode
108 This makes the simplest ever reformatter
113 my $format2 = shift || die
114 "Usage: reformat format1 format2 < input > output";
118 my $in = Bio::SeqIO->newFh(-format => $format1, -fh => \*ARGV );
119 my $out = Bio::SeqIO->newFh(-format => $format2 );
120 # Note: you might want to quote -format to keep older
121 # perl's from complaining.
123 print $out $_ while <$in>;
128 =head2 Bio::SeqIO-E<gt>new()
130 $seqIO = Bio::SeqIO->new(-file => 'seqs.fasta', -format => $format);
131 $seqIO = Bio::SeqIO->new(-fh => \*FILEHANDLE, -format => $format);
132 $seqIO = Bio::SeqIO->new(-string => $string , -format => $format);
133 $seqIO = Bio::SeqIO->new(-format => $format);
135 The new() class method constructs a new Bio::SeqIO object. The returned object
136 can be used to retrieve or print Seq objects. new() accepts the following
143 A file path to be opened for reading or writing. The usual Perl
146 'file' # open file for reading
147 '>file' # open file for writing
148 '>>file' # open file for appending
149 '+<file' # open file read/write
150 'command |' # open a pipe from the command
151 '| command' # open a pipe to the command
155 You may use new() with a opened filehandle, provided as a glob reference. For
156 example, to read from STDIN:
158 my $seqIO = Bio::SeqIO->new(-fh => \*STDIN);
160 A string filehandle is handy if you want to modify the output in the
161 memory, before printing it out. The following program reads in EMBL
162 formatted entries from a file and prints them out in fasta format with
167 my $in = Bio::SeqIO->new(-file => "emblfile",
169 while ( my $seq = $in->next_seq() ) {
170 # the output handle is reset for every file
171 my $stringio = IO::String->new($string);
172 my $out = Bio::SeqIO->new(-fh => $stringio,
174 # output goes into $string
175 $out->write_seq($seq);
177 $string =~ s|(>)(\w+)|$1<font color="Red">$2</font>|g;
184 A string to read the sequences from. For example:
186 my $string = ">seq1\nACGCTAGCTAGC\n";
187 my $seqIO = Bio::SeqIO->new(-string => $string);
191 Specify the format of the file. Supported formats include fasta,
192 genbank, embl, swiss (SwissProt), Entrez Gene and tracefile formats
193 such as abi (ABI) and scf. There are many more, for a complete listing
194 see the SeqIO HOWTO (L<http://bioperl.open-bio.org/wiki/HOWTO:SeqIO>).
196 If no format is specified and a filename is given then the module will
197 attempt to deduce the format from the filename suffix. If there is no
198 suffix that Bioperl understands then it will attempt to guess the
199 format based on file content. If this is unsuccessful then SeqIO will
202 The format name is case-insensitive: 'FASTA', 'Fasta' and 'fasta' are
205 Currently, the tracefile formats (except for SCF) require installation
206 of the external Staden "io_lib" package, as well as the
207 Bio::SeqIO::staden::read package available from the bioperl-ext
212 Sets the alphabet ('dna', 'rna', or 'protein'). When the alphabet is
213 set then Bioperl will not attempt to guess what the alphabet is. This
214 may be important because Bioperl does not always guess correctly.
218 By default, all files (or filehandles) opened for writing sequences
219 will be flushed after each write_seq() (making the file immediately
220 usable). If you do not need this facility and would like to marginally
221 improve the efficiency of writing multiple sequences to the same file
222 (or filehandle), pass the -flush option '0' or any other value that
223 evaluates as defined but false:
225 my $gb = Bio::SeqIO->new(-file => "<gball.gbk",
227 my $fa = Bio::SeqIO->new(-file => ">gball.fa",
229 -flush => 0); # go as fast as we can!
230 while($seq = $gb->next_seq) { $fa->write_seq($seq) }
234 Provide a Bio::Factory::SequenceFactoryI object. See the sequence_factory() method.
238 Provide a Bio::Factory::LocationFactoryI object. See the location_factory() method.
242 Provide a Bio::Factory::ObjectBuilderI object. See the object_builder() method.
246 =head2 Bio::SeqIO-E<gt>newFh()
248 $fh = Bio::SeqIO->newFh(-fh => \*FILEHANDLE, -format=>$format);
249 $fh = Bio::SeqIO->newFh(-format => $format);
252 This constructor behaves like new(), but returns a tied filehandle
253 rather than a Bio::SeqIO object. You can read sequences from this
254 object using the familiar E<lt>E<gt> operator, and write to it using
255 print(). The usual array and $_ semantics work. For example, you can
256 read all sequence objects into an array like this:
260 Other operations, such as read(), sysread(), write(), close(), and
261 printf() are not supported.
263 =head1 OBJECT METHODS
265 See below for more detailed summaries. The main methods are:
267 =head2 $sequence = $seqIO-E<gt>next_seq()
269 Fetch the next sequence from the stream, or nothing if no more.
271 =head2 $seqIO-E<gt>write_seq($sequence [,$another_sequence,...])
273 Write the specified sequence(s) to the stream.
275 =head2 TIEHANDLE(), READLINE(), PRINT()
277 These provide the tie interface. See L<perltie> for more details.
283 User feedback is an integral part of the evolution of this and other
284 Bioperl modules. Send your comments and suggestions preferably to one
285 of the Bioperl mailing lists.
287 Your participation is much appreciated.
289 bioperl-l@bioperl.org - General discussion
290 http://bioperl.org/wiki/Mailing_lists - About the mailing lists
294 Please direct usage questions or support issues to the mailing list:
296 bioperl-l@bioperl.org
298 rather than to the module maintainer directly. Many experienced and
299 responsive experts will be able look at the problem and quickly
300 address it. Please include a thorough description of the problem
301 with code and data examples if at all possible.
303 =head2 Reporting Bugs
305 Report bugs to the Bioperl bug tracking system to help us keep track
306 the bugs and their resolution. Bug reports can be submitted via the
309 https://github.com/bioperl/bioperl-live/issues
311 =head1 AUTHOR - Ewan Birney, Lincoln Stein
313 Email birney@ebi.ac.uk
318 The rest of the documentation details each of the object
319 methods. Internal methods are usually preceded with a _
323 #' Let the code begin...
330 use Bio
::Factory
::FTLocationFactory
;
331 use Bio
::Seq
::SeqBuilder
;
332 use Bio
::Tools
::GuessSeqFormat
;
335 use parent
qw(Bio::Root::Root Bio::Root::IO Bio::Factory::SequenceStreamI);
337 my %valid_alphabet_cache;
343 Usage : $stream = Bio::SeqIO->new(-file => 'sequences.fasta',
345 Function: Returns a new sequence stream
346 Returns : A Bio::SeqIO stream initialised with the appropriate format
347 Args : Named parameters indicating where to read the sequences from or to
349 -file => filename, OR
350 -fh => filehandle to attach to, OR
353 Additional arguments, all with reasonable defaults:
354 -format => format of the sequences, usually auto-detected
355 -alphabet => 'dna', 'rna', or 'protein'
356 -flush => 0 or 1 (default: flush filehandles after each write)
357 -seqfactory => sequence factory
358 -locfactory => location factory
359 -objbuilder => object builder
361 See L<Bio::SeqIO::Handler>
368 my ($caller, @args) = @_;
369 my $class = ref($caller) || $caller;
371 # or do we want to call SUPER on an object if $caller is an
373 if( $class =~ /Bio::SeqIO::(\S+)/ ) {
374 my ($self) = $class->SUPER::new
(@args);
375 $self->_initialize(@args);
379 @params{ map { lc $_ } keys %params } = values %params; # lowercase keys
381 unless( defined $params{-file
} ||
382 defined $params{-fh
} ||
383 defined $params{-string
} ) {
384 $class->throw("file argument provided, but with an undefined value")
385 if exists $params{'-file'};
386 $class->throw("fh argument provided, but with an undefined value")
387 if exists $params{'-fh'};
388 $class->throw("string argument provided, but with an undefined value")
389 if exists($params{'-string'});
390 # $class->throw("No file, fh, or string argument provided"); # neither defined
393 # Determine or guess sequence format and variant
394 my $format = $params{'-format'};
396 if ($params{-file
}) {
397 # Guess from filename extension, and then from file content
398 $format = $class->_guess_format( $params{-file
} ) ||
399 Bio
::Tools
::GuessSeqFormat
->new(-file
=> $params{-file
} )->guess;
400 } elsif ($params{-fh
}) {
401 # Guess from filehandle content
402 $format = Bio
::Tools
::GuessSeqFormat
->new(-fh
=> $params{-fh
} )->guess;
403 } elsif ($params{-string
}) {
404 # Guess from string content
405 $format = Bio
::Tools
::GuessSeqFormat
->new(-text
=> $params{-string
})->guess;
409 # changed 1-3-11; no need to print out an empty string (only way this
410 # exception is triggered) - cjfields
411 $class->throw("Could not guess format from file, filehandle or string")
413 $format = "\L$format"; # normalize capitalization to lower case
415 if ($format =~ /-/) {
416 ($format, my $variant) = split('-', $format, 2);
417 $params{-variant
} = $variant;
420 return unless( $class->_load_format_module($format) );
421 return "Bio::SeqIO::$format"->new(%params);
429 Usage : $fh = Bio::SeqIO->newFh(-file=>$filename,-format=>'Format')
430 Function: Does a new() followed by an fh()
431 Example : $fh = Bio::SeqIO->newFh(-file=>$filename,-format=>'Format')
432 $sequence = <$fh>; # read a sequence object
433 print $fh $sequence; # write a sequence object
434 Returns : filehandle tied to the Bio::SeqIO::Fh class
437 See L<Bio::SeqIO::Fh>
443 return unless my $self = $class->new(@_);
452 Function: Get or set the IO filehandle
453 Example : $fh = $obj->fh; # make a tied filehandle
454 $sequence = <$fh>; # read a sequence object
455 print $fh $sequence; # write a sequence object
456 Returns : filehandle tied to Bio::SeqIO class
463 my $class = ref($self) || $self;
464 my $s = Symbol
::gensym
;
465 tie
$$s,$class,$self;
470 # _initialize is chained for all SeqIO classes
473 my($self, @args) = @_;
475 # flush is initialized by the Root::IO init
477 my ($seqfact,$locfact,$objbuilder, $alphabet) =
478 $self->_rearrange([qw(SEQFACTORY
484 $locfact = Bio
::Factory
::FTLocationFactory
->new(-verbose
=> $self->verbose)
486 $objbuilder = Bio
::Seq
::SeqBuilder
->new(-verbose
=> $self->verbose)
488 $self->sequence_builder($objbuilder);
489 $self->location_factory($locfact);
491 # note that this should come last because it propagates the sequence
492 # factory to the sequence builder
493 $seqfact && $self->sequence_factory($seqfact);
496 $alphabet && $self->alphabet($alphabet);
498 # initialize the IO part
499 $self->_initialize_io(@args);
506 Usage : $seq = stream->next_seq
507 Function: Reads the next sequence object from the stream and returns it.
509 Certain driver modules may encounter entries in the stream
510 that are either misformatted or that use syntax not yet
511 understood by the driver. If such an incident is
512 recoverable, e.g., by dismissing a feature of a feature
513 table or some other non-mandatory part of an entry, the
514 driver will issue a warning. In the case of a
515 non-recoverable situation an exception will be thrown. Do
516 not assume that you can resume parsing the same stream
517 after catching the exception. Note that you can always turn
518 recoverable errors into exceptions by calling
521 Returns : a Bio::Seq sequence object, or nothing if no more sequences
526 See L<Bio::Root::RootI>, L<Bio::Factory::SeqStreamI>, L<Bio::Seq>
531 my ($self, $seq) = @_;
532 $self->throw("Sorry, you cannot read from a generic Bio::SeqIO object.");
539 Usage : $stream->write_seq($seq)
540 Function: writes the $seq object into the stream
541 Returns : 1 for success and 0 for error
542 Args : Bio::Seq object
547 my ($self, $seq) = @_;
548 $self->throw("Sorry, you cannot write to a generic Bio::SeqIO object.");
555 Usage : $format = $stream->format()
556 Function: Get the sequence format
557 Returns : sequence format, e.g. fasta, fastq
562 # format() method inherited from Bio::Root::IO
568 Usage : $self->alphabet($newval)
569 Function: Set/get the molecule type for the Seq objects to be created.
570 Example : $seqio->alphabet('protein')
571 Returns : value of alphabet: 'dna', 'rna', or 'protein'
572 Args : newvalue (optional)
573 Throws : Exception if the argument is not one of 'dna', 'rna', or 'protein'
578 my ($self, $value) = @_;
580 if ( defined $value) {
582 unless ($valid_alphabet_cache{$value}) {
583 # instead of hard-coding the allowed values once more, we check by
584 # creating a dummy sequence object
586 require Bio
::PrimarySeq
;
587 my $seq = Bio
::PrimarySeq
->new( -verbose
=> $self->verbose,
588 -alphabet
=> $value );
591 $self->throw("Invalid alphabet: $value\n. See Bio::PrimarySeq for allowed values.");
593 $valid_alphabet_cache{$value} = 1;
595 $self->{'alphabet'} = $value;
597 return $self->{'alphabet'};
601 =head2 _load_format_module
603 Title : _load_format_module
604 Usage : *INTERNAL SeqIO stuff*
605 Function: Loads up (like use) a module at run time on demand
612 sub _load_format_module
{
613 my ($self, $format) = @_;
614 my $module = "Bio::SeqIO::" . $format;
618 $ok = $self->_load_module($module);
622 $self: $format cannot be found
624 For more information about the SeqIO system please see the SeqIO docs.
625 This includes ways of checking for formats at compile time, not run time
633 =head2 _concatenate_lines
635 Title : _concatenate_lines
636 Usage : $s = _concatenate_lines($line, $continuation_line)
637 Function: Private. Concatenates two strings assuming that the second stems
638 from a continuation line of the first. Adds a space between both
639 unless the first ends with a dash.
641 Takes care of either arg being empty.
648 sub _concatenate_lines
{
649 my ($self, $s1, $s2) = @_;
650 $s1 .= " " if($s1 && ($s1 !~ /-$/) && $s2);
651 return ($s1 ?
$s1 : "") . ($s2 ?
$s2 : "");
658 Usage : $obj->_filehandle($newval)
659 Function: This method is deprecated. Call _fh() instead.
661 Returns : value of _filehandle
662 Args : newvalue (optional)
667 my ($self,@args) = @_;
668 return $self->_fh(@args);
674 Title : _guess_format
675 Usage : $obj->_guess_format($filename)
676 Function: guess format based on file suffix
678 Returns : guessed format of filename (lower case)
680 Notes : formats that _filehandle() will guess include fasta,
681 genbank, scf, pir, embl, raw, gcg, ace, bsml, swissprot,
688 return unless $_ = shift;
690 return 'abi' if /\.ab[i1]$/i;
691 return 'ace' if /\.ace$/i;
692 return 'alf' if /\.alf$/i;
693 return 'bsml' if /\.(bsm|bsml)$/i;
694 return 'ctf' if /\.ctf$/i;
695 return 'embl' if /\.(embl|ebl|emb|dat)$/i;
696 return 'entrezgene' if /\.asn$/i;
697 return 'exp' if /\.exp$/i;
698 return 'fasta' if /\.(fasta|fast|fas|seq|fa|fsa|nt|aa|fna|faa)$/i;
699 return 'fastq' if /\.fastq$/i;
700 return 'gcg' if /\.gcg$/i;
701 return 'genbank' if /\.(gb|gbank|genbank|gbk|gbs)$/i;
702 return 'phd' if /\.(phd|phred)$/i;
703 return 'pir' if /\.pir$/i;
704 return 'pln' if /\.pln$/i;
705 return 'qual' if /\.qual$/i;
706 return 'raw' if /\.txt$/i;
707 return 'scf' if /\.scf$/i;
708 # from Strider 1.4 Release Notes: The file name extensions used by
709 # Strider 1.4 are ".xdna", ".xdgn", ".xrna" and ".xprt" for DNA,
710 # DNA Degenerate, RNA and Protein Sequence Files, respectively
711 return 'strider' if /\.(xdna|xdgn|xrna|xprt)$/i;
712 return 'swiss' if /\.(swiss|sp)$/i;
713 return 'ztr' if /\.ztr$/i;
724 my ($class,$val) = @_;
725 return bless {'seqio' => $val}, $class;
731 return $self->{'seqio'}->next_seq() || undef unless wantarray;
733 push @list, $obj while $obj = $self->{'seqio'}->next_seq();
740 $self->{'seqio'}->write_seq(@_);
744 =head2 sequence_factory
746 Title : sequence_factory
747 Usage : $seqio->sequence_factory($seqfactory)
748 Function: Get/Set the Bio::Factory::SequenceFactoryI
749 Returns : Bio::Factory::SequenceFactoryI
750 Args : [optional] Bio::Factory::SequenceFactoryI
754 sub sequence_factory
{
755 my ($self, $obj) = @_;
757 if( ! ref($obj) || ! $obj->isa('Bio::Factory::SequenceFactoryI') ) {
758 $self->throw("Must provide a valid Bio::Factory::SequenceFactoryI object to ".ref($self)."::sequence_factory()");
760 $self->{'_seqio_seqfactory'} = $obj;
761 my $builder = $self->sequence_builder();
762 if($builder && $builder->can('sequence_factory') &&
763 (! $builder->sequence_factory())) {
764 $builder->sequence_factory($obj);
767 $self->{'_seqio_seqfactory'};
771 =head2 object_factory
773 Title : object_factory
774 Usage : $obj->object_factory($newval)
775 Function: This is an alias to sequence_factory with a more generic name.
777 Returns : value of object_factory (a scalar)
778 Args : on set, new value (a scalar or undef, optional)
783 return shift->sequence_factory(@_);
787 =head2 sequence_builder
789 Title : sequence_builder
790 Usage : $seqio->sequence_builder($seqfactory)
791 Function: Get/Set the Bio::Factory::ObjectBuilderI used to build sequence
792 objects. This applies to rich sequence formats only, e.g. genbank
795 If you do not set the sequence object builder yourself, it
796 will in fact be an instance of L<Bio::Seq::SeqBuilder>, and
797 you may use all methods documented there to configure it.
799 Returns : a Bio::Factory::ObjectBuilderI compliant object
800 Args : [optional] a Bio::Factory::ObjectBuilderI compliant object
804 sub sequence_builder
{
805 my ($self, $obj) = @_;
807 if( ! ref($obj) || ! $obj->isa('Bio::Factory::ObjectBuilderI') ) {
808 $self->throw("Must provide a valid Bio::Factory::ObjectBuilderI object to ".ref($self)."::sequence_builder()");
810 $self->{'_object_builder'} = $obj;
812 $self->{'_object_builder'};
816 =head2 location_factory
818 Title : location_factory
819 Usage : $seqio->location_factory($locfactory)
820 Function: Get/Set the Bio::Factory::LocationFactoryI object to be used for
821 location string parsing
822 Returns : a Bio::Factory::LocationFactoryI implementing object
823 Args : [optional] on set, a Bio::Factory::LocationFactoryI implementing
828 sub location_factory
{
829 my ($self,$obj) = @_;
831 if( ! ref($obj) || ! $obj->isa('Bio::Factory::LocationFactoryI') ) {
832 $self->throw("Must provide a valid Bio::Factory::LocationFactoryI" .
833 " object to ".ref($self)."->location_factory()");
835 $self->{'_seqio_locfactory'} = $obj;
837 $self->{'_seqio_locfactory'};