bump rc version
[bioperl-live.git] / Bio / Seq / EncodedSeq.pm
blobcad3bd8fe2f6ca518755c05ad46ef6e2a1da2291
2 # BioPerl module for Bio::Seq::EncodedSeq
4 # Please direct questions and support issues to <bioperl-l@bioperl.org>
6 # Cared for by Aaron Mackey
8 # Copyright Aaron Mackey
10 # You may distribute this module under the same terms as perl itself
12 # POD documentation - main docs before the code
14 =head1 NAME
16 Bio::Seq::EncodedSeq - subtype of L<Bio::LocatableSeq|Bio::LocatableSeq> to store DNA that encodes a protein
18 =head1 SYNOPSIS
20 $obj = Bio::Seq::EncodedSeq->new( -seq => $dna,
21 -encoding => "CCCCCCCIIIIICCCCC",
22 -start => 1,
23 -strand => 1,
24 -length => 17 );
26 # splice out (and possibly revcomp) the coding sequence
27 $cds = obj->cds;
29 # obtain the protein translation of the sequence
30 $prot = $obj->translate;
32 # other access/inspection routines as with Bio::LocatableSeq and
33 # Bio::SeqI; note that coordinates are relative only to the DNA
34 # sequence, not it's implicit encoded protein sequence.
36 =head1 DESCRIPTION
38 Bio::Seq::EncodedSeq is a L<Bio::LocatableSeq|Bio::LocatableSeq>
39 object that holds a DNA sequence as well as information about the
40 coding potential of that DNA sequence. It is meant to be useful in an
41 alignment context, where the DNA may contain frameshifts, gaps and/or
42 introns, or in describing the transcript of a gene. An EncodedSeq
43 provides the ability to access the "spliced" coding sequence, meaning
44 that all introns and gaps are removed, and any frameshifts are
45 adjusted to provide a "clean" CDS.
47 In order to make simultaneous use of either the DNA or the implicit
48 encoded protein sequence coordinates, please see
49 L<Bio::Coordinate::EncodingPair>.
51 =head1 ENCODING
53 We use the term "encoding" here to refer to the series of symbols that
54 we use to identify which residues of a DNA sequence are protein-coding
55 (i.e. part of a codon), intronic, part of a 5' or 3', frameshift
56 "mutations", etc. From this information, a Bio::Seq::EncodedSeq is
57 able to "figure out" its translational CDS. There are two sets of
58 coding characters, one termed "implicit" and one termed "explicit".
60 The "implicit" encoding is a bit simpler than the "explicit" encoding:
61 'C' is used for any nucleotide that's part of a codon, 'U' for any
62 UTR, etc. The full list is shown below:
64 Code Meaning
65 ---- -------
66 C coding
67 I intronic
68 U untranslated
69 G gapped (for use in alignments)
70 F a "forward", +1 frameshift
71 B a "backward", -1 frameshift
73 The "explicit" encoding is just an expansion of the "implicit"
74 encoding, to denote phase:
76 Code Meaning
77 ---- -------
78 C coding, 1st codon position
79 D coding, 2nd codon position
80 E coding, 3rd codon position
82 I intronic, phase 0 (relative to intron begin)
83 J intronic, phase 1
84 K intronic, phase 2
86 U untranslated 3'UTR
87 V untranslated 5'UTR
89 G gapped (for use in alignments)
90 F a "forward", +1 frameshift
91 B a "backward", -1 frameshift
93 Note that the explicit coding is meant to provide easy access to
94 position/phase specific nucleotides:
96 $obj = Bio::Seq::EncodedSeq->new(-seq => "ACAATCAGACTACG...",
97 -encoding => "CCCCCCIII..."
100 # fetch arrays of nucleotides at each codon position:
101 my @pos1 = $obj->dnaseq(encoding => 'C', explicit => 1);
102 my @pos2 = $obj->dnaseq(encoding => 'D');
103 my @pos3 = $obj->dnaseq(encoding => 'E');
105 # fetch arrays of "3-1" codon dinucleotides, useful for genomic
106 # signature analyses without compounding influences of codon bias:
107 my @pairs = $obj->dnaseq(encoding => 'EC');
109 =head1 FEEDBACK
111 =head2 Mailing Lists
113 User feedback is an integral part of the evolution of this and other
114 Bioperl modules. Send your comments and suggestions preferably to one
115 of the Bioperl mailing lists. Your participation is much appreciated.
117 bioperl-l@bioperl.org - General discussion
118 http://bioperl.org/wiki/Mailing_lists - About the mailing lists
120 =head2 Support
122 Please direct usage questions or support issues to the mailing list:
124 I<bioperl-l@bioperl.org>
126 rather than to the module maintainer directly. Many experienced and
127 reponsive experts will be able look at the problem and quickly
128 address it. Please include a thorough description of the problem
129 with code and data examples if at all possible.
131 =head2 Reporting Bugs
133 Report bugs to the Bioperl bug tracking system to help us keep track
134 the bugs and their resolution. Bug reports can be submitted via the
135 web:
137 https://github.com/bioperl/bioperl-live/issues
139 =head1 AUTHOR - Aaron Mackey
141 Email amackey@virginia.edu
143 =head1 APPENDIX
145 The rest of the documentation details each of the object
146 methods. Internal methods are usually preceded with a _
148 =cut
152 package Bio::Seq::EncodedSeq;
154 use strict;
156 use base qw(Bio::LocatableSeq);
159 =head2 new
161 Title : new
162 Usage : $obj = Bio::Seq::EncodedSeq->new(-seq => "AGTACGTGTCATG",
163 -encoding => "CCCCCCFCCCCCC",
164 -id => "myseq",
165 -start => 1,
166 -end => 13,
167 -strand => 1
169 Function: creates a new Bio::Seq::EncodedSeq object from a supplied DNA
170 sequence
171 Returns : a new Bio::Seq::EncodedSeq object
173 Args : seq - primary nucleotide sequence used to encode the
174 protein; note that any positions involved in a
175 gap ('G') or backward frameshift ('B') should
176 have one or more gap characters; if the encoding
177 specifies G or B, but no (or not enough) gap
178 characters exist, *they'll be added*; similary,
179 if there are gap characters without a
180 corresponding G or B encoding, G's will be
181 inserted into the encoding. This allows some
182 flexibility in specifying your sequence and
183 coding without having to calculate a lot of the
184 encoding for yourself.
186 encoding - a string of characters (see Encoding Table)
187 describing backwards frameshifts implied by the
188 encoding but not present in the sequence will be
189 added (as '-'s) to the sequence. If not
190 supplied, it will be assumed that all positions
191 are coding (C). Encoding may include either
192 implicit phase encoding characters (i.e. "CCC")
193 and/or explicit encoding characters (i.e. "CDE").
194 Additionally, prefixed numbers may be used to
195 denote repetition (i.e. "27C3I28C").
197 Alternatively, encoding may be a hashref
198 datastructure, with encoding characters as keys
199 and Bio::LocationI objects (or arrayrefs of
200 Bio::LocationI objects) as values, e.g.:
202 { C => [ Bio::Location::Simple->new(1,9),
203 Bio::Location::Simple->new(11,13) ],
204 F => Bio::Location::Simple->new(10,10),
205 } # same as "CCCCCCCCCFCCC"
207 Note that if the location ranges overlap, the
208 behavior of the encoding will be undefined
209 (well, it will be defined, but only according to
210 the order in which the hash keys are read, which
211 is basically undefined ... just don't do that).
213 id, start, end, strand - as with Bio::LocatableSeq; note
214 that the coordinates are relative to the
215 encoding DNA sequence, not the implicit protein
216 sequence. If strand is reversed, then the
217 encoding is assumed to be relative to the
218 reverse strand as well.
220 =cut
222 sub new {
223 my ($self, @args) = @_;
224 $self = $self->SUPER::new(@args, -alphabet => 'dna');
225 my ($enc) = $self->_rearrange([qw(ENCODING)], @args);
226 # set the real encoding:
227 if ($enc) {
228 $self->encoding($enc);
229 } else {
230 $self->_recheck_encoding;
232 return $self;
236 =head2 encoding
238 Title : encoding
239 Usage : $obj->encoding("CCCCCC");
240 $obj->encoding( -encoding => { I => $location } );
241 $enc = $obj->encoding(-explicit => 1);
242 $enc = $obj->encoding("CCCCCC", -explicit => 1);
243 $enc = $obj->encoding(-location => $location,
244 -explicit => 1,
245 -absolute => 1 );
246 Function: get/set the objects encoding, either globally or by location(s).
247 Returns : the (possibly new) encoding string.
248 Args : encoding - see the encoding argument to the new() function.
250 explicit - whether or not to return explicit phase
251 information in the coding (i.e. "CCC" becomes
252 "CDE", "III" becomes "IJK", etc); defaults to 0.
254 location - optional; location to get/set the encoding.
255 Defaults to the entire sequence.
257 absolute - whether or not the locational elements (either
258 in the encoding hashref or the location
259 argument) are relative to the absolute start/end
260 of the Bio::LocatableSeq, or to the internal,
261 relative coordinate system (beginning at 1);
262 defaults to 0 (i.e. relative)
264 =cut
266 sub encoding {
267 my ($self, @args) = @_;
268 my ($enc, $loc, $exp, $abs) = $self->_rearrange([qw(ENCODING LOCATION EXPLICIT ABSOLUTE)], @args);
270 if (!$enc) {
271 # do nothing; _recheck_encoding will fix for us, if necessary
272 } elsif (ref $enc eq 'HASH') {
273 $self->throw( -class => 'Bio::Root::NotImplemented',
274 -text => "Hashref functionality not yet implemented;\nplease email me if you really need this.");
275 #TODO: finish all this
276 while (my ($char, $locs) = each %$enc) {
277 if (ref $locs eq 'ARRAY') {
278 } elsif (UNIVERSAL::isa($locs, "Bio::LocationI")) {
279 } else {
280 $self->throw("Only a scalar or a ref to a hash; not a ref to a @{{lc ref $enc}}");
283 } elsif (! ref $enc) {
284 $enc = uc $enc;
285 $exp = 1 if (!defined $exp && $enc =~ m/[DEJKV]/o);
287 if ($enc =~ m/\d/o) { # numerically "enhanced" encoding
288 my $numenc = $enc;
289 $enc = '';
290 while ($numenc =~ m/\G(\d*)([CDEIJKUVGFB])/g) {
291 my ($num, $char) = ($1, $2);
292 $num = 1 unless $num;
293 $enc .= $char x $num;
297 if (defined $exp && $exp == 0 && $enc =~ m/([^CIUGFB])/) {
298 $self->throw("Unrecognized character '$1' in implicit encoding");
299 } elsif ($enc =~ m/[^CDEIJKUVGFB]/) {
300 $self->throw("Unrecognized character '$1' in explicit encoding");
303 if ($loc) { # a global location, over which to apply the specified encoding.
305 # balk if too many non-gap characters present in encoding:
306 my ($ct) = $enc =~ tr/GB/GB/;
307 $ct = length($enc) - $ct;
308 $self->throw("Location length doesn't match number of coding chars in encoding!")
309 if ($loc->location_type eq 'EXACT' && $loc->length != $ct);
311 my $start = $loc->start;
312 my $end = $loc->end;
314 # strip any encoding that hangs off the ends of the sequence:
315 if ($start < $self->start) {
316 my $diff = $self->start - $start;
317 $start = $self->start;
318 $enc = substr($enc, $diff);
320 if ($end > $self->end) {
321 my $diff = $end - $self->end;
322 $end = $self->end;
323 $enc = substr($enc, -$diff);
326 my $currenc = $self->{_encoding};
327 my $currseq = $self->seq;
329 my ($spanstart, $spanend) = ($self->column_from_residue_number($start),
330 $self->column_from_residue_number($end) );
332 if ($currseq) {
333 # strip any gaps in sequence spanned by this location:
334 ($spanstart, $spanend) = ($spanend, $spanstart) if $self->strand < 0;
335 my ($before, $in, $after) = $currseq =~ m/(.{@{[ $spanstart - ($loc->location_type eq 'IN-BETWEEN' ? 0 : 1) ]}})
336 (.{@{[ $spanend - $spanstart + ($loc->location_type eq 'IN-BETWEEN' ? -1 : 1) ]}})
337 (.*)
339 $in ||= '';
340 $in =~ s/[\.\-]+//g;
341 $currseq = ($before||'') . $in. ($after||'');
342 # change seq without changing the alphabet
343 $self->seq($currseq,$self->alphabet());
346 $currenc = reverse $currenc if $self->strand < 0;
347 substr($currenc, $spanstart, $spanend - $spanstart + ($loc->location_type eq 'IN-BETWEEN' ? -1 : 1),
348 $self->strand >= 0 ? $enc : reverse $enc);
349 $currenc = reverse $currenc if $self->strand < 0;
351 $self->{_encoding} = $currenc;
352 $self->_recheck_encoding;
354 $currenc = $self->{_encoding};
355 $currenc = reverse $currenc if $self->strand < 0;
356 $enc = substr($currenc, $spanstart, length $enc);
357 $enc = reverse $enc if $self->strand < 0;
359 return $exp ? $enc: $self->_convert2implicit($enc);
361 } else {
362 # presume a global redefinition; strip any current gap
363 # characters in the sequence so they don't corrupt the
364 # encoding
365 my $dna = $self->seq;
366 $dna =~ s/[\.\-]//g;
367 $self->seq($dna, 'dna');
368 $self->{_encoding} = $enc;
370 } else {
371 $self->throw("Only a scalar or a ref to a hash; not a ref to a @{{lc ref $enc}}");
374 $self->_recheck_encoding();
376 return $exp ? $self->{_encoding} : $self->_convert2implicit($self->{_encoding});
380 sub _convert2implicit {
381 my ($self, $enc) = @_;
383 $enc =~ s/[DE]/C/g;
384 $enc =~ s/[JK]/I/g;
385 $enc =~ s/V/U/g;
387 return $enc;
391 sub _recheck_encoding {
393 my $self = shift;
395 my @enc = split //, ($self->{_encoding} || '');
397 my @nt = split(//, $self->SUPER::seq);
398 @nt = reverse @nt if $self->strand && $self->strand < 0;
400 # make sure an encoding exists!
401 @enc = ('C') x scalar grep { !/[\.\-]/o } @nt
402 unless @enc;
404 # check for gaps to be truly present in the sequence
405 # and vice versa
406 my $i;
407 for ($i = 0 ; $i < @nt && $i < @enc ; $i++) {
408 if ($nt[$i] =~ /[\.\-]/o && $enc[$i] !~ m/[GB]/o) {
409 splice(@enc, $i, 0, 'G');
410 } elsif ($nt[$i] !~ /[\.\-]/o && $enc[$i] =~ m/[GB]/o) {
411 splice(@nt, $i, 0, '-');
414 if ($i < @enc) {
415 # extra encoding; presumably all gaps?
416 for ( ; $i < @enc ; $i++) {
417 if ($enc[$i] =~ m/[GB]/o) {
418 push @nt, '-';
419 } else {
420 $self->throw("Extraneous encoding info: " . join('', @enc[$i..$#enc]));
423 } elsif ($i < @nt) {
424 for ( ; $i < @nt ; $i++) {
425 if ($nt[$i] =~ m/[\.\-]/o) {
426 push @enc, 'G';
427 } else {
428 push @enc, 'C';
433 my @cde_array = qw(C D E);
434 my @ijk_array = qw(I J K);
435 # convert any leftover implicit coding into explicit coding
436 my ($Cct, $Ict, $Uct, $Vct, $Vwarned) = (0, 0, 0, 0);
437 for ($i = 0 ; $i < @enc ; $i++) {
438 if ($enc[$i] =~ m/[CDE]/o) {
439 my $temp_index = $Cct %3;
440 $enc[$i] = $cde_array[$temp_index];
441 $Cct++; $Ict = 0; $Uct = 1;
442 $self->warn("3' untranslated encoding (V) seen prior to other coding symbols")
443 if ($Vct && !$Vwarned++);
444 } elsif ($enc[$i] =~ m/[IJK]/o) {
445 $enc[$i] = $ijk_array[$Ict % 3];
446 $Ict++; $Uct = 1;
447 $self->warn("3' untranslated encoding (V) seen before other coding symbols")
448 if ($Vct && !$Vwarned++);
449 } elsif ($enc[$i] =~ m/[UV]/o) {
450 if ($Uct == 1) {
451 $enc[$i] = 'V';
452 $Vct = 1;
454 } elsif ($enc[$i] eq 'B') {
455 $Cct++; $Ict++
456 } elsif ($enc[$i] eq 'G') {
457 # gap; leave alone
461 @nt = reverse @nt if $self->strand && $self->strand < 0;
462 $self->seq(join('', @nt), 'dna');
464 $self->{_encoding} = join '', @enc;
468 =head2 cds
470 Title : cds
471 Usage : $cds = $obj->cds(-nogaps => 1);
472 Function: obtain the "spliced" DNA sequence, by removing any
473 nucleotides that participate in an UTR, forward frameshift
474 or intron, and replacing any unknown nucleotide implied by
475 a backward frameshift or gap with N's.
476 Returns : a Bio::Seq::EncodedSeq object, with an encoding consisting only
477 of "CCCC..".
478 Args : nogaps - strip any gap characters (resulting from 'G' or 'B'
479 encodings), rather than replacing them with N's.
481 =cut
483 sub cds {
484 my ($self, @args) = @_;
486 my ($nogaps, $loc) = $self->_rearrange([qw(NOGAPS LOCATION)], @args);
487 $nogaps = 0 unless defined $nogaps;
489 my @nt = split //, $self->strand < 0 ? $self->revcom->seq : $self->seq;
490 my @enc = split //, $self->_convert2implicit($self->{_encoding});
492 my ($start, $end) = (0, scalar @nt);
494 if ($loc) {
495 $start = $loc->start;
496 $start++ if $loc->location_type eq 'IN-BETWEEN';
497 $start = $self->column_from_residue_number($start);
498 $start--;
500 $end = $loc->end;
501 $end = $self->column_from_residue_number($end);
503 ($start, $end) = ($end, $start) if $self->strand < 0;
504 $start--;
507 for (my $i = $start ; $i < $end ; $i++) {
508 if ($enc[$i] eq 'I' || $enc[$i] eq 'U' || $enc[$i] eq 'F') {
509 # remove introns, untranslated and forward frameshift nucleotides
510 $nt[$i] = undef;
511 } elsif ($enc[$i] eq 'G' || $enc[$i] eq 'B') {
512 # replace gaps and backward frameshifts with N's, unless asked not to.
513 $nt[$i] = $nogaps ? undef : 'N';
517 return ($self->can_call_new ? ref($self) : __PACKAGE__)->new(
518 -seq => join('', grep { defined } @nt[$start..--$end]),
519 -start => $self->start,
520 -end => $self->end,
521 -strand => 1,
522 -alphabet => 'dna' );
526 =head2 translate
528 Title : translate
529 Usage : $prot = $obj->translate(@args);
530 Function: obtain the protein sequence encoded by the underlying DNA
531 sequence; same as $obj->cds()->translate(@args).
532 Returns : a Bio::PrimarySeq object.
533 Args : same as the translate() function of Bio::PrimarySeqI
535 =cut
537 sub translate { shift->cds(-nogaps => 1, @_)->SUPER::translate(@_) };
540 =head2 protseq
542 Title : seq
543 Usage : $protseq = $obj->protseq();
544 Function: obtain the raw protein sequence encoded by the underlying
545 DNA sequence; This is the same as calling
546 $obj->translate()->seq();
547 Returns : a string of single-letter amino acid codes
548 Args : same as the seq() function of Bio::PrimarySeq; note that this
549 function may not be used to set the protein sequence; see
550 the dnaseq() function for that.
552 =cut
554 sub protseq { shift->cds(-nogaps => 1, @_)->SUPER::translate(@_)->seq };
557 =head2 dnaseq
559 Title : dnaseq
560 Usage : $dnaseq = $obj->dnaseq();
561 $obj->dnaseq("ACGTGTCGT", "CCCCCCCCC");
562 $obj->dnaseq(-seq => "ATG",
563 -encoding => "CCC",
564 -location => $loc );
565 @introns = $obj->$dnaseq(-encoding => 'I')
566 Function: get/set the underlying DNA sequence; will overwrite any
567 current DNA and/or encoding information present.
568 Returns : a string of single-letter nucleotide codes, including any
569 gaps implied by the encoding.
570 Args : seq - the DNA sequence to be used as a replacement
571 encoding - the encoding of the DNA sequence (see the new()
572 constructor); defaults to all 'C' if setting a
573 new DNA sequence. If no new DNA sequence is
574 being provided, then the encoding is used as a
575 "filter" for which to return fragments of
576 non-overlapping DNA that match the encoding.
577 location - optional, the location of the DNA sequence to
578 get/set; defaults to the entire sequence.
580 =cut
582 sub dnaseq {
583 my ($self, @args) = @_;
584 my ($seq, $enc, $loc) = $self->_rearrange([qw(DNASEQ ENCODING LOCATION)], @args);
585 return $self;
589 # need to overload this so that we truncate both the seq and the encoding!
590 sub trunc {
591 my ($self, $start, $end) = @_;
592 my $new = $self->SUPER::trunc($start, $end);
593 $start--;
594 my $enc = $self->{_encoding};
595 $enc = reverse $enc if $self->strand < 0;
596 $enc = substr($enc, $start, $end - $start);
597 $enc = reverse $enc if $self->strand < 0;
598 $new->encoding($enc);
599 return $new;