*** empty log message ***
[bioperl-live.git] / Bio / Tools / Sigcleave.pm
blobf65c1d31a22fc682fe4846bb7a09044ed290e795
1 #-----------------------------------------------------------------------------
2 # PACKAGE : Bio::Tools::Sigcleave.pm
3 # AUTHOR : Chris Dagdigian, dag@sonsorol.org
4 # CREATED : Jan 28 1999
5 # REVISION: $Id$
6 #
7 # Copyright (c) 1997-9 bioperl, Chris Dagdigian and others. All Rights Reserved.
8 # This module is free software; you can redistribute it and/or
9 # modify it under the same terms as Perl itself.
11 # _History_
13 # Object framework ripped from Steve Chervits's SeqPattern.pm
15 # Core EGCG Sigcleave emulation from perl code developed by
16 # Danh Nguyen & Kamalakar Gulukota which itself was based
17 # loosely on Colgrove's signal.c program.
19 # The overall idea is to replicate the output of the sigcleave
20 # program which was distributed with the EGCG extension to the GCG sequence
21 # analysis package. There is also an accessor method for just getting at
22 # the raw results.
24 #-----------------------------------------------------------------------------
26 package Bio::Tools::Sigcleave;
28 use Bio::Root::Global qw(:devel);
29 use Bio::Seq ();
31 @ISA = qw(Bio::Seq);
32 use strict;
33 use vars qw ($ID $VERSION %WeightTable);
34 $ID = 'Bio::Tools::Sigcleave';
35 $VERSION = 0.01;
37 =head1 NAME
39 Bio::Tools::Sigcleave.pm - Bioperl object for sigcleave analysis
41 =head1 SYNOPSIS
43 =head2 Object Creation
45 use Bio::Tools::Sigcleave ();
47 $sigcleave_object = new Bio::Tools::Sigcleave(-file=>'sigtest.aa',
48 -desc=>'test sigcleave protein seq',
49 -type=>'AMINO',
50 -threshold=>'3.5',
53 Sigcleave objects can be created via the same methods as Bio::Seq objects. The
54 one additional parameter is "-threshold" which sets the score reporting limit
55 for the algorithim. The above exmple shows a sigcleave object being created
56 from a protein sequence file. See the Bio::Seq documention to see the other ways
57 that objects can be created.
59 =head2 Object Methods & Accessors
61 %raw_results = $sigcleave_object->signals;
63 $formatted_output = $sigcleave_object->pretty_print;
65 =head1 INSTALLATION
67 This module is included with the central Bioperl distribution:
69 http://bio.perl.org/Core/Latest
70 ftp://bio.perl.org/pub/DIST
72 Follow the installation instructions included in the README file.
74 =head1 DESCRIPTION
76 "Sigcleave" was a program distributed as part of the free EGCG add-on to
77 earlier versions of the GCG Sequence Analysis package.
79 From the EGCG documentation:
80 SigCleave uses the von Heijne method to locate signal sequences, and to identify
81 the cleavage site. The method is 95% accurate in resolving signal sequences from
82 non-signal sequences with a cutoff score of 3.5, and 75-80% accurate in identifying
83 the cleavage site. The program reports all hits above a minimum value.
85 The EGCG Sigcleave program was written by Peter Rice
86 (E-mail: pmr@sanger.ac.uk Post: Informatics Division, The Sanger Centre,
87 Wellcome Trust Genome Campus, Hinxton, Cambs, CB10 1SA, UK).
89 Since EGCG is no longer distributed for the latest versions of GCG, this code
90 was developed to emulate the output of the original program as much as possible for
91 those who lost access to sigcleave when upgrading to newer versions of GCG.
93 The EGCG website can be found at: http://www.sanger.ac.uk/Software/EGCG/
95 There are 2 accessor methods for this object. "signals" will return a perl
96 associative array containing the sigcleave scores keyed by amino acid position.
97 "pretty_print" returns a formatted string similar to the output of the original
98 sigcleave utility.
100 In both cases, the "threshold" setting controls the score reporting level. If no
101 value for threshold is passed in by the user, the code defaults to a reporting value
102 of 3.5.
104 In this implemntation the accessor will never return any score/position pair which does not
105 meet the threshold limit. This is the slightly different from the behaviour of
106 the 8.1 EGCG sigcleave program which will report the highest of the under-threshold
107 results if nothing else is found.
111 Example of pretty_print output:
113 SIGCLEAVE of sigtest from: 1 to 146
115 Report scores over 3.5
116 Maximum score 4.9 at residue 131
118 Sequence: FVILAAMSIQGSA-NLQTQWKSTASLALET
119 | (signal) | (mature peptide)
120 118 131
122 Other entries above 3.5
124 Maximum score 3.7 at residue 112
126 Sequence: CSRQLFGWLFCKV-HPGAIVFVILAAMSIQGSANLQTQWKSTASLALET
127 | (signal) | (mature peptide)
128 99 112
131 =head1 USAGE
133 No warranty implied or expressed. Use at your own risk :) Users unfamiliar
134 with the original Sigcleave application should read the von Heijne papers.
136 The emphasis here is on correctly replicating the calls that 8.1 EGCG sigcleave
137 would make. This code has been tested against a non-redundant curated set
138 of 405 Swissprot proteins representing secreted, non-secreted, membrane and
139 transit proteins. Except for the EGCG sigcleave habit of reporting an
140 under-threshold score if nothing better is found the output was identical.
142 The weight matrix in this code is for eukaryote signal sequences.
144 Please see the example script located in the bioperl distribution
145 to see how this code can be used.
147 =head1 FEEDBACK
149 When updating and maintaining a module, it helps to know that people
150 are actually using it. Let us know if you find a bug, think this code
151 is useful or have any improvements/features to suggest.
153 =head2 Reporting Bugs
155 Report bugs to the Bioperl bug tracking system to help us keep track the bugs and
156 their resolution. Bug reports can be submitted via email or the web:
158 bioperl-bugs@bio.perl.org
159 http://bio.perl.org/bioperl-bugs/
161 =head1 AUTHOR
163 Chris Dagdigian, dag@sonsorol.org & others
165 =head1 VERSION
167 Bio::Tools::Sigcleave.pm, $Id$
169 =head1 COPYRIGHT
171 Copyright (c) 1999 Chris Dagdigian & others. All Rights Reserved.
172 This module is free software; you can redistribute it and/or
173 modify it under the same terms as Perl itself.
175 =head1 REFERENCES / SEE ALSO
177 von Heijne G. (1986) "A new method for predicting signal sequences cleavage sites."
178 Nucleic Acids Res. 14, 4683-4690.
180 von Heijne G. (1987) in "Sequence Analysis in Molecular Biology: Treasure Trove or Trivial Pursuit"
181 (Acad. Press, (1987), 113-117).
183 EGCG website: http://www.sanger.ac.uk/Software/EGCG/
185 =head1 APPENDIX
187 The following documentation describes the various functions
188 contained in this module. Some functions are for internal
189 use and are not meant to be called by the user; they are
190 preceded by an underscore ("_").
193 =cut
198 #### END of main POD documentation.
203 %Bio::Tools::Sigcleave::WeightTable = (
204 'A' => [16 ,13 ,14 ,15 ,20 ,18 ,18 ,17 ,25 ,15 ,47 , 6 ,80 ,18 , 6 ,14.5],
205 'C' => [3 , 6 , 9 , 7 , 9 ,14 , 6 , 8 , 5 , 6 ,19 , 3 , 9 , 8 , 3 , 4.5],
206 'D' => [0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 5 , 3 , 0 , 5 , 0 ,10 ,11 , 8.9],
207 'E' => [0 , 0 , 0 , 1 , 0 , 0 , 0 , 0 , 3 , 7 , 0 , 7 , 0 ,13 ,14 ,10.0],
208 'F' => [13 , 9 ,11 ,11 , 6 , 7 ,18 ,13 , 4 , 5 , 0 ,13 , 0 , 6 , 4 , 5.6],
209 'G' => [4 , 4 , 3 , 6 , 3 ,13 , 3 , 2 ,19 ,34 , 5 , 7 ,39 ,10 , 7 ,12.1],
210 'H' => [0 , 0 , 0 , 0 , 0 , 1 , 1 , 0 , 5 , 0 , 0 , 6 , 0 , 4 , 2 , 3.4],
211 'I' => [15 ,15 , 8 , 6 ,11 , 5 , 4 , 8 , 5 , 1 ,10 , 5 , 0 , 8 , 7 , 7.4],
212 'K' => [0 , 0 , 0 , 1 , 0 , 0 , 1 , 0 , 0 , 4 , 0 , 2 , 0 ,11 , 9 ,11.3],
213 'L' => [71 ,68 ,72 ,79 ,78 ,45 ,64 ,49 ,10 ,23 , 8 ,20 , 1 , 8 , 4 ,12.1],
214 'M' => [0 , 3 , 7 , 4 , 1 , 6 , 2 , 2 , 0 , 0 , 0 , 1 , 0 , 1 , 2 , 2.7],
215 'N' => [0 , 1 , 0 , 1 , 1 , 0 , 0 , 0 , 3 , 3 , 0 ,10 , 0 , 4 , 7 , 7.1],
216 'P' => [2 , 0 , 2 , 0 , 0 , 4 , 1 , 8 ,20 ,14 , 0 , 1 , 3 , 0 ,22 , 7.4],
217 'Q' => [0 , 0 , 0 , 1 , 0 , 6 , 1 , 0 ,10 , 8 , 0 ,18 , 3 ,19 ,10 , 6.3],
218 'R' => [2 , 0 , 0 , 0 , 0 , 1 , 0 , 0 , 7 , 4 , 0 ,15 , 0 ,12 , 9 , 7.6],
219 'S' => [9 , 3 , 8 , 6 ,13 ,10 ,15 ,16 ,26 ,11 ,23 ,17 ,20 ,15 ,10 ,11.4],
220 'T' => [2 ,10 , 5 , 4 , 5 ,13 , 7 , 7 ,12 , 6 ,17 , 8 , 6 , 3 ,10 , 9.7],
221 'V' => [20 ,25 ,15 ,18 ,13 ,15 ,11 ,27 , 0 ,12 ,32 , 3 , 0 , 8 ,17 ,11.1],
222 'W' => [4 , 3 , 3 , 1 , 1 , 2 , 6 , 3 , 1 , 3 , 0 , 9 , 0 , 2 , 0 , 1.8],
223 'Y' => [0 , 1 , 4 , 0 , 0 , 1 , 3 , 1 , 1 , 2 , 0 , 5 , 0 , 1 , 7 , 5.6],
227 ## Now we calculate the _real_ values for the weight table
230 ## yeah yeah yeah there is lots of math here that gets repeated
231 ## every single time a sigcleave object gets created. This is
232 ## a quick hack to make sure that we get the scores as accurate as
233 ## possible. Need all those significant digits....
235 ## suggestions for speedup aproaches welcome
238 my ($i, $j, $expected);
239 foreach $i (keys %WeightTable)
241 $expected = $WeightTable{$i}[15];
242 if ($expected > 0)
244 for ($j=0; $j<16; $j++)
246 if ($WeightTable{$i}[$j] == 0)
248 $WeightTable{$i}[$j] = 1;
249 if ($j == 10 || $j == 12) { $WeightTable{$i}[$j] = 1.e-10; }
251 $WeightTable{$i}[$j] = log($WeightTable{$i}[$j]/$expected);
259 #####################################################################################
260 ## CONSTRUCTOR ##
261 #####################################################################################
264 =head1 _initialize
266 Title : _initialize
267 Usage : n/a; automatically called by Bio::Root::Object::new()
268 Purpose : Verifies that the type is correct for superclass (Bio::Seq.pm)
269 : and calls superclass constructor last.
270 Returns : n/a
271 Argument : Parameters passed to new()
272 Throws : Exception if given type is not protein.
273 Comments :
274 See Also : B<Bio::Root::Object::new()>, B<Bio::Seq::_initialize()>
276 =cut
278 #----------------
279 sub _initialize {
280 #----------------
281 my($self, %param) = @_;
283 my($threshold,$type) = $self->_rearrange([qw(THRESHOLD
284 TYPE)], %param);
286 my $make = $self->SUPER::_initialize(%param);
288 # complain if not protein
289 if ($type =~ /nuc|[dr]na/i) {
290 $self->throw("Sigcleave.pm only supports protein sequences.");
291 } elsif ($type =~ /amino|pep|prot/i) {
292 $self->{type} = 'amino';
295 # set threshold if supplied, otherwise default to 3.5
296 if (defined $threshold) {
297 $self->{threshold} = $threshold;
298 } else {
299 $self->{threshold} = 3.5;
302 $self->_Analyze;
303 $make;
307 =head1 _Analyze
309 Title : _Analyze
310 Usage : N/A This is an internal method. Not meant to be called from outside
311 : the package
313 Purpose : calculates sigcleave score and amino acid position for the
314 : given protein sequence. The score reporting threshold can
315 : be adjusted by passing in the "threshold" parameter during
316 : object construction. If no threshold is passed in, the code
317 : defaults to reporting any scores equal to or above 3.5
319 Returns : nothing. results are added to the object
320 Argument : none.
321 Throws : nothing.
322 Comments : nothing.
323 See Also : n/a
325 =cut
327 #----------------
328 sub _Analyze {
329 #----------------
330 my($self) = @_;
332 ## need to shut strict() up
334 my %signals;
335 my @hitWeight = ();
336 my @hitsort = ();
337 my @hitpos = ();
338 my $maxSite = "";
339 my $seqPos = "";
340 my $istart = "";
341 my $iend = "";
342 my $icol = "";
343 my $i = "";
344 my $weight = "";
345 my $k = 0;
346 my $c = 0;
347 my $seqBegin = 0;
348 my $pVal = -13;
349 my $nVal = 2;
350 my $nHits = 0;
351 my $seqEnd = $self->seq_len;
352 my $pep = $self->str;
353 my $minWeight = $self->threshold;
355 ## The weight table is keyed by UPPERCASE letters so we uppercase
356 ## the pep string because we don't want to alter the actual object
357 ## sequence.
359 $pep =~ tr/a-z/A-Z/;
361 for($seqPos = $seqBegin; $seqPos < $seqEnd; $seqPos++)
363 $istart = (0 > $seqPos + $pVal)? 0 : $seqPos + $pVal;
364 $iend = ($seqPos + $nVal - 1 < $seqEnd)? $seqPos + $nVal - 1 : $seqEnd;
365 $icol= $iend - $istart + 1;
366 $weight = 0.00;
367 for ($k=0; $k<$icol; $k++)
369 $c = substr($pep, $istart + $k, 1);
371 ## CD: The if(defined) stuff was put in here because Sigcleave.pm
372 ## CD: kept getting warnings about undefined vals during 'make test' ...
373 if(defined $WeightTable{$c}[$k]) { $weight += $WeightTable{$c}[$k]; }
377 if ($weight >= $minWeight)
378 { $signals{$seqPos+1} = sprintf ("%.1f", $weight); }
381 $self->{"signal_scores"} = { %signals };
386 =head1 threshold
388 Title : threshold
389 Usage : $value = $self->threshold
391 Purpose : Accessor method sigcleave score reporting threshold.
393 Returns : float.
395 Argument : none.
396 Throws : none.
397 Comments : none.
398 See Also : n/a
400 =cut
402 #----------------
403 sub threshold {
404 #----------------
405 my $self = shift;
406 return $self->{threshold};
411 =head1 signals
413 Title : signals
414 Usage : %sigcleave_results = $sigcleave_object->signals;
416 Purpose : Accessor method for sigcleave results
418 Returns : Associative array. The key value represents the amino acid position
419 : and the value represents the score. Only scores that
420 : are greater than or equal to the THRESHOLD value are reported.
422 Argument : none.
423 Throws : none.
424 Comments : none.
425 See Also : THRESHOLD
427 =cut
429 #----------------
430 sub signals {
431 #----------------
432 my $self = shift;
433 my %results;
434 my $position;
436 foreach $position ( sort keys %{ $self->{signal_scores} } ) {
437 $results{$position} = $self->{signal_scores}{$position};
440 return %results;
445 =head1 pretty_print
447 Title : pretty_print
448 Usage : $output = $sigcleave_object->pretty_print;
449 : print $sigcleave_object->pretty_print;
451 Purpose : Emulates the output of the EGCG Sigcleave
452 : utility.
454 Returns : A formatted string.
455 Argument : none.
456 Throws : none.
457 Comments : none.
458 See Also : n/a
460 =cut
462 #----------------
463 sub pretty_print {
464 #----------------
465 my $self = shift;
466 my $pos;
467 my $output;
468 my $cnt = 1;
469 my %results = $self->signals;
470 my @hits = keys %results;
471 my $hitcount = $#hits; $hitcount++;
472 my $thresh = $self->threshold;
473 my $seqlen = $self->seq_len;
474 my $name = $self->id;
475 my $pep = $self->str;
476 $pep =~ tr/a-z/A-Z/;
478 $output = "SIGCLEAVE of $name from: 1 to $seqlen\n\n";
480 if($hitcount > 0) {
481 $output .= "Report scores over $thresh\n";
482 foreach $pos ((sort { $results{$b} cmp $results{$a} } keys %results)) {
483 my $start = $pos - 13;
484 $output .= sprintf ("Maximum score %1.1f at residue %3d\n",$results{$pos},$pos);
485 $output .= "\n";
486 $output .= " Sequence: ";
487 $output .= substr($pep,$start,13);
488 $output .= "-";
489 $output .= substr($pep,$pos,50);
490 $output .= "\n";
491 $output .= " | \(signal\) | \(mature peptide\)\n";
492 $output .= sprintf(" %3d %3d\n\n",$start,$pos);
494 if(($hitcount > 1) && ($cnt == 1)) {
495 $output .= " Other entries above $thresh\n\n";
497 $cnt++;
500 $output;
505 __END__
508 #########################################################################
509 # End of class
510 #########################################################################