# $Id$

=head1 NAME

Bio::Matrix::PSM::SiteMatrixI - SiteMatrixI interface; implementations hold a
position scoring matrix (or position weight matrix) and log-odds

=head1 SYNOPSIS

  # You cannot use this module directly; see Bio::Matrix::PSM::SiteMatrix
  # for an example implementation

=head1 DESCRIPTION

SiteMatrix is designed to provide some basic methods for working with
position scoring (weight) matrices, such as those describing transcription
factor binding sites. A DNA PSM consists of four vectors with the
frequencies of {A,C,G,T}. This is the minimum information you should
provide to construct a PSM object. The vectors can be provided as strings
in which each character is the frequency x 10 rounded to an integer, going
from 0 to 'a', where 'a' represents the maximum (10). This is like MEME's
compressed representation of a matrix and is quite useful when working
with a relational database. If arrays are provided as input (references
to arrays, actually), the values can be any numbers, real or integer
(frequencies or counts).

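
As a rough illustration of the compact string form described above, the
sketch below decodes one such string into per-position frequencies. The
helper name decode_freq_string is hypothetical and is not part of this
interface; it assumes every character is a digit 0..9 or 'a' (for 10).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bio::Matrix::PSM): decode a string in
# which every character is the frequency x 10, written as 0..9 or 'a' (= 10).
sub decode_freq_string {
    my ($str) = @_;
    my @freq;
    for my $ch (split //, lc $str) {
        die "Unexpected character '$ch'\n" unless $ch =~ /^[0-9a]$/;
        push @freq, ($ch eq 'a' ? 10 : $ch) / 10;
    }
    return @freq;
}

my @pA = decode_freq_string('a501');
print join(' ', @pA), "\n";    # prints "1 0.5 0 0.1"
```
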
When creating the object you can ask the constructor to apply a simple
pseudo-count correction by adding a number (typically 1) to all positions
(with the -correction option). The frequencies are calculated after the
number has been added. Only use the correction when you supply counts, not
frequencies.

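
The pseudo-count step can be pictured with the minimal sketch below, which
works on the counts for a single position. The helper counts_to_freq and
its calling convention are assumptions made for illustration; an actual
implementation may organise the calculation differently.

```perl
use strict;
use warnings;

# Hypothetical helper: add a pseudo-count to every cell of one column of
# counts (A, C, G, T at a single position), then convert to frequencies.
sub counts_to_freq {
    my ($counts, $correction) = @_;    # $counts is a hash ref, e.g. {A=>3,...}
    $correction = 0 unless defined $correction;
    my %adj = map { $_ => $counts->{$_} + $correction } qw(A C G T);
    my $total = 0;
    $total += $_ for values %adj;
    die "Position vector is (0,0,0,0)\n" if $total == 0;
    return map { $_ => $adj{$_} / $total } qw(A C G T);
}

my %f = counts_to_freq({ A => 7, C => 0, G => 1, T => 0 }, 1);
printf "%s=%.3f ", $_, $f{$_} for qw(A C G T);
print "\n";    # prints "A=0.667 C=0.083 G=0.167 T=0.083 "
```

Without the correction the C and T frequencies at this position would be
exactly zero, which is what the pseudo-count is meant to avoid.
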
Throws an exception if you mix arrays and strings as input (for example,
the A vector is given as an array and the C vector as a string), if the
position vector is (0,0,0,0), or if one of the probability vectors is
shorter than the rest.

Summary of the methods I use most frequently (details below):

  iupac - return IUPAC compliant consensus as a string
  score - Returns the score as a real number
  IC - information content. Returns a real number
  id - identifier. Returns a string
  accession - accession number. Returns a string
  next_pos - return the sequence probability for each letter, IUPAC
             symbol, IUPAC probability and simple sequence
             consensus letter for this position. Rewind at the end.
             Returns a hash.
  pos - current position get/set. Returns an integer.
  regexp - construct a regular expression based on IUPAC consensus.
           For example AGWV will be [Aa][Gg][AaTt][AaCcGg]
  width - site width
  get_string - gets the probability vector for a single base as a string.
  get_array - gets the probability vector for a single base as an array.
  get_logs_array - gets the log-odds vector for a single base as an array.

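
The AGWV example given for regexp in the summary above can be reproduced
with the short sketch below. It uses the standard IUPAC nucleotide codes;
the helper name iupac_to_regexp is hypothetical, and the sketch does not
attempt any special handling of N beyond expanding it to the four bases.

```perl
use strict;
use warnings;

# Standard IUPAC nucleotide codes and the bases they stand for.
my %IUPAC = (
    A => 'A',   C => 'C',   G => 'G',   T => 'T',
    R => 'AG',  Y => 'CT',  S => 'CG',  W => 'AT',  K => 'GT', M => 'AC',
    B => 'CGT', D => 'AGT', H => 'ACT', V => 'ACG', N => 'ACGT',
);

# Hypothetical helper: expand each IUPAC symbol into a character class
# that accepts upper and lower case, as in the AGWV example.
sub iupac_to_regexp {
    my ($consensus) = @_;
    my $re = '';
    for my $sym (split //, uc $consensus) {
        my $bases = $IUPAC{$sym} or die "Unknown IUPAC symbol '$sym'\n";
        $re .= '[' . join('', map { uc($_) . lc($_) } split //, $bases) . ']';
    }
    return $re;
}

print iupac_to_regexp('AGWV'), "\n";    # prints "[Aa][Gg][AaTt][AaCcGg]"
```
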
Methods that may be of interest to anyone who wants to store a PSM in a
relational database without creating an entry for each position provide
the ability to compress a PSM vector into a string, usually losing less
than 1% of the data. This can be done with:

  my $str=$matrix->get_compressed_freq('A');

or

  my $str=$matrix->get_compressed_logs('A');

Loading from a database should be done with new, but this is not yet
implemented. However, you can still uncompress such a string with:

  my @arr=Bio::Matrix::PSM::_uncompress_string ($str,1,1); # for PSM

or

  my @arr=Bio::Matrix::PSM::_uncompress_string ($str,1000,2); # for log odds

=head1 FEEDBACK

=head2 Mailing Lists

User feedback is an integral part of the evolution of this and other
Bioperl modules. Send your comments and suggestions preferably to one
of the Bioperl mailing lists. Your participation is much appreciated.

  bioperl-l@bioperl.org                  - General discussion
  http://bioperl.org/wiki/Mailing_lists  - About the mailing lists

=head2 Support

Please direct usage questions or support issues to the mailing list:

I<bioperl-l@bioperl.org>

rather than to the module maintainer directly. Many experienced and
responsive experts will be able to look at the problem and quickly
address it. Please include a thorough description of the problem
with code and data examples if at all possible.

=head2 Reporting Bugs

Report bugs to the Bioperl bug tracking system to help us keep track
of the bugs and their resolution. Bug reports can be submitted via the
web:

  http://bugzilla.open-bio.org/

=head1 AUTHOR - Stefan Kirov

Email skirov@utk.edu

=head1 APPENDIX

=cut

# Let the code begin...

package Bio::Matrix::PSM::SiteMatrixI;

# use strict;
use base qw(Bio::Root::RootI);

=head2 calc_weight

 Title   : calc_weight
 Usage   : $self->calc_weight({A=>0.2562,C=>0.2438,G=>0.2432,T=>0.2568});
 Function: Recalculates the PSM (or weights) based on the PFM (the frequency
           matrix) and a user-supplied background model.
 Throws  : if no model is supplied
 Example :
 Returns :
 Args    : reference to a hash with background frequencies for A,C,G and T

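
One common way to derive weights from frequencies and a background model is
a log2-odds ratio, sketched below for a single base. This is an assumption
made for illustration and not necessarily the formula an implementation
uses; the helper log_odds and the floor applied to zero frequencies are
hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: turn per-position frequencies for one base into
# log2-odds against a background frequency. Frequencies are floored at a
# small value so that the logarithm is always defined (an assumption of
# this sketch).
sub log_odds {
    my ($freqs, $background, $floor) = @_;
    die "No background model supplied\n" unless defined $background;
    $floor = 1e-6 unless defined $floor;
    my @logs;
    for my $f (@$freqs) {
        my $p = $f > $floor ? $f : $floor;
        push @logs, log($p / $background) / log(2);
    }
    return @logs;
}

my %bg = (A => 0.2562, C => 0.2438, G => 0.2432, T => 0.2568);
my @lA = log_odds([0.5124, 0.2562, 0.1281], $bg{A});
printf "%.2f ", $_ for @lA;
print "\n";    # prints "1.00 0.00 -1.00 "
```
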
=cut

sub calc_weight {
    my $self = shift;
    $self->throw_not_implemented();
}

=head2 next_pos

 Title   : next_pos
 Usage   : my %base=$site->next_pos;
 Function: Retrieves the next position features: frequencies and weights for
           A,C,G,T, the main letter (as in consensus) and the
           probability for this letter to occur at this position and
           the current position
 Throws  :
 Example :
 Returns : hash (pA,pC,pG,pT,lA,lC,lG,lT,base,prob,rel)
 Args    : none

=cut

sub next_pos {
    my $self = shift;
    $self->throw_not_implemented();
}

=head2 curpos

 Title   : curpos
 Usage   : my $pos=$site->curpos;
 Function: Gets/sets the current position. Converts to 0 if the argument is
           negative and to width if it is greater than width
 Throws  :
 Example :
 Returns : integer
 Args    : integer

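
The conversion rule described above amounts to clamping the requested
position. A minimal sketch, using a hypothetical stand-alone helper rather
than the accessor itself:

```perl
use strict;
use warnings;

# Hypothetical helper: clamp a requested position to the range 0..width.
sub clamp_pos {
    my ($pos, $width) = @_;
    return 0      if $pos < 0;
    return $width if $pos > $width;
    return $pos;
}

print join(',', map { clamp_pos($_, 8) } (-3, 5, 12)), "\n";    # prints "0,5,8"
```
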
=cut

sub curpos {
    my $self = shift;
    $self->throw_not_implemented();
}

=head2 e_val

 Title   : e_val
 Usage   : my $score=$site->e_val;
 Function: Gets/sets the e-value
 Throws  :
 Example :
 Returns : real number
 Args    : real number

=cut

sub e_val {
    my $self = shift;
    $self->throw_not_implemented();
}

=head2 consensus

 Title   : consensus
 Usage   :
 Function: Returns the consensus
 Returns : string
 Args    : (optional) threshold value 1 to 10, default 5
           '5' means the returned characters had a 50% or higher presence at
           their position

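
The threshold rule can be sketched as follows. The helper simple_consensus
is hypothetical, and emitting 'N' for positions where no base reaches the
threshold is an assumption of this sketch, since the text above does not
say what is returned in that case.

```perl
use strict;
use warnings;

# Hypothetical helper: for each position take the most frequent base if
# its frequency is at least threshold/10, otherwise emit 'N' (assumption).
sub simple_consensus {
    my ($freq, $threshold) = @_;    # $freq: base => array ref of frequencies
    $threshold = 5 unless defined $threshold;
    my $width = scalar @{ $freq->{A} };
    my $cons  = '';
    for my $i (0 .. $width - 1) {
        my ($best) = sort { $freq->{$b}[$i] <=> $freq->{$a}[$i] } qw(A C G T);
        $cons .= $freq->{$best}[$i] >= $threshold / 10 ? $best : 'N';
    }
    return $cons;
}

my %freq = (
    A => [ 0.90, 0.4 ],
    C => [ 0.05, 0.3 ],
    G => [ 0.05, 0.2 ],
    T => [ 0.00, 0.1 ],
);
print simple_consensus(\%freq), "\n";    # prints "AN"
```
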
=cut

sub consensus {
    my $self = shift;
    $self->throw_not_implemented();
}

=head2 accession_number

 Title   : accession_number
 Usage   :
 Function: Accession number; this is a unique id for the SiteMatrix object,
           as well as for any other object inheriting from SiteMatrix
 Throws  :
 Example :
 Returns : string
 Args    : string

=cut

sub accession_number {
    my $self = shift;
    $self->throw_not_implemented();
}

=head2 width

 Title   : width
 Usage   : my $width=$site->width;
 Function: Returns the length of the site
 Throws  :
 Example :
 Returns : number
 Args    :

=cut

sub width {
    my $self = shift;
    $self->throw_not_implemented();
}

=head2 IUPAC

 Title   : IUPAC
 Usage   : my $iupac_consensus=$site->IUPAC;
 Function: Returns IUPAC compliant consensus
 Throws  :
 Example :
 Returns : string
 Args    :

=cut

sub IUPAC {
    my $self = shift;
    $self->throw_not_implemented();
}

=head2 IC

 Title   : IC
 Usage   : my $ic=$site->IC;
 Function: Information content
 Throws  :
 Example :
 Returns : real number
 Args    : none

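
Information content is often computed per position as 2 bits plus the sum
of p*log2(p) over the four bases, assuming a uniform background. The sketch
below sums that quantity over all positions; whether an implementation uses
exactly this definition is an assumption, and the helper
information_content is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: total information content in bits, assuming a
# uniform background (2 bits maximum per DNA position).
sub information_content {
    my (%freq) = @_;    # base => array ref of per-position frequencies
    my $width = scalar @{ $freq{A} };
    my $ic    = 0;
    for my $i (0 .. $width - 1) {
        my $pos_ic = 2;
        for my $base (qw(A C G T)) {
            my $p = $freq{$base}[$i];
            $pos_ic += $p * log($p) / log(2) if $p > 0;
        }
        $ic += $pos_ic;
    }
    return $ic;
}

# First position is fully conserved (2 bits), second is uniform (0 bits).
my $ic = information_content(
    A => [ 1, 0.25 ], C => [ 0, 0.25 ], G => [ 0, 0.25 ], T => [ 0, 0.25 ],
);
printf "%.2f\n", $ic;    # prints "2.00"
```
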
=cut

sub IC {
    my $self = shift;
    $self->throw_not_implemented();
}

=head2 get_string

 Title   : get_string
 Usage   : my $freq_A=$site->get_string('A');
 Function: Returns the given probability vector as a string. Useful if you
           want to store things in a relational database, where arrays are
           not the first choice
 Throws  : If the argument is outside {A,C,G,T}
 Example :
 Returns : string
 Args    : character {A,C,G,T}

=cut

sub get_string {
    my $self = shift;
    $self->throw_not_implemented();
}

=head2 id

 Title   : id
 Usage   : my $id=$site->id;
 Function: Gets/sets the site id
 Throws  :
 Example :
 Returns : string
 Args    : string

=cut

sub id {
    my $self = shift;
    $self->throw_not_implemented();
}

=head2 regexp

 Title   : regexp
 Usage   : my $regexp=$site->regexp;
 Function: Returns a regular expression which matches the IUPAC convention.
           N will match X, N, - and .
 Throws  :
 Example :
 Returns : string
 Args    :

=cut

sub regexp {
    my $self = shift;
    $self->throw_not_implemented();
}

=head2 regexp_array

 Title   : regexp_array
 Usage   : my @regexp=$site->regexp_array;
 Function: Returns a regular expression which matches the IUPAC convention.
           N will match X, N, - and .
 Throws  :
 Example :
 Returns : array
 Args    :
 To do   : I have separated regexp and regexp_array, but
           maybe they can be rewritten as one - just check what
           should be returned

=cut

sub regexp_array {
    my $self = shift;
    $self->throw_not_implemented();
}

=head2 get_array

 Title   : get_array
 Usage   : my @freq_A=$site->get_array('A');
 Function: Returns an array with frequencies for a specified base
 Throws  :
 Example :
 Returns : array
 Args    : char

=cut

sub get_array {
    my $self = shift;
    $self->throw_not_implemented();
}

=head2 _to_IUPAC

 Title   : _to_IUPAC
 Usage   :
 Function: Converts a single position to an IUPAC compliant symbol and
           returns its probability. For rules see the implementation.
 Throws  :
 Example :
 Returns : char, real number
 Args    : real numbers for A,C,G,T (positional)

=cut

sub _to_IUPAC {
    my $self = shift;
    $self->throw_not_implemented();
}

=head2 _to_cons

 Title   : _to_cons
 Usage   :
 Function: Converts a single position to a simple consensus character and
           returns its probability. For rules see the implementation.
 Throws  :
 Example :
 Returns : char, real number
 Args    : real numbers for A,C,G,T (positional)

=cut

sub _to_cons {
    my $self = shift;
    $self->throw_not_implemented();
}

=head2 _calculate_consensus

 Title   : _calculate_consensus
 Usage   :
 Function: Internal stuff
 Throws  :
 Example :
 Returns :
 Args    :

=cut

sub _calculate_consensus {
    my $self = shift;
    $self->throw_not_implemented();
}

=head2 _compress_array

 Title   : _compress_array
 Usage   :
 Function: Will compress an array of real signed numbers to a string (i.e. a
           vector of bytes): -127 to +127 for bi-directional (signed) and
           0..255 for unsigned
 Throws  :
 Example : Internal stuff
 Returns : String
 Args    : array reference, followed by a max value and
           direction (optional, default 1 - unsigned); 1 is unsigned, any
           other value is signed.

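
The one-byte-per-value idea can be pictured with the round-trip sketch
below. The helper names, the scaling and the offset of 128 used for signed
values are assumptions made for illustration; the byte layout used by an
actual implementation may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: scale each value to one byte. Unsigned values map
# 0..max onto 0..255; signed values map -max..max onto 1..255.
sub compress_array {
    my ($array, $max, $direction) = @_;
    $direction = 1 unless defined $direction;
    my ($low, $scale, $offset) =
        $direction == 1 ? (0, 255, 0) : (-$max, 127, 128);
    my $str = '';
    for my $v (@$array) {
        my $c = $v < $low ? $low : $v > $max ? $max : $v;    # clamp to range
        $str .= chr(int($c / $max * $scale + $offset + 0.5));
    }
    return $str;
}

# Hypothetical inverse of compress_array.
sub uncompress_string {
    my ($str, $max, $direction) = @_;
    $direction = 1 unless defined $direction;
    my ($scale, $offset) = $direction == 1 ? (255, 0) : (127, 128);
    my @values;
    push @values, (ord($_) - $offset) / $scale * $max for split //, $str;
    return @values;
}

my @freq = (0, 0.25, 0.5, 1);
my @back = uncompress_string(compress_array(\@freq, 1, 1), 1, 1);
printf "%.3f ", $_ for @back;
print "\n";    # prints "0.000 0.251 0.502 1.000 "
```

In this sketch the rounding error for unsigned values is at most half of
1/255 of the maximum, so each restored frequency stays well within 1% of
the full range.
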
=cut

sub _compress_array {
    my $self = shift;
    $self->throw_not_implemented();
}

=head2 _uncompress_string

 Title   : _uncompress_string
 Usage   :
 Function: Will uncompress a string (vector of bytes) to create an array of
           real signed numbers (the opposite of _compress_array)
 Throws  :
 Example : Internal stuff
 Returns : array
 Args    : string, followed by a max value and
           direction (optional, default 1 - unsigned); 1 is unsigned, any
           other value is signed.

=cut

sub _uncompress_string {
    my $self = shift;
    $self->throw_not_implemented();
}

=head2 get_compressed_freq

 Title   : get_compressed_freq
 Usage   :
 Function: A method to provide a compressed frequency vector. It uses one
           byte to code the frequency for one of the probability vectors
           for one position. Useful for relational databases. Improvement
           on the previous 0..a coding.
 Throws  :
 Example : my $strA=$self->get_compressed_freq('A');
 Returns : String
 Args    : char

=cut

sub get_compressed_freq {
    my $self = shift;
    $self->throw_not_implemented();
}

=head2 get_compressed_logs

 Title   : get_compressed_logs
 Usage   :
 Function: A method to provide a compressed log-odds vector. It uses one
           byte to code the log value for one of the log-odds vectors for
           one position.
 Throws  :
 Example : my $strA=$self->get_compressed_logs('A');
 Returns : String
 Args    : char

=cut

sub get_compressed_logs {
    my $self = shift;
    $self->throw_not_implemented();
}

=head2 sequence_match_weight

 Title   : sequence_match_weight
 Usage   :
 Function: This method will calculate the score of a match, based on the PWM
           if such is associated with the matrix object. Returns undef if no
           PWM data is available.
 Throws  : if the length of the sequence is different from the matrix width
 Example : my $score=$matrix->sequence_match_weight('ACGGATAG');
 Returns : Floating point
 Args    : string

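
A match score of this kind is commonly the sum of the per-position weights
of the bases in the sequence. The sketch below assumes that definition and
a plain hash of log-odds vectors; the helper match_weight and the example
values are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: score a sequence as the sum of per-position
# log-odds. $logs maps each base to an array ref of log-odds values.
sub match_weight {
    my ($logs, $seq) = @_;
    my $width = scalar @{ $logs->{A} };
    die "Sequence length differs from matrix width\n"
        unless length($seq) == $width;
    my $score = 0;
    my $i     = 0;
    for my $base (split //, uc $seq) {
        die "Unexpected base '$base'\n" unless exists $logs->{$base};
        $score += $logs->{$base}[ $i++ ];
    }
    return $score;
}

# Made-up log-odds values for a site of width 3.
my %logs = (
    A => [  1.5, -2.0, -1.0 ],
    C => [ -1.0,  1.0, -2.0 ],
    G => [ -2.0, -1.0,  1.5 ],
    T => [ -1.5, -0.5, -1.0 ],
);
print match_weight(\%logs, 'ACG'), "\n";    # prints "4"
```
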
=cut

sub sequence_match_weight {
    my $self = shift;
    $self->throw_not_implemented();
}

=head2 get_all_vectors

 Title   : get_all_vectors
 Usage   :
 Function: Returns all possible sequence vectors that satisfy the PFM under
           a given threshold
 Throws  : If the threshold is outside of 0..1 (no sense to do that)
 Example : my @vectors=$self->get_all_vectors(0.4);
 Returns : Array of strings
 Args    : (optional) floating point threshold

=cut

sub get_all_vectors {
    my $self = shift;
    $self->throw_not_implemented();
}

1;