Bio::Matrix::PSM::SiteMatrixI - SiteMatrixI interface, holds a
position scoring matrix (or position weight matrix) and log-odds

  # You cannot use this module directly; see Bio::Matrix::PSM::SiteMatrix
  # for an example implementation

SiteMatrix is designed to provide some basic methods when working with position
scoring (weight) matrices, such as transcription factor binding sites. A DNA
PSM consists of four vectors with frequencies for {A,C,G,T}. This is the
minimum information you should provide to construct a PSM object. The vectors
can be provided as strings of frequencies multiplied by 10 and rounded to an
integer, using the symbols {0..a}, where 'a' represents the maximum (10). This
is like MEME's compressed representation of a matrix and is quite useful when
working with a relational database. If arrays are provided as input (actually
references to arrays), they can hold any numbers, real or integer (frequencies
or counts).
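
As an illustration of the string form described above, here is a minimal sketch of decoding and encoding such a vector. The helper names decode_freq_string and encode_freq_string are hypothetical and are not part of this module; the sketch assumes one character per position and that 'a' is the only non-digit symbol.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): decode a "0..a" string,
# one character per position, into frequencies between 0 and 1.
sub decode_freq_string {
    my ($str) = @_;
    return map {
        my $digit = lc($_) eq 'a' ? 10 : $_;   # 'a' stands for the maximum, 10
        $digit / 10;                           # undo the x10 scaling
    } split //, $str;
}

# Hypothetical helper: encode frequencies back into the "0..a" form.
sub encode_freq_string {
    my (@freq) = @_;
    return join '', map {
        my $digit = int($_ * 10 + 0.5);        # frequency x 10, rounded to an int
        $digit == 10 ? 'a' : $digit;
    } @freq;
}

my @freq_A = decode_freq_string('a0523');
print join(',', @freq_A), "\n";                               # 1,0,0.5,0.2,0.3
print encode_freq_string(0.96, 0.04, 0.5, 0.24, 0.31), "\n";  # a0523
```

Because each position keeps only one decimal digit, this form is coarse; it trades precision for a compact, database-friendly string.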

When creating the object you can ask the constructor to make a simple
pseudocount correction by adding a number (typically 1) to all positions (with
the -correction option). The frequencies are calculated after the number has
been added. Only use correction when you supply counts, not frequencies.
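
The correction described above can be sketched as follows. This is only an illustration under stated assumptions: counts_to_freq is a hypothetical helper, not the constructor's actual code, and it assumes the counts arrive as a hash of array references keyed by base.

```perl
use strict;
use warnings;

# Hypothetical helper: add a pseudocount to every cell of a count matrix,
# then convert each position (column) to frequencies.
sub counts_to_freq {
    my ($counts, $correction) = @_;   # $counts: { A => [...], C => [...], ... }
    my @bases = qw(A C G T);
    my $width = scalar @{ $counts->{A} };
    my %freq;
    for my $i (0 .. $width - 1) {
        my $total = 0;
        $total += $counts->{$_}[$i] + $correction for @bases;
        $freq{$_}[$i] = ($counts->{$_}[$i] + $correction) / $total for @bases;
    }
    return \%freq;
}

my $freq = counts_to_freq(
    { A => [8, 0], C => [0, 4], G => [0, 4], T => [0, 0] },
    1,                                # pseudocount added to every cell
);
printf "%.3f %.3f\n", $freq->{A}[0], $freq->{T}[1];   # 0.750 0.083
```

Adding the same number to values that are already frequencies would distort them heavily, which is why the correction only makes sense for counts.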

Throws an exception if you mix arrays and strings as input (for example, the A
vector is given as an array and the C vector as a string), if a position
vector is (0,0,0,0), or if one of the probability vectors is shorter than the
rest.

Summary of the methods I use most frequently (details below):

  iupac - returns the IUPAC compliant consensus as a string
  score - returns the score as a real number
  IC - information content. Returns a real number
  id - identifier. Returns a string
  accession - accession number. Returns a string
  next_pos - returns the sequence probability for each letter, the IUPAC
         symbol, the IUPAC probability and the simple sequence
         consensus letter for this position. Rewinds at the end. Returns a hash.
  pos - current position get/set. Returns an integer.
  regexp - constructs a regular expression based on the IUPAC consensus.
         For example AGWV will be [Aa][Gg][AaTt][AaCcGg]

  get_string - gets the probability vector for a single base as a string.
  get_array - gets the probability vector for a single base as an array.
  get_logs_array - gets the log-odds vector for a single base as an array.
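
The regexp entry in the summary above can be illustrated with a short sketch. The helper iupac_to_regexp is hypothetical and only mirrors the AGWV example; it uses the standard IUPAC nucleotide codes and does not cover extra symbols such as X, - or . that an implementation may accept.

```perl
use strict;
use warnings;

# Standard IUPAC nucleotide codes and the bases each one stands for.
my %IUPAC = (
    A => 'A',   C => 'C',   G => 'G',   T => 'T',
    R => 'AG',  Y => 'CT',  S => 'CG',  W => 'AT',  K => 'GT',  M => 'AC',
    B => 'CGT', D => 'AGT', H => 'ACT', V => 'ACG', N => 'ACGT',
);

# Hypothetical helper: expand each consensus symbol into a character class
# that matches upper and lower case.
sub iupac_to_regexp {
    my ($consensus) = @_;
    my $re = '';
    for my $symbol (split //, uc $consensus) {
        my $bases = $IUPAC{$symbol}
            or die "Unknown IUPAC symbol: $symbol\n";
        $re .= '[' . join('', map { $_ . lc($_) } split //, $bases) . ']';
    }
    return $re;
}

my $re = iupac_to_regexp('AGWV');
print "$re\n";                                   # [Aa][Gg][AaTt][AaCcGg]
print "match\n" if 'ttagacgg' =~ /$re/;          # 'agac' satisfies AGWV
```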

The new methods may be of interest to anyone who wants to store a PSM in a
relational database without creating an entry for each position: they compress
a PSM vector into a string, usually losing less than 1% of the data.
This can be done with:

  my $str=$matrix->get_compressed_freq('A');

or

  my $str=$matrix->get_compressed_logs('A');

Loading from a database should be done with new, but this is not yet implemented.
However, you can still uncompress such a string with:

  my @arr=Bio::Matrix::PSM::_uncompress_string ($str,1,1); # for PSM

or

  my @arr=Bio::Matrix::PSM::_uncompress_string ($str,1000,2); # for log odds
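
The exact byte encoding used by the module is not described here, so the following is only a sketch of one way such a one-byte-per-position compression could work, assuming simple linear scaling. The functions compress_array and uncompress_string are hypothetical stand-ins; only their argument order (data, maximum value, direction) follows the calls shown above.

```perl
use strict;
use warnings;

# Hypothetical stand-in: scale each value to one byte.
# $direction == 1: unsigned, 0..$max is mapped to 0..255.
# Any other value: signed, -$max..$max is mapped to -127..127 and stored
# with an offset of 128 so that it fits in a byte.
sub compress_array {
    my ($values, $max, $direction) = @_;
    $direction = 1 unless defined $direction;
    my ($scale, $offset, $min) = $direction == 1 ? (255, 0, 0) : (127, 128, -$max);
    my @bytes;
    for my $value (@$values) {
        my $v = $value < $min ? $min : $value > $max ? $max : $value;   # clamp
        my $scaled  = $v / $max * $scale;
        my $rounded = int($scaled + ($scaled < 0 ? -0.5 : 0.5));        # round to nearest
        push @bytes, chr($rounded + $offset);
    }
    return join '', @bytes;
}

# Hypothetical stand-in: reverse the scaling above.
sub uncompress_string {
    my ($str, $max, $direction) = @_;
    $direction = 1 unless defined $direction;
    my ($scale, $offset) = $direction == 1 ? (255, 0) : (127, 128);
    return map { (ord($_) - $offset) / $scale * $max } split //, $str;
}

my @freq   = (0, 0.25, 0.5, 1);
my $packed = compress_array(\@freq, 1, 1);            # one byte per position
my @approx = uncompress_string($packed, 1, 1);
printf "%d bytes: %s\n", length($packed),
    join ' ', map { sprintf '%.3f', $_ } @approx;
```

With 256 levels per position the round-trip error of this sketch stays below half a step, about 0.2% of the maximum for unsigned data, which is in line with the "less than 1%" figure quoted above.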

User feedback is an integral part of the evolution of this and other
Bioperl modules. Send your comments and suggestions preferably to one
of the Bioperl mailing lists. Your participation is much appreciated.

  bioperl-l@bioperl.org - General discussion
  http://bioperl.org/wiki/Mailing_lists - About the mailing lists

Please direct usage questions or support issues to the mailing list:

I<bioperl-l@bioperl.org>

rather than to the module maintainer directly. Many experienced and
responsive experts will be able to look at the problem and quickly
address it. Please include a thorough description of the problem
with code and data examples if at all possible.

Report bugs to the Bioperl bug tracking system to help us keep track
of the bugs and their resolution. Bug reports can be submitted via the
web:

  http://bugzilla.open-bio.org/

=head1 AUTHOR - Stefan Kirov

=cut

# Let the code begin...

package Bio::Matrix::PSM::SiteMatrixI;

use base qw(Bio::Root::RootI);

Usage : $self->calc_weight({A=>0.2562,C=>0.2438,G=>0.2432,T=>0.2568});
Function: Recalculates the PSM (or weights) based on the PFM (the frequency matrix)
          and user supplied background model.
Throws : if no model is supplied
Args : reference to a hash with background frequencies for A,C,G and T
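
To make the recalculation concrete, here is a minimal sketch of deriving log-odds weights from a frequency matrix and a background model. The helper log_odds is hypothetical, and the base-2 logarithm is an assumption; an implementation may use a different base or handle zero frequencies differently.

```perl
use strict;
use warnings;

# Hypothetical helper: log2(frequency / background) for every base and position.
# Assumes non-zero frequencies (e.g. after pseudocount correction), because
# Perl's log(0) is a fatal error.
sub log_odds {
    my ($freq, $bg) = @_;
    die "No background model supplied\n" unless ref($bg) eq 'HASH';
    my %logs;
    for my $base (qw(A C G T)) {
        $logs{$base} = [ map { log($_ / $bg->{$base}) / log(2) } @{ $freq->{$base} } ];
    }
    return \%logs;
}

my $logs = log_odds(
    { A => [0.5, 0.125], C => [0.25, 0.125], G => [0.125, 0.25], T => [0.125, 0.5] },
    { A => 0.25, C => 0.25, G => 0.25, T => 0.25 },
);
printf "%.2f %.2f\n", $logs->{A}[0], $logs->{A}[1];   # 1.00 -1.00
```

A positive weight means the base is more frequent at that position than the background predicts, a negative weight means it is depleted.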

$self->throw_not_implemented();

Usage : my %base=$site->next_pos;
Retrieves the next position features: frequencies and weights for
A,C,G,T, the main letter (as in consensus) and the
probability for this letter to occur at this position and
Returns : hash (pA,pC,pG,pT,lA,lC,lG,lT,base,prob,rel)

$self->throw_not_implemented();

Usage : my $pos=$site->curpos;
Function: Gets/sets the current position. Converts to 0 if the argument is negative and
          to width if it is greater than width

$self->throw_not_implemented();

Usage : my $score=$site->e_val;
Function: Gets/sets the e-value
Returns : real number

$self->throw_not_implemented();

Function: Returns the consensus
Args : (optional) threshold value 1 to 10, default 5
       '5' means the returned characters had a 50% or higher presence at

$self->throw_not_implemented();

=head2 accession_number

Title : accession_number
Function: Accession number; this will be a unique id for the SiteMatrix object as
          well as for any other object inheriting from SiteMatrix

=cut

sub accession_number {
    my $self = shift;
    $self->throw_not_implemented();
}

Usage : my $width=$site->width;
Function: Returns the length of the site

$self->throw_not_implemented();

Usage : my $iupac_consensus=$site->IUPAC;
Function: Returns IUPAC compliant consensus

$self->throw_not_implemented();

Usage : my $ic=$site->IC;
Function: Information content
Returns : real number
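
As a rough illustration of what an information content value can look like, the sketch below uses the common definition for DNA with a uniform background: for each position, 2 + sum over bases of p * log2(p), summed over all positions. Both the formula choice and the helper name information_content are assumptions; an implementation may define IC differently.

```perl
use strict;
use warnings;

# Hypothetical helper: total information content in bits, assuming a
# uniform background over the four bases.
sub information_content {
    my ($freq) = @_;                     # { A => [...], C => [...], ... }
    my $width = scalar @{ $freq->{A} };
    my $ic = 0;
    for my $i (0 .. $width - 1) {
        my $pos_ic = 2;                  # maximum for a four-letter alphabet
        for my $base (qw(A C G T)) {
            my $p = $freq->{$base}[$i];
            next unless $p > 0;          # p * log2(p) tends to 0 as p tends to 0
            $pos_ic += $p * log($p) / log(2);
        }
        $ic += $pos_ic;
    }
    return $ic;
}

my $ic = information_content({
    A => [1, 0.25, 0.5],
    C => [0, 0.25, 0.5],
    G => [0, 0.25, 0],
    T => [0, 0.25, 0],
});
printf "%.2f\n", $ic;   # 3.00 (2 bits + 0 bits + 1 bit)
```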

$self->throw_not_implemented();

Usage : my $freq_A=$site->get_string('A');
Function: Returns the given probability vector as a string. Useful if you want to
          store things in a relational database, where arrays are not the first choice
Throws : If the argument is outside {A,C,G,T}
Args : character {A,C,G,T}

$self->throw_not_implemented();

Usage : my $id=$site->id;
Function: Gets/sets the site id

$self->throw_not_implemented();

Usage : my $regexp=$site->regexp;
Function: Returns a regular expression which matches the IUPAC convention.
          N will match X, N, - and .

$self->throw_not_implemented();

Usage : my @regexp=$site->regexp_array;
Function: Returns a regular expression which matches the IUPAC convention.
          N will match X, N, - and .
To do : I have separated regexp and regexp_array, but
        maybe they can be rewritten as one - just check what

$self->throw_not_implemented();

Usage : my @freq_A=$site->get_array('A');
Function: Returns an array with frequencies for a specified base

$self->throw_not_implemented();

Function: Converts a single position to IUPAC compliant symbol and
          returns its probability. For rules see the implementation.
Returns : char, real number
Args : real numbers for A,C,G,T (positional)

$self->throw_not_implemented();

Function: Converts a single position to simple consensus character and
          returns its probability. For rules see the implementation,
Returns : char, real number
Args : real numbers for A,C,G,T (positional)

$self->throw_not_implemented();

=head2 _calculate_consensus

Title : _calculate_consensus
Function: Internal stuff

=cut

sub _calculate_consensus {
    my $self = shift;
    $self->throw_not_implemented();
}

=head2 _compress_array

Title : _compress_array
Function: Compresses an array of real signed numbers to a string (i.e. a vector of bytes):
          -127 to +127 for bi-directional (signed) and 0..255 for unsigned.
Example : Internal stuff
Args : array reference, followed by a max value and
       direction (optional, default 1-unsigned); 1 is unsigned, any other value is signed.

=cut

sub _compress_array {
    my $self = shift;
    $self->throw_not_implemented();
}

=head2 _uncompress_string

Title : _uncompress_string
Function: Uncompresses a string (vector of bytes) to create an array of real
          signed numbers (the opposite of _compress_array)
Example : Internal stuff
Returns : array of real numbers
Args : string, followed by a max value and
       direction (optional, default 1-unsigned); 1 is unsigned, any other value is signed.

=cut

sub _uncompress_string {
    my $self = shift;
    $self->throw_not_implemented();
}

=head2 get_compressed_freq

Title : get_compressed_freq
Function: A method to provide a compressed frequency vector. It uses one byte to
          code the frequency for one of the probability vectors for one position.
          Useful for relational databases. An improvement on the previous 0..a coding.
Example : my $strA=$self->get_compressed_freq('A');

=cut

sub get_compressed_freq {
    my $self = shift;
    $self->throw_not_implemented();
}

=head2 get_compressed_logs

Title : get_compressed_logs
Function: A method to provide a compressed log-odds vector. It uses one byte to
          code the log value for one of the log-odds vectors for one position.
Example : my $strA=$self->get_compressed_logs('A');

=cut

sub get_compressed_logs {
    my $self = shift;
    $self->throw_not_implemented();
}

=head2 sequence_match_weight

Title : sequence_match_weight
Function: This method will calculate the score of a match, based on the PWM
          if such is associated with the matrix object. Returns undef if no
          PWM data is available.
Throws : if the length of the sequence is different from the matrix width
Example : my $score=$matrix->sequence_match_weight('ACGGATAG');
Returns : Floating point
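
The behaviour described above can be sketched as a sum of per-position log-odds weights. The helper match_weight and the layout of the weight matrix (a hash of array references keyed by base) are assumptions made for this illustration, not the interface of an actual implementation.

```perl
use strict;
use warnings;

# Hypothetical helper: score a sequence against a log-odds matrix by summing
# the weight of the observed base at every position.
sub match_weight {
    my ($logs, $seq) = @_;
    return unless defined $logs;                 # no PWM data available
    my $width = scalar @{ $logs->{A} };
    die "Sequence length differs from matrix width\n"
        unless length($seq) == $width;
    my @letters = split //, uc $seq;
    my $score = 0;
    for my $i (0 .. $width - 1) {
        my $column = $logs->{ $letters[$i] }
            or die "Unexpected letter: $letters[$i]\n";
        $score += $column->[$i];
    }
    return $score;
}

my %logs = (
    A => [ 1, -1 ],
    C => [ -2, 0.5 ],
    G => [ -2, 0.5 ],
    T => [ 0, -3 ],
);
print match_weight(\%logs, 'AC'), "\n";   # 1.5
```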

=cut

sub sequence_match_weight {
    my $self = shift;
    $self->throw_not_implemented();
}

=head2 get_all_vectors

Title : get_all_vectors
Function: returns all possible sequence vectors to satisfy the PFM under
Throws : If threshold outside of 0..1 (no sense to do that)
Example : my @vectors=$self->get_all_vectors(4);
Returns : Array of strings
Args : (optional) floating

=cut

sub get_all_vectors {
    my $self = shift;
    $self->throw_not_implemented();
}