bug 2549; fixed small bug in Bio::Taxon which doesn't catch -common_name
[bioperl-live.git] / Bio / PopGen / HtSNP.pm
blob7a07aed494d7134d117555a751cdd00f9d0a5c90
1 # module Bio::PopGen::HtSNP.pm
2 # cared by Pedro M. Gomez-Fabre <pgf18872-at-gsk-dot-com>
6 =head1 NAME
8 Bio::PopGen::HtSNP.pm- Select htSNP from a haplotype set
10 =head1 SYNOPSIS
12 use Bio::PopGen::HtSNP;
14 my $obj = Bio::PopGen::HtSNP->new($hap,$snp,$pop);
16 =head1 DESCRIPTION
18 Select the minimal set of SNP that contains the full information about
19 the haplotype without redundancies.
21 Take as input the followin values:
23 =over 4
25 =item - the haplotype block (array of array).
27 =item - the snp id (array).
29 =item - family information and frequency (array of array).
31 =back
33 The final haplotype is generated in a numerical format and the SNP's
34 sets can be retrieve from the module.
36 B<considerations:>
39 - If you force to include a family with indetermination, the SNP's
40 with indetermination will be removed from the analysis, so consider
41 before to place your data set what do you really want to do.
43 - If two families have the same information (identical haplotype), one
44 of them will be removed and the removed files will be stored classify
45 as removed.
47 - Only are accepted for calculation A, C, G, T and - (as deletion) and
48 their combinations. Any other value as n or ? will be considered as
49 degenerations due to lack of information.
51 =head2 RATIONALE
53 On a haplotype set is expected that some of the SNP and their
54 variations contribute in the same way to the haplotype. Eliminating
55 redundancies will produce a minimal set of SNP's that can be used as
56 input for a taging selection process. On the process SNP's with the
57 same variation are clustered on the same group.
59 The idea is that because the tagging haplotype process is
60 exponential. All redundant information we could eliminate on the
61 tagging process will help to find a quick result.
63 =head2 CONSTRUCTORS
65 my $obj = Bio::PopGen::HtSNP->new
66 (-haplotype_block => \@haplotype_patterns,
67 -snp_ids => \@snp_ids,
68 -pattern_freq => \@pattern_name_and_freq);
70 where $hap, $snp and $pop are in the format:
72 my $hap = [
73 'acgt',
74 'agtc',
75 'cgtc'
76 ]; # haplotype patterns' id
78 my $snp = [qw/s1 s2 s3 s4/]; # snps' Id's
80 my $pop = [
81 [qw/ uno 0.20/],
82 [qw/ dos 0.20/],
83 [qw/ tres 0.15/],
84 ]; # haplotype_pattern_id Frequency
86 =head2 OBJECT METHODS
88 See Below for more detailed summaries.
91 =head1 DETAILS
93 =head2 How the process is working with one example
95 Let's begin with one general example of the code.
97 Input haplotype:
99 acgtcca-t
100 cggtagtgc
101 cccccgtgc
102 cgctcgtgc
104 The first thing to to is to B<split the haplotype> into characters.
106 a c g t c c a - t
107 c g g t a g t g c
108 c c c c c g t g c
109 c g c t c g t g c
111 Now we have to B<convert> the haplotype to B<Upercase>. This
112 will produce the same SNP if we have input a or A.
114 A C G T C C A - T
115 C G G T A G T G C
116 C C C C C G T G C
117 C G C T C G T G C
119 The program admit as values any combination of ACTG and - (deletions).
120 The haplotype is B<converted to number>, considering the first variation
121 as zero and the alternate value as 1 (see expanded description below).
123 0 0 0 0 0 0 0 0 0
124 1 1 0 0 1 1 1 1 1
125 1 0 1 1 0 1 1 1 1
126 1 1 1 0 0 1 1 1 1
128 Once we have the haplotype converted to numbers we have to generate the
129 snp type information for the haplotype.
132 B<SNP code = SUM ( value * multiplicity ^ position );>
134 where:
135 SUM is the sum of the values for the SNP
136 value is the SNP number code (0 [generally for the mayor allele],
137 1 [for the minor allele].
138 position is the position on the block.
140 For this example the code is:
142 0 0 0 0 0 0 0 0 0
143 1 1 0 0 1 1 1 1 1
144 1 0 1 1 0 1 1 1 1
145 1 1 1 0 0 1 1 1 1
146 ------------------------------------------------------------------
147 14 10 12 4 2 14 14 14 14
149 14 = 0*2^0 + 1*2^1 + 1*2^2 + 1*2^3
150 12 = 0*2^0 + 1*2^1 + 0*2^2 + 1*2^3
151 ....
153 Once we have the families classify. We will B<take> just the SNP's B<not
154 redundant>.
156 14 10 12 4 2
158 This information will be B<passed to the tag module> is you want to tag
159 the htSNP.
161 Whatever it happens to one SNPs of a class will happen to a SNP of
162 the same class. Therefore you don't need to scan redundancies
164 =head2 Working with fuzzy data.
166 This module is designed to work with fuzzy data. As the source of the
167 haplotype is diverse. The program assume that some haplotypes can be
168 generated using different values. If there is any indetermination (? or n)
169 or any other degenerated value or invalid. The program will take away
170 This SNP and will leave that for a further analysis.
172 On a complex situation:
174 a c g t ? c a c t
175 a c g t ? c a - t
176 c g ? t a g ? g c
177 c a c t c g t g c
178 c g c t c g t g c
179 c g g t a g ? g c
180 a c ? t ? c a c t
182 On this haplotype everything is happening. We have a multialelic variance.
183 We have indeterminations. We have deletions and we have even one SNP
184 which is not a real SNP.
186 The buiding process will be the same on this situation.
188 Convert the haplotype to uppercase.
190 A C G T ? C A C T
191 A C G T ? C A - T
192 C G ? T A G ? G C
193 C A C T C G T G C
194 C G C T C G T G C
195 C G G T A G ? G C
196 A C ? T ? C A C T
198 All columns that present indeterminations will be removed from the analysis
199 on this Step.
201 hapotype after remove columns:
203 A C T C C T
204 A C T C - T
205 C G T G G C
206 C A T G G C
207 C G T G G C
208 C G T G G C
209 A C T C C T
211 All changes made on the haplotype matrix, will be also made on the SNP list.
213 snp_id_1 snp_id_2 snp_id_4 snp_id_6 snp_id_8 snp_id_9
215 now the SNP that is not one SNP will be removed from the analysis.
216 SNP with Id snp_id_4 (the one with all T's).
219 because of the removing. Some of the families will become the same and will
220 be clustered. A posteriori analysis will diference these families.
221 but because of the indetermination can not be distinguish.
223 A C C C T
224 A C C - T
225 C G G G C
226 C A G G C
227 C G G G C
228 C G G G C
229 A C C C T
231 The result of the mergering will go like:
233 A C C C T
234 A C C - T
235 C G G G C
236 C A G G C
238 Once again the changes made on the families and we merge the frequency (I<to be
239 implemented>)
241 Before to convert the haplotype into numbers we consider how many variations
242 we have on the set. On this case the variations are 3.
244 The control code will use on this situation base three as mutiplicity
246 0 0 0 0 0
247 0 0 0 1 0
248 1 1 1 2 1
249 1 2 1 2 1
250 -----------------------------------
251 36 63 36 75 36
253 And the minimal set for this combination is
255 0 0 0
256 0 0 1
257 1 1 2
258 1 2 2
260 B<NOTE:> this second example is a remote example an on normal conditions. This
261 conditions makes no sense, but as the haplotypes, can come from many sources
262 we have to be ready for all kind of combinations.
265 =head1 FEEDBACK
267 =head2 Mailing Lists
269 User feedback is an integral part of the evolution of this and other
270 Bioperl modules. Send your comments and suggestions preferably to
271 the Bioperl mailing list. Your participation is much appreciated.
273 bioperl-l@bioperl.org - General discussion
274 http://bioperl.org/wiki/Mailing_lists - About the mailing lists
276 =head2 Reporting Bugs
278 Report bugs to the Bioperl bug tracking system to help us keep track
279 of the bugs and their resolution. Bug reports can be submitted via
280 the web:
282 http://bugzilla.open-bio.org/
284 =head1 AUTHOR - Pedro M. Gomez-Fabre
286 Email pgf18872-at-gsk-dot-com
289 =head1 APPENDIX
291 The rest of the documentation details each of the object methods.
292 Internal methods are usually preceded with a _
294 =cut
296 # Let the code begin...
298 package Bio::PopGen::HtSNP;
299 use Data::Dumper;
300 use Storable qw(dclone);
302 use vars qw ();
303 use strict;
306 use base qw(Bio::Root::Root);
308 my $USAGE = 'Usage:
310 Bio::PopGen::HtSNP->new(-haplotype_block -ids -pattern_freq)
314 =head2 new
316 Title : new
317 Function: constructor of the class.
318 Usage : $obj-> Bio::PopGen::HtSNP->new(-haplotype_block
319 -snp_ids
320 -pattern_freq)
321 Returns : self hash
322 Args : input haplotype (array of array)
323 snp_ids (array)
324 pop_freq (array of array)
325 Status : public
327 =cut
329 sub new {
330 my($class, @args) = @_;
332 my $self = $class->SUPER::new(@args);
333 my ($haplotype_block,
334 $snp_ids,
335 $pattern_freq ) = $self->_rearrange([qw(HAPLOTYPE_BLOCK
336 SNP_IDS
337 PATTERN_FREQ)],@args);
339 if ($haplotype_block){
340 $self->haplotype_block($haplotype_block);
342 else{
343 $self->throw("Haplotype block has not been defined.
344 \n$USAGE");
346 if ($snp_ids){
347 $self->snp_ids($snp_ids);
349 else{
350 $self->throw("Array with ids has not been defined.
351 \n$USAGE");
353 if ($pattern_freq){
354 $self->pattern_freq($pattern_freq);
356 else{
357 $self->throw("Array with pattern id and frequency has not been defined.
358 \n$USAGE");
361 # if the input values are not well formed complained and exit.
362 _check_input($self);
364 _do_it($self);
366 return $self;
369 =head2 haplotype_block
371 Title : haplotype_block
372 Usage : my $haplotype_block = $HtSNP->haplotype_block();
373 Function: Get the haplotype block for a haplotype tagging selection
374 Returns : reference of array
375 Args : reference of array with haplotype pattern
378 =cut
380 sub haplotype_block{
381 my ($self) =shift;
382 return $self->{'_haplotype_block'} = shift if @_;
383 return $self->{'_haplotype_block'};
386 =head2 snp_ids
388 Title : snp_ids
389 Usage : my $snp_ids = $HtSNP->$snp_ids();
390 Function: Get the ids for a haplotype tagging selection
391 Returns : reference of array
392 Args : reference of array with SNP ids
395 =cut
397 sub snp_ids{
398 my ($self) =shift;
399 return $self->{'_snp_ids'} = shift if @_;
400 return $self->{'_snp_ids'};
404 =head2 pattern_freq
406 Title : pattern_freq
407 Usage : my $pattern_freq = $HtSNP->pattern_freq();
408 Function: Get the pattern id and frequency for a haplotype
409 tagging selection
410 Returns : reference of array
411 Args : reference of array with SNP ids
413 =cut
415 sub pattern_freq{
416 my ($self) =shift;
417 return $self->{'_pattern_freq'} = shift if @_;
418 return $self->{'_pattern_freq'};
421 =head2 _check_input
423 Title : _check_input
424 Usage : _check_input($self)
425 Function: check for errors on the input
426 Returns : self hash
427 Args : self
428 Status : internal
430 =cut
432 #------------------------
433 sub _check_input{
434 #------------------------
436 my $self = shift;
438 _haplotype_length_error($self);
439 _population_error($self);
443 =head2 _haplotype_length_error
445 Title : _haplotype_length_error
446 Usage : _haplotype_length_error($self)
447 Function: check if the haplotype length is the same that the one on the
448 SNP id list. If not break and exit
449 Returns : self hash
450 Args : self
451 Status : internal
453 =cut
456 #------------------------
457 sub _haplotype_length_error{
458 #------------------------
460 my $self = shift;
462 my $input_block = $self->haplotype_block();
463 my $snp_ids = $self->snp_ids();
466 #############################
467 # define error list
468 #############################
469 my $different_haplotype_length = 0;
471 ##############################
472 # get parameters used to find
473 # the errors
474 ##############################
476 my $snp_number = scalar @$snp_ids;
477 my $number_of_families = scalar @$input_block;
478 my $h = 0; # haplotype position
481 ############################
482 # haplotype length
484 # if the length differs from the number of ids
485 ############################
487 for ($h=0; $h<$#$input_block+1 ; $h++){
488 if (length $input_block->[$h] != $snp_number){
489 $different_haplotype_length = 1;
490 last;
494 # haploytypes does not have the same length
495 if ($different_haplotype_length){
496 $self->throw("The number of snp ids is $snp_number and ".
497 "the length of the family (". ($h+1) .") [".
498 $input_block->[$h]."] is ".
499 length $input_block->[$h], "\n");
503 =head2 _population_error
506 Title : _population_error
507 Usage : _population_error($self)
508 Function: use input_block and pop_freq test if the number of elements
509 match. If doesn't break and quit.
510 Returns : self hash
511 Args : self
512 Status : internal
514 =cut
517 #------------------------
518 sub _population_error{
519 #------------------------
521 my $self = shift;
523 my $input_block = $self->haplotype_block();
524 my $pop_freq = $self->pattern_freq();
526 #############################
527 # define error list
528 #############################
529 my $pop_freq_elements_error = 0; # matrix bad formed
531 ##############################
532 # get parameters used to find
533 # the errors
534 ##############################
535 my $number_of_families = scalar @$input_block;
537 my $pf = 0; # number of elements on population frequency
538 my $frequency = 0; # population frequency
539 my $p_f_length = 0;
541 # check if the pop_freq array is well formed and if the number
542 # of elements fit with the number of families
544 #############################
545 # check population frequency
547 # - population frequency matrix need to be well formed
548 # - get the frequency
549 # - calculate number of families on pop_freq
550 #############################
552 for ($pf=0; $pf<$#$pop_freq+1; $pf++){
553 $frequency += $pop_freq->[$pf]->[1];
555 if ( scalar @{$pop_freq->[$pf]} !=2){
556 $p_f_length = scalar @{$pop_freq->[$pf]};
557 $pop_freq_elements_error = 1;
558 last;
562 ###########################
563 ## error processing
564 ###########################
567 # The frequency shouldn't be greater than 1
568 if ($frequency >1) {
569 $self->warn("The frequency for this set is $frequency (greater than 1)\n");
572 # the haplotype matix is not well formed
573 if ($pop_freq_elements_error){
574 $self->throw("the frequency matrix is not well formed\n".
575 "\nThe number of elements for pattern ".($pf+1)." is ".
576 "$p_f_length\n".
577 "It should be 2 for pattern \"@{$pop_freq->[$pf]}\"\n".
578 "\nFormat should be:\n".
579 "haplotype_id\t frequency\n"
583 # the size does not fit on pop_freq array
584 # with the one in haplotype (input_block)
585 if ($pf != $number_of_families) {
586 $self->throw("The number of patterns on frequency array ($pf)\n".
587 "does not fit with the number of haplotype patterns on \n".
588 "haplotype array ($number_of_families)\n");
592 =head2 _do_it
595 Title : _do_it
596 Usage : _do_it($self)
597 Function: Process the input generating the results.
598 Returns : self hash
599 Args : self
600 Status : internal
602 =cut
604 #------------------------
605 sub _do_it{
606 #------------------------
608 my $self = shift;
610 # first we are goinf to define here all variables we are going to use
611 $self -> {'w_hap'} = [];
612 $self -> {'w_pop_freq'} = dclone ( $self ->pattern_freq() );
613 $self -> {'deg_pattern'} = {};
614 $self -> {'snp_type'} = {}; # type of snp on the set. see below
615 $self -> {'alleles_number'} = 0; # number of variations (biallelic,...)
616 $self -> {'snp_type_code'} = [];
617 $self -> {'ht_type'} = []; # store the snp type used on the htSet
618 $self -> {'split_hap'} = [];
619 $self -> {'snp_and_code'} = [];
622 # we classify the SNP under snp_type
623 $self->{snp_type}->{useful_snp} = dclone ( $self ->snp_ids() );
624 $self->{snp_type}->{deg_snp} = []; # deg snp
625 $self->{snp_type}->{silent_snp} = []; # not a real snp
627 # split the haplotype
628 _split_haplo ($self);
630 # first we convert to upper case the haplotype
631 # to make A the same as a for comparison
632 _to_upper_case( $self -> {w_hap} );
634 #######################################################
635 # check if any SNP has indetermination. If any SNP has
636 # indetermination this value will be removed.
637 #######################################################
638 _remove_deg ( $self );
640 #######################################################
641 # depending of the families you use some SNPs can be
642 # silent. This silent SNP's are not used on the
643 # creation of tags and has to be skipped from the
644 # analysis.
645 #######################################################
646 _rem_silent_snp ( $self );
648 #######################################################
649 # for the remaining SNP's we have to check if two
650 # families have the same value. If this is true, the families
651 # will produce the same result and therefore we will not find
652 # any pattern. So, the redundant families need to be take
653 # away from the analysis. But also considered for a further
654 # run.
656 # When we talk about a normal haplotype blocks this situation
657 # makes no sense but if we remove one of the snp because the
658 # degeneration two families can became the same.
659 # these families may be analised on a second round
660 #######################################################
662 _find_deg_pattern ( $self );
664 #################################################################
665 # if the pattern list length is different to the lenght of the w_hap
666 # we can tell that tow columns have been considered as the same one
667 # and therefore we have to start to remove the values.
668 # remove all columns with degeneration
670 # For this calculation we don't use the pattern frequency.
671 # All patterns are the same, This selection makes
672 # sense when you have different frequency.
674 # Note: on this version we don't classify the haplotype by frequency
675 # but if you need to do it. This is the place to do it!!!!
677 # In reality you don't need to sort the values because you will remove
678 # the values according to their values.
680 # But as comes from a hash, the order could be different and as a
681 # consequence the code generate on every run of the same set could
682 # differ. That is not important. In fact, does not matter but could
683 # confuse people.
684 #################################################################
686 my @tmp =sort { $a <=> $b}
687 keys %{$self -> {deg_pattern}}; # just count the families
689 # if the size of the list is different to the size of the degenerated
690 # family. There is degeneration. And the redundancies will be
691 # removed.
692 if($#tmp != $#{$self -> { w_hap } } ){
693 _keep_these_patterns($self->{w_hap}, \@tmp);
694 _keep_these_patterns($self->{w_pop_freq}, \@tmp);
697 #################################################################
698 # the steps made before about removing snp and cluster families
699 # are just needed pre-process the haplotype before.
701 # Now is when the fun starts.
704 # once we have the this minimal matrix, we have to calculate the
705 # max multipliticy for the values. The max number of alleles found
706 # on the set. A normal haplotype is biallelic but we can not
707 # reject multiple variations.
708 ##################################################################
710 _alleles_number ( $self );
712 ##################################################################
713 # Now we have to convert the haplotype into number
715 # A C C - T
716 # C A G G C
717 # A C C C T
718 # C G G G C
720 # one haplotype like this transformed into number produce this result
722 # 0 0 0 0 0
723 # 1 1 1 1 1
724 # 0 0 0 2 0
725 # 1 2 1 1 1
727 ##################################################################
729 _convert_to_numbers( $self );
731 ###################################################################
732 # The next step is to calculate the type of the SNP.
733 # This process is made based on the position of the SNP, the value
734 # and its multiplicity.
735 ###################################################################
737 _snp_type_code( $self );
739 ###################################################################
740 # now we have all information we need to calculate the haplotype
741 # tagging SNP htSNP
742 ###################################################################
744 _htSNP( $self );
746 ###################################################################
747 # patch:
749 # all SNP have a code. but if the SNP is not used this code must
750 # be zero in case of silent SNP. This looks not to informative
751 # because all the information is already there. But this method
752 # compile the full set.
753 ###################################################################
755 _snp_and_code_summary( $self );
758 =head2 input_block
760 Title : input_block
761 Usage : $obj->input_block()
762 Function: returns input block
763 Returns : reference to array of array
764 Args : none
765 Status : public
767 =cut
769 #------------------------
770 sub input_block{
771 #------------------------
773 my $self = shift;
774 return $self -> {input_block};
777 =head2 hap_length
779 Title : hap_length
780 Usage : $obj->hap_length()
781 Function: get numbers of SNP on the haplotype
782 Returns : scalar
783 Args : none
784 Status : public
786 =cut
788 #------------------------
789 sub hap_length{
790 #------------------------
792 my $self = shift;
793 return scalar @{$self -> {'_snp_ids'}};
797 =head2 pop_freq
799 Title : pop_freq
800 Usage : $obj->pop_freq()
801 Function: returns population frequency
802 Returns : reference to array
803 Args : none
804 Status : public
806 =cut
808 #------------------------
809 sub pop_freq{
810 #------------------------
812 my $self = shift;
813 return $self -> {pop_freq}
817 =head2 deg_snp
820 Title : deg_snp
821 Usage : $obj->deg_snp()
822 Function: returns snp_removes due to indetermination on their values
823 Returns : reference to array
824 Args : none
825 Status : public
827 =cut
829 #------------------------
830 sub deg_snp{
831 #------------------------
832 my $self = shift;
833 return $self -> {snp_type} ->{deg_snp};
837 =head2 snp_type
840 Title : snp_type
841 Usage : $obj->snp_type()
842 Function: returns hash with SNP type
843 Returns : reference to hash
844 Args : none
845 Status : public
847 =cut
849 #------------------------
850 sub snp_type{
851 #------------------------
852 my $self = shift;
853 return $self -> {snp_type};
857 =head2 silent_snp
860 Title : silent_snp
861 Usage : $obj->silent_snp()
862 Function: some SNP's are silent (not contibuting to the haplotype)
863 and are not considering for this analysis
864 Returns : reference to a array
865 Args : none
866 Status : public
868 =cut
870 #------------------------
871 sub silent_snp{
872 #------------------------
873 my $self = shift;
874 return $self -> {snp_type} ->{silent_snp};
878 =head2 useful_snp
881 Title : useful_snp
882 Usage : $obj->useful_snp()
883 Function: returns list of SNP's that are can be used as htSNP. Some
884 of them can produce the same information. But this is
885 not considered here.
886 Returns : reference to a array
887 Args : none
888 Status : public
890 =cut
892 #------------------------
893 sub useful_snp{
894 #------------------------
895 my $self = shift;
896 return $self -> {snp_type} ->{useful_snp};
900 =head2 ht_type
903 Title : ht_type
904 Usage : $obj->ht_type()
905 Function: every useful SNP has a numeric code dependending of its
906 value and position. For a better description see
907 description of the module.
908 Returns : reference to a array
909 Args : none
910 Status : public
912 =cut
914 #------------------------
915 sub ht_type{
916 #------------------------
917 my $self = shift;
918 return $self -> {ht_type};
920 =head2 ht_set
923 Title : ht_set
924 Usage : $obj->ht_set()
925 Function: returns the minimal haplotype in numerical format. This
926 haplotype contains the maximal information about the
927 haplotype variations but with no redundancies. It's the
928 minimal set that describes the haplotype.
929 Returns : reference to an array of arrays
930 Args : none
931 Status : public
933 =cut
935 #------------------------
936 sub ht_set{
937 #------------------------
938 my $self = shift;
939 return $self -> {w_hap};
942 =head2 snp_type_code
945 Title : snp_type_code
946 Usage : $obj->snp_type_code()
947 Function: returns the numeric code of the SNPs that need to be
948 tagged that correspond to the SNP's considered in ht_set.
949 Returns : reference to an array
950 Args : none
951 Status : public
953 =cut
955 #------------------------
956 sub snp_type_code{
957 #------------------------
958 my $self = shift;
959 return $self -> {snp_type_code};
962 =head2 snp_and_code
965 Title : snp_and_code
966 Usage : $obj->snp_and_code()
967 Function: Returns the full list of SNP's and the code associate to
968 them. If the SNP belongs to the group useful_snp it keep
969 this code. If the SNP is silent the code is 0. And if the
970 SNP is degenerated the code is -1.
971 Returns : reference to an array of array
972 Args : none
973 Status : public
975 =cut
977 #------------------------
978 sub snp_and_code{
979 #------------------------
980 my $self = shift;
981 return $self -> {'snp_and_code'};
984 =head2 deg_pattern
987 Title : deg_pattern
988 Usage : $obj->deg_pattern()
989 Function: Returns the a list with the degenerated haplotype.
990 Sometimes due to degeneration some haplotypes looks
991 the same and if we don't remove them it won't find
992 any tag.
993 Returns : reference to a hash of array
994 Args : none
995 Status : public
997 =cut
999 #------------------------
1000 sub deg_pattern{
1001 #------------------------
1002 my $self = shift;
1004 return $self -> {'deg_pattern'};
1007 =head2 split_hap
1010 Title : split_hap
1011 Usage : $obj->split_hap()
1012 Function: simple representation of the haplotype base by base
1013 Same information that input haplotype but base based.
1014 Returns : reference to an array of array
1015 Args : none
1016 Status : public
1018 =cut
1020 #------------------------
1021 sub split_hap{
1022 #------------------------
1023 my $self = shift;
1024 return $self -> {'split_hap'};
1027 =head2 _split_haplo
1029 Title : _split_haplo
1030 Usage : _split_haplo($self)
1031 Function: Take a haplotype and split it into bases
1032 Returns : self
1033 Args : none
1034 Status : internal
1036 =cut
1038 #------------------------
1039 sub _split_haplo {
1040 #------------------------
1041 my $self = shift;
1043 my $in = $self ->{'_haplotype_block'};
1044 my $out = $self ->{'w_hap'};
1046 # split every haplotype and store the result into $out
1047 foreach (@$in){
1048 push @$out, [split (//,$_)];
1051 $self -> {'split_hap'} = dclone ($out);
1054 # internal method to convert the haplotype to uppercase
1057 =head2 _to_upper_case
1060 Title : _to_upper_case
1061 Usage : _to_upper_case()
1062 Function: make SNP or in-dels Upper case
1063 Returns : self
1064 Args : an AoA ref
1065 Status : private
1067 =cut
1069 #------------------------
1070 sub _to_upper_case {
1071 #------------------------
1072 my ($arr) =@_;
1074 foreach my $aref (@$arr){
1075 foreach my $value (@{$aref} ){
1076 $value = uc $value;
1082 =head2 _remove_deg
1085 Title : _remove_deg
1086 Usage : _remove_deg()
1087 Function: when have a indetermination or strange value this SNP
1088 is removed
1089 Returns : haplotype family set and degeneration list
1090 Args : ref to an AoA and a ref to an array
1091 Status : internal
1093 =cut
1095 #------------------------
1096 sub _remove_deg {
1097 #------------------------
1098 my $self = shift;
1100 my $hap = $self->{w_hap};
1101 my $snp = $self->{snp_type}->{useful_snp};
1102 my $deg_snp = $self->{snp_type}->{deg_snp};
1104 my $rem = []; # take the position of the array to be removed
1106 # first we work on the columns we have void values
1107 $rem = _find_indet($hap,$rem); # find degenerated columns
1109 if (@$rem){
1111 # remove column on haplotype
1112 _remove_col($hap,$rem); # remove list
1114 # now remove the values from SNP id
1115 _remove_snp_id($snp,$deg_snp,$rem); # remove list
1120 =head2 _rem_silent_snp
1123 Title : _rem_silent_snp
1124 Usage : _rem_silent_snp()
1125 Function: there is the remote possibilty that one SNP won't be a
1126 real SNP on this situation we have to remove this SNP,
1127 otherwise the program won't find any tag
1128 Returns : nonthing
1129 Args : ref to an AoA and a ref to an array
1130 Status : internal
1132 =cut
1134 #------------------------
1135 sub _rem_silent_snp {
1136 #------------------------
1137 my $self = shift;
1139 my $hap = $self->{w_hap};
1140 my $snp = $self->{snp_type}->{useful_snp};
1141 my $silent_snp = $self->{snp_type}->{silent_snp};
1143 my $rem = []; # store the positions to be removed
1145 #find columns with no variation on the SNP, Real snp?
1146 $rem = _find_silent_snps($hap);
1148 if (@$rem){
1150 # remove column on haplotype
1151 _remove_col($hap,$rem);
1153 # remove the values from SNP id
1154 _remove_snp_id($snp,$silent_snp,$rem);
1159 =head2 _find_silent_snps
1162 Title : _find_silent_snps
1163 Usage :
1164 Function: list of snps that are not SNPs. All values for that
1165 SNPs on the set is the same one. Look stupid but can
1166 happend and if this happend you will not find any tag
1167 Returns : nothing
1168 Args :
1169 Status :
1171 =cut
1173 #------------------------
1174 sub _find_silent_snps{
1175 #------------------------
1176 my ($arr)=@_;
1178 my $list =[]; # no snp list;
1180 # determine the number of snp by the length of the first row.
1181 # we assume that the matrix is squared.
1182 my $colsn= @{$arr->[0]};
1184 for (my $i=0;$i<$colsn;$i++){
1185 my $different =0; # check degeneration
1187 for my $r (1..$#$arr){
1188 if($arr->[0][$i] ne $arr->[$r][$i]){
1189 $different =1;
1190 last;
1194 if(!$different){
1195 push (@$list, $i);
1199 return $list;
1203 =head2 _find_indet
1206 Title : _find_indet
1207 Usage :
1208 Function: find column (SNP) with invalid or degenerated values
1209 and store this values into the second parameter suplied.
1210 Returns : nothing
1211 Args : ref to AoA and ref to an array
1212 Status : internal
1214 =cut
1216 #------------------------
1217 sub _find_indet{
1218 #------------------------
1219 my ($arr, $list)=@_;
1221 foreach my $i(0..$#$arr){
1222 foreach my $j(0..$#{$arr->[$i]}){
1223 unless ($arr->[$i][$j] =~ /[ACTG-]/){
1224 if ($#$list<0){
1225 push(@$list,$j);
1227 else{
1228 my $found =0; # check if already exist the value
1229 foreach my $k(0..$#$list){
1230 $found =1 if ($list->[$k] eq $j);
1231 last if ($found);
1233 if(!$found){
1234 push(@$list,$j);
1241 @$list = sort { $a <=> $b} @$list;
1243 return $list;
1246 =head2 _remove_col
1248 Title : _remove_col
1249 Usage :
1250 Function: remove columns contained on the second array from
1251 the first arr
1252 Returns : nothing
1253 Args : array of array reference and array reference
1254 Status : internal
1256 =cut
1258 #------------------------
1259 sub _remove_col{
1260 #------------------------
1261 my ($arr,$rem)=@_;
1263 foreach my $col (reverse @$rem){
1264 splice @$_, $col, 1 for @$arr;
1269 =head2 _remove_snp_id
1271 Title : _remove_snp_id
1272 Usage :
1273 Function: remove columns contained on the second array from
1274 the first arr
1275 Returns : nothing
1276 Args : array of array reference and array reference
1277 Status : internal
1279 =cut
1281 #------------------------
1282 sub _remove_snp_id{
1283 #------------------------
1284 my ($arr,$removed,$rem_list)=@_;
1286 push @$removed, splice @$arr, $_, 1 foreach reverse @$rem_list;
1290 =head2 _find_deg_pattern
1292 Title : _find_deg_pattern
1293 Usage :
1294 Function: create a list with the degenerated patterns
1295 Returns : @array
1296 Args : a ref to AoA
1297 Status : public
1299 =cut
1301 #------------------------
1302 sub _find_deg_pattern{
1303 #------------------------
1304 my $self = shift;
1306 my $arr = $self ->{w_hap}; # the working haplotype
1307 my $list = $self ->{'deg_pattern'}; # degenerated patterns
1309 # we have to check all elements
1310 foreach my $i(0..$#$arr){
1311 # is the element has not been used create a key
1312 unless ( _is_on_hash ($list,\$i) ) {
1313 $list->{$i}=[$i];
1316 foreach my $j($i+1..$#$arr){
1317 my $comp = compare_arrays($arr->[$i],$arr->[$j]);
1319 if($comp){
1320 # as we have no elements we push this into the list
1321 # check for the first element
1322 my $key = _key_for_value($list,\$i);
1324 push (@{$list->{$key}},$j);
1326 last;
1333 #------------------------
1334 sub _key_for_value{
1335 #------------------------
1336 my($hash,$value)=@_;
1338 foreach my $key (keys %$hash){
1339 if( _is_there(\@{$hash->{$key}},$value)){
1340 return $key;
1345 #------------------------
1346 sub _is_on_hash{
1347 #------------------------
1348 my($hash,$value)=@_;
1350 foreach my $key (keys %$hash){
1351 if( _is_there(\@{$hash->{$key}},$value)){
1352 return 1;
1357 #------------------------
1358 sub _is_there{
1359 #------------------------
1361 my($arr,$value)=@_;
1363 foreach my $el (@$arr){
1364 if ($el eq $$value){
1365 return 1;
1371 =head2 _keep_these_patterns
1374 Title : _keep_these_patterns
1375 Usage :
1376 Function: this is a basic approach, take a LoL and a list,
1377 keep just the columns included on the list
1378 Returns : nothing
1379 Args : an AoA and an array
1380 Status : public
1382 =cut
1384 #------------------------
1385 sub _keep_these_patterns{
1386 #------------------------
1387 my ($arr,$list)=@_;
1389 # by now we just take one of the repetitions but you can weight
1390 # the values by frequency
1392 my @outValues=();
1394 foreach my $k (@$list){
1395 push @outValues, $arr->[$k];
1398 #make arr to hold the new values
1399 @$arr= @{dclone(\@outValues)};
1404 =head2 compare_arrays
1407 Title : compare_arrays
1408 Usage :
1409 Function: take two arrays and compare their values
1410 Returns : 1 if the two values are the same
1411 0 if the values are different
1412 Args : an AoA and an array
1413 Status : public
1415 =cut
1417 #------------------------
1418 sub compare_arrays {
1419 #------------------------
1420 my ($first, $second) = @_;
1421 return 0 unless @$first == @$second;
1422 for (my $i = 0; $i < @$first; $i++) {
1423 return 0 if $first->[$i] ne $second->[$i];
1425 return 1;
1429 =head2 _convert_to_numbers
1432 Title : _convert_to_numbers
1433 Usage : _convert_to_numbers()
1434 Function: tranform the haplotype into numbers. before to do that
1435 we have to consider the variation on the set.
1436 Returns : nonthing
1437 Args : ref to an AoA and a ref to an array
1438 Status : internal
1440 =cut
1442 #------------------------
1443 sub _convert_to_numbers{
1444 #------------------------
1445 my $self = shift;
1447 my $hap_ref = $self->{w_hap};
1448 my $mm = $self->{alleles_number};
1450 # the first element is considered as zero. The first modification
1451 # is consider as one and so on.
1453 my $length = @{ @$hap_ref[0]}; #length of the haplotype
1455 for (my $c = 0; $c<$length;$c++){
1457 my @al=();
1459 for my $r (0..$#$hap_ref){
1461 push @al,$hap_ref->[$r][$c]
1462 unless _is_there(\@al,\$hap_ref->[$r][$c]);
1464 $hap_ref->[$r][$c] = get_position(\@al,\$hap_ref->[$r][$c]);
1470 =head2 _snp_type_code
1473 Title : _snp_type_code
1474 Usage :
1475 Function:
1476 we have to create the snp type code for each version.
1477 The way the snp type is created is the following:
1479 we take the number value for every SNP and do the
1480 following calculation
1482 let be a SNP set as follow:
1488 and multiplicity 3
1489 on this case the situation is:
1491 sum (value * multiplicity ^ position) for each SNP
1493 0 * 3 ^ 0 + 1 * 3 ^ 1 + 1 * 3 ^ 2 = 12
1494 0 * 3 ^ 0 + 1 * 3 ^ 1 + 2 * 3 ^ 2 = 21
1495 Returns : nothing
1496 Args : $self
1497 Status : private
1499 =cut
1501 #------------------------
1502 sub _snp_type_code{
1503 #------------------------
1504 my $self = shift;
1506 my $hap = $self->{w_hap};
1507 my $arr = $self->{snp_type_code};
1508 my $al = $self->{alleles_number};
1510 my $length = @{ $hap->[0]}; #length of the haplotype
1512 for (my $c=0; $c<$length; $c++){
1513 for my $r (0..$#$hap){
1514 $arr->[$c] += $hap->[$r][$c] * $al ** $r;
1519 #################################################
1520 # return the position of an element in one array
1521 # The element is always present on the array
1522 #################################################
1524 #------------------------
1525 sub get_position{
1526 #------------------------
1528 my($array, $value)=@_;
1530 for my $i(0..$#$array) {
1531 if ($array->[$i] eq $$value){
1532 return $i;
1539 =head2 _alleles_number
1542 Title : _alleles_number
1543 Usage :
1544 Function: calculate the max number of alleles for a haplotype and
1545 if the number. For each SNP the number is stored and the
1546 max number of alleles for a SNP on the set is returned
1547 Returns : max number of alleles (a scalar storing a number)
1548 Args : ref to AoA
1549 Status : public
1551 =cut
1553 #------------------------
1554 sub _alleles_number{
1555 #------------------------
1557 my $self = shift;
1559 my $hap_ref = $self ->{w_hap}; # working haplotype
1561 my $length = @{ @$hap_ref[0]}; # length of the haplotype
1563 for (my $c = 0; $c<$length;$c++){
1565 my %alleles=();
1567 for my $r (0..$#$hap_ref){
1568 $alleles{ $hap_ref->[$r][$c] } =1; # new key for every new snp
1571 # if the number of alleles for this column is
1572 # greater than before set $m value as allele number
1573 if ($self->{alleles_number} < keys %alleles) {
1574 $self->{alleles_number} = keys %alleles;
1580 =head2 _htSNP
1583 Title : _htSNP
1584 Usage : _htSNP()
1585 Function: calculate the minimal set that contains all information of the
1586 haplotype.
1587 Returns : nonthing
1588 Args : ref to an AoA and a ref to an array
1589 Status : internal
1591 =cut
1593 #------------------------
1594 sub _htSNP{
1595 #------------------------
1596 my $self = shift;
1598 my $hap = $self->{'w_hap'};
1599 my $type = $self->{'snp_type_code'};
1600 my $set = $self->{'ht_type'};
1601 my $out = []; # store the minimal set
1603 my $nc=0; # new column for the output values
1605 # pass for every value of the snp_type_code
1606 for my $c (0..$#$type){
1608 my $exist =0;
1610 # every new value (not present) is pushed into set
1611 if ( ! _is_there( $set,\$type->[$c] ) ){
1612 push @$set, $type->[$c];
1614 $exist =1;
1616 for my $r(0..$#$hap){
1617 #save value of the snp for every SNP
1618 $out->[$r][$nc]= $hap->[$r][$c];
1622 if ($exist){ $nc++ };
1625 @$hap = @{dclone $out};
1628 =head2 _snp_and_code_summary
1630 Title : _snp_and_code_summary
1631 Usage : _snp_and_code_summary()
1632 Function: compile on a list all SNP and the code for each. This
1633 information can be also obtained combining snp_type and
1634 snp_type_code but on these results the information about
1635 the rest of SNP's are not compiled as table.
1637 0 will be silent SNPs
1638 -1 are degenerated SNPs
1639 and the rest of positive values are the code for useful SNP
1641 Returns : nonthing
1642 Args : ref to an AoA and a ref to an array
1643 Status : internal
1645 =cut
1647 #------------------------
1648 sub _snp_and_code_summary{
1649 #------------------------
1650 my $self = shift;
1652 my $snp_type_code = $self->{'snp_type_code'};
1653 my $useful_snp = $self->{'snp_type'}->{'useful_snp'};
1654 my $silent_snp = $self->{'snp_type'}->{'silent_snp'};
1655 my $deg_snp = $self->{'snp_type'}->{'deg_snp'};
1656 my $snp_ids = $self->snp_ids();
1657 my $snp_and_code = $self->{'snp_and_code'};
1659 # walk all SNP's and generate code for each
1661 # do a practical thing. Consider all snp silent
1662 foreach my $i (0..$#$snp_ids){
1664 # assign zero to silent
1665 my $value=0;
1667 # active SNPs
1668 foreach my $j (0..$#$useful_snp){
1669 if ($snp_ids->[$i] eq $useful_snp->[$j]){
1670 $value = $snp_type_code->[$j];
1671 last;
1675 # assign -1 to degenerated
1676 foreach my $j (0..$#$deg_snp){
1677 if ($snp_ids->[$i] eq $deg_snp->[$j]){
1678 $value = -1;
1679 last;
1683 push @$snp_and_code, [$snp_ids->[$i], $value];