1 # module Bio::PopGen::HtSNP.pm
2 # cared by Pedro M. Gomez-Fabre <pgf18872-at-gsk-dot-com>
8 Bio::PopGen::HtSNP.pm- Select htSNP from a haplotype set
12 use Bio::PopGen::HtSNP;
14 my $obj = Bio::PopGen::HtSNP->new($hap,$snp,$pop);
18 Select the minimal set of SNP that contains the full information about
19 the haplotype without redundancies.
21 Take as input the followin values:
25 =item - the haplotype block (array of array).
27 =item - the snp id (array).
29 =item - family information and frequency (array of array).
33 The final haplotype is generated in a numerical format and the SNP's
34 sets can be retrieve from the module.
39 - If you force to include a family with indetermination, the SNP's
40 with indetermination will be removed from the analysis, so consider
41 before to place your data set what do you really want to do.
43 - If two families have the same information (identical haplotype), one
44 of them will be removed and the removed files will be stored classify
47 - Only are accepted for calculation A, C, G, T and - (as deletion) and
48 their combinations. Any other value as n or ? will be considered as
49 degenerations due to lack of information.
53 On a haplotype set is expected that some of the SNP and their
54 variations contribute in the same way to the haplotype. Eliminating
55 redundancies will produce a minimal set of SNP's that can be used as
56 input for a taging selection process. On the process SNP's with the
57 same variation are clustered on the same group.
59 The idea is that because the tagging haplotype process is
60 exponential. All redundant information we could eliminate on the
61 tagging process will help to find a quick result.
65 my $obj = Bio::PopGen::HtSNP->new
66 (-haplotype_block => \@haplotype_patterns,
67 -snp_ids => \@snp_ids,
68 -pattern_freq => \@pattern_name_and_freq);
70 where $hap, $snp and $pop are in the format:
76 ]; # haplotype patterns' id
78 my $snp = [qw/s1 s2 s3 s4/]; # snps' Id's
84 ]; # haplotype_pattern_id Frequency
88 See Below for more detailed summaries.
93 =head2 How the process is working with one example
95 Let's begin with one general example of the code.
104 The first thing to to is to B<split the haplotype> into characters.
111 Now we have to B<convert> the haplotype to B<Upercase>. This
112 will produce the same SNP if we have input a or A.
119 The program admit as values any combination of ACTG and - (deletions).
120 The haplotype is B<converted to number>, considering the first variation
121 as zero and the alternate value as 1 (see expanded description below).
128 Once we have the haplotype converted to numbers we have to generate the
129 snp type information for the haplotype.
132 B<SNP code = SUM ( value * multiplicity ^ position );>
135 SUM is the sum of the values for the SNP
136 value is the SNP number code (0 [generally for the mayor allele],
137 1 [for the minor allele].
138 position is the position on the block.
140 For this example the code is:
146 ------------------------------------------------------------------
147 14 10 12 4 2 14 14 14 14
149 14 = 0*2^0 + 1*2^1 + 1*2^2 + 1*2^3
150 12 = 0*2^0 + 1*2^1 + 0*2^2 + 1*2^3
153 Once we have the families classify. We will B<take> just the SNP's B<not
158 This information will be B<passed to the tag module> is you want to tag
161 Whatever it happens to one SNPs of a class will happen to a SNP of
162 the same class. Therefore you don't need to scan redundancies
164 =head2 Working with fuzzy data.
166 This module is designed to work with fuzzy data. As the source of the
167 haplotype is diverse. The program assume that some haplotypes can be
168 generated using different values. If there is any indetermination (? or n)
169 or any other degenerated value or invalid. The program will take away
170 This SNP and will leave that for a further analysis.
172 On a complex situation:
182 On this haplotype everything is happening. We have a multialelic variance.
183 We have indeterminations. We have deletions and we have even one SNP
184 which is not a real SNP.
186 The buiding process will be the same on this situation.
188 Convert the haplotype to uppercase.
198 All columns that present indeterminations will be removed from the analysis
201 hapotype after remove columns:
211 All changes made on the haplotype matrix, will be also made on the SNP list.
213 snp_id_1 snp_id_2 snp_id_4 snp_id_6 snp_id_8 snp_id_9
215 now the SNP that is not one SNP will be removed from the analysis.
216 SNP with Id snp_id_4 (the one with all T's).
219 because of the removing. Some of the families will become the same and will
220 be clustered. A posteriori analysis will diference these families.
221 but because of the indetermination can not be distinguish.
231 The result of the mergering will go like:
238 Once again the changes made on the families and we merge the frequency (I<to be
241 Before to convert the haplotype into numbers we consider how many variations
242 we have on the set. On this case the variations are 3.
244 The control code will use on this situation base three as mutiplicity
250 -----------------------------------
253 And the minimal set for this combination is
260 B<NOTE:> this second example is a remote example an on normal conditions. This
261 conditions makes no sense, but as the haplotypes, can come from many sources
262 we have to be ready for all kind of combinations.
269 User feedback is an integral part of the evolution of this and other
270 Bioperl modules. Send your comments and suggestions preferably to
271 the Bioperl mailing list. Your participation is much appreciated.
273 bioperl-l@bioperl.org - General discussion
274 http://bioperl.org/wiki/Mailing_lists - About the mailing lists
276 =head2 Reporting Bugs
278 Report bugs to the Bioperl bug tracking system to help us keep track
279 of the bugs and their resolution. Bug reports can be submitted via
282 http://bugzilla.open-bio.org/
284 =head1 AUTHOR - Pedro M. Gomez-Fabre
286 Email pgf18872-at-gsk-dot-com
291 The rest of the documentation details each of the object methods.
292 Internal methods are usually preceded with a _
296 # Let the code begin...
298 package Bio
::PopGen
::HtSNP
;
300 use Storable
qw(dclone);
306 use base
qw(Bio::Root::Root);
310 Bio::PopGen::HtSNP->new(-haplotype_block -ids -pattern_freq)
317 Function: constructor of the class.
318 Usage : $obj-> Bio::PopGen::HtSNP->new(-haplotype_block
322 Args : input haplotype (array of array)
324 pop_freq (array of array)
330 my($class, @args) = @_;
332 my $self = $class->SUPER::new
(@args);
333 my ($haplotype_block,
335 $pattern_freq ) = $self->_rearrange([qw(HAPLOTYPE_BLOCK
337 PATTERN_FREQ)],@args);
339 if ($haplotype_block){
340 $self->haplotype_block($haplotype_block);
343 $self->throw("Haplotype block has not been defined.
347 $self->snp_ids($snp_ids);
350 $self->throw("Array with ids has not been defined.
354 $self->pattern_freq($pattern_freq);
357 $self->throw("Array with pattern id and frequency has not been defined.
361 # if the input values are not well formed complained and exit.
369 =head2 haplotype_block
371 Title : haplotype_block
372 Usage : my $haplotype_block = $HtSNP->haplotype_block();
373 Function: Get the haplotype block for a haplotype tagging selection
374 Returns : reference of array
375 Args : reference of array with haplotype pattern
382 return $self->{'_haplotype_block'} = shift if @_;
383 return $self->{'_haplotype_block'};
389 Usage : my $snp_ids = $HtSNP->$snp_ids();
390 Function: Get the ids for a haplotype tagging selection
391 Returns : reference of array
392 Args : reference of array with SNP ids
399 return $self->{'_snp_ids'} = shift if @_;
400 return $self->{'_snp_ids'};
407 Usage : my $pattern_freq = $HtSNP->pattern_freq();
408 Function: Get the pattern id and frequency for a haplotype
410 Returns : reference of array
411 Args : reference of array with SNP ids
417 return $self->{'_pattern_freq'} = shift if @_;
418 return $self->{'_pattern_freq'};
424 Usage : _check_input($self)
425 Function: check for errors on the input
432 #------------------------
434 #------------------------
438 _haplotype_length_error
($self);
439 _population_error
($self);
443 =head2 _haplotype_length_error
445 Title : _haplotype_length_error
446 Usage : _haplotype_length_error($self)
447 Function: check if the haplotype length is the same that the one on the
448 SNP id list. If not break and exit
456 #------------------------
457 sub _haplotype_length_error
{
458 #------------------------
462 my $input_block = $self->haplotype_block();
463 my $snp_ids = $self->snp_ids();
466 #############################
468 #############################
469 my $different_haplotype_length = 0;
471 ##############################
472 # get parameters used to find
474 ##############################
476 my $snp_number = scalar @
$snp_ids;
477 my $number_of_families = scalar @
$input_block;
478 my $h = 0; # haplotype position
481 ############################
484 # if the length differs from the number of ids
485 ############################
487 for ($h=0; $h<$#$input_block+1 ; $h++){
488 if (length $input_block->[$h] != $snp_number){
489 $different_haplotype_length = 1;
494 # haploytypes does not have the same length
495 if ($different_haplotype_length){
496 $self->throw("The number of snp ids is $snp_number and ".
497 "the length of the family (". ($h+1) .") [".
498 $input_block->[$h]."] is ".
499 length $input_block->[$h], "\n");
503 =head2 _population_error
506 Title : _population_error
507 Usage : _population_error($self)
508 Function: use input_block and pop_freq test if the number of elements
509 match. If doesn't break and quit.
517 #------------------------
518 sub _population_error
{
519 #------------------------
523 my $input_block = $self->haplotype_block();
524 my $pop_freq = $self->pattern_freq();
526 #############################
528 #############################
529 my $pop_freq_elements_error = 0; # matrix bad formed
531 ##############################
532 # get parameters used to find
534 ##############################
535 my $number_of_families = scalar @
$input_block;
537 my $pf = 0; # number of elements on population frequency
538 my $frequency = 0; # population frequency
541 # check if the pop_freq array is well formed and if the number
542 # of elements fit with the number of families
544 #############################
545 # check population frequency
547 # - population frequency matrix need to be well formed
548 # - get the frequency
549 # - calculate number of families on pop_freq
550 #############################
552 for ($pf=0; $pf<$#$pop_freq+1; $pf++){
553 $frequency += $pop_freq->[$pf]->[1];
555 if ( scalar @
{$pop_freq->[$pf]} !=2){
556 $p_f_length = scalar @
{$pop_freq->[$pf]};
557 $pop_freq_elements_error = 1;
562 ###########################
564 ###########################
567 # The frequency shouldn't be greater than 1
569 $self->warn("The frequency for this set is $frequency (greater than 1)\n");
572 # the haplotype matix is not well formed
573 if ($pop_freq_elements_error){
574 $self->throw("the frequency matrix is not well formed\n".
575 "\nThe number of elements for pattern ".($pf+1)." is ".
577 "It should be 2 for pattern \"@{$pop_freq->[$pf]}\"\n".
578 "\nFormat should be:\n".
579 "haplotype_id\t frequency\n"
583 # the size does not fit on pop_freq array
584 # with the one in haplotype (input_block)
585 if ($pf != $number_of_families) {
586 $self->throw("The number of patterns on frequency array ($pf)\n".
587 "does not fit with the number of haplotype patterns on \n".
588 "haplotype array ($number_of_families)\n");
596 Usage : _do_it($self)
597 Function: Process the input generating the results.
604 #------------------------
606 #------------------------
610 # first we are goinf to define here all variables we are going to use
611 $self -> {'w_hap'} = [];
612 $self -> {'w_pop_freq'} = dclone
( $self ->pattern_freq() );
613 $self -> {'deg_pattern'} = {};
614 $self -> {'snp_type'} = {}; # type of snp on the set. see below
615 $self -> {'alleles_number'} = 0; # number of variations (biallelic,...)
616 $self -> {'snp_type_code'} = [];
617 $self -> {'ht_type'} = []; # store the snp type used on the htSet
618 $self -> {'split_hap'} = [];
619 $self -> {'snp_and_code'} = [];
622 # we classify the SNP under snp_type
623 $self->{snp_type
}->{useful_snp
} = dclone
( $self ->snp_ids() );
624 $self->{snp_type
}->{deg_snp
} = []; # deg snp
625 $self->{snp_type
}->{silent_snp
} = []; # not a real snp
627 # split the haplotype
628 _split_haplo
($self);
630 # first we convert to upper case the haplotype
631 # to make A the same as a for comparison
632 _to_upper_case
( $self -> {w_hap
} );
634 #######################################################
635 # check if any SNP has indetermination. If any SNP has
636 # indetermination this value will be removed.
637 #######################################################
638 _remove_deg
( $self );
640 #######################################################
641 # depending of the families you use some SNPs can be
642 # silent. This silent SNP's are not used on the
643 # creation of tags and has to be skipped from the
645 #######################################################
646 _rem_silent_snp
( $self );
648 #######################################################
649 # for the remaining SNP's we have to check if two
650 # families have the same value. If this is true, the families
651 # will produce the same result and therefore we will not find
652 # any pattern. So, the redundant families need to be take
653 # away from the analysis. But also considered for a further
656 # When we talk about a normal haplotype blocks this situation
657 # makes no sense but if we remove one of the snp because the
658 # degeneration two families can became the same.
659 # these families may be analised on a second round
660 #######################################################
662 _find_deg_pattern
( $self );
664 #################################################################
665 # if the pattern list length is different to the lenght of the w_hap
666 # we can tell that tow columns have been considered as the same one
667 # and therefore we have to start to remove the values.
668 # remove all columns with degeneration
670 # For this calculation we don't use the pattern frequency.
671 # All patterns are the same, This selection makes
672 # sense when you have different frequency.
674 # Note: on this version we don't classify the haplotype by frequency
675 # but if you need to do it. This is the place to do it!!!!
677 # In reality you don't need to sort the values because you will remove
678 # the values according to their values.
680 # But as comes from a hash, the order could be different and as a
681 # consequence the code generate on every run of the same set could
682 # differ. That is not important. In fact, does not matter but could
684 #################################################################
686 my @tmp =sort { $a <=> $b}
687 keys %{$self -> {deg_pattern
}}; # just count the families
689 # if the size of the list is different to the size of the degenerated
690 # family. There is degeneration. And the redundancies will be
692 if($#tmp != $#{$self -> { w_hap } } ){
693 _keep_these_patterns
($self->{w_hap
}, \
@tmp);
694 _keep_these_patterns
($self->{w_pop_freq
}, \
@tmp);
697 #################################################################
698 # the steps made before about removing snp and cluster families
699 # are just needed pre-process the haplotype before.
701 # Now is when the fun starts.
704 # once we have the this minimal matrix, we have to calculate the
705 # max multipliticy for the values. The max number of alleles found
706 # on the set. A normal haplotype is biallelic but we can not
707 # reject multiple variations.
708 ##################################################################
710 _alleles_number
( $self );
712 ##################################################################
713 # Now we have to convert the haplotype into number
720 # one haplotype like this transformed into number produce this result
727 ##################################################################
729 _convert_to_numbers
( $self );
731 ###################################################################
732 # The next step is to calculate the type of the SNP.
733 # This process is made based on the position of the SNP, the value
734 # and its multiplicity.
735 ###################################################################
737 _snp_type_code
( $self );
739 ###################################################################
740 # now we have all information we need to calculate the haplotype
742 ###################################################################
746 ###################################################################
749 # all SNP have a code. but if the SNP is not used this code must
750 # be zero in case of silent SNP. This looks not to informative
751 # because all the information is already there. But this method
752 # compile the full set.
753 ###################################################################
755 _snp_and_code_summary
( $self );
761 Usage : $obj->input_block()
762 Function: returns input block
763 Returns : reference to array of array
769 #------------------------
771 #------------------------
774 return $self -> {input_block
};
780 Usage : $obj->hap_length()
781 Function: get numbers of SNP on the haplotype
788 #------------------------
790 #------------------------
793 return scalar @
{$self -> {'_snp_ids'}};
800 Usage : $obj->pop_freq()
801 Function: returns population frequency
802 Returns : reference to array
808 #------------------------
810 #------------------------
813 return $self -> {pop_freq
}
821 Usage : $obj->deg_snp()
822 Function: returns snp_removes due to indetermination on their values
823 Returns : reference to array
829 #------------------------
831 #------------------------
833 return $self -> {snp_type
} ->{deg_snp
};
841 Usage : $obj->snp_type()
842 Function: returns hash with SNP type
843 Returns : reference to hash
849 #------------------------
851 #------------------------
853 return $self -> {snp_type
};
861 Usage : $obj->silent_snp()
862 Function: some SNP's are silent (not contibuting to the haplotype)
863 and are not considering for this analysis
864 Returns : reference to a array
870 #------------------------
872 #------------------------
874 return $self -> {snp_type
} ->{silent_snp
};
882 Usage : $obj->useful_snp()
883 Function: returns list of SNP's that are can be used as htSNP. Some
884 of them can produce the same information. But this is
886 Returns : reference to a array
892 #------------------------
894 #------------------------
896 return $self -> {snp_type
} ->{useful_snp
};
904 Usage : $obj->ht_type()
905 Function: every useful SNP has a numeric code dependending of its
906 value and position. For a better description see
907 description of the module.
908 Returns : reference to a array
914 #------------------------
916 #------------------------
918 return $self -> {ht_type
};
924 Usage : $obj->ht_set()
925 Function: returns the minimal haplotype in numerical format. This
926 haplotype contains the maximal information about the
927 haplotype variations but with no redundancies. It's the
928 minimal set that describes the haplotype.
929 Returns : reference to an array of arrays
935 #------------------------
937 #------------------------
939 return $self -> {w_hap
};
945 Title : snp_type_code
946 Usage : $obj->snp_type_code()
947 Function: returns the numeric code of the SNPs that need to be
948 tagged that correspond to the SNP's considered in ht_set.
949 Returns : reference to an array
955 #------------------------
957 #------------------------
959 return $self -> {snp_type_code
};
966 Usage : $obj->snp_and_code()
967 Function: Returns the full list of SNP's and the code associate to
968 them. If the SNP belongs to the group useful_snp it keep
969 this code. If the SNP is silent the code is 0. And if the
970 SNP is degenerated the code is -1.
971 Returns : reference to an array of array
977 #------------------------
979 #------------------------
981 return $self -> {'snp_and_code'};
988 Usage : $obj->deg_pattern()
989 Function: Returns the a list with the degenerated haplotype.
990 Sometimes due to degeneration some haplotypes looks
991 the same and if we don't remove them it won't find
993 Returns : reference to a hash of array
999 #------------------------
1001 #------------------------
1004 return $self -> {'deg_pattern'};
1011 Usage : $obj->split_hap()
1012 Function: simple representation of the haplotype base by base
1013 Same information that input haplotype but base based.
1014 Returns : reference to an array of array
1020 #------------------------
1022 #------------------------
1024 return $self -> {'split_hap'};
1029 Title : _split_haplo
1030 Usage : _split_haplo($self)
1031 Function: Take a haplotype and split it into bases
1038 #------------------------
1040 #------------------------
1043 my $in = $self ->{'_haplotype_block'};
1044 my $out = $self ->{'w_hap'};
1046 # split every haplotype and store the result into $out
1048 push @
$out, [split (//,$_)];
1051 $self -> {'split_hap'} = dclone
($out);
1054 # internal method to convert the haplotype to uppercase
1057 =head2 _to_upper_case
1060 Title : _to_upper_case
1061 Usage : _to_upper_case()
1062 Function: make SNP or in-dels Upper case
1069 #------------------------
1070 sub _to_upper_case
{
1071 #------------------------
1074 foreach my $aref (@
$arr){
1075 foreach my $value (@
{$aref} ){
1086 Usage : _remove_deg()
1087 Function: when have a indetermination or strange value this SNP
1089 Returns : haplotype family set and degeneration list
1090 Args : ref to an AoA and a ref to an array
1095 #------------------------
1097 #------------------------
1100 my $hap = $self->{w_hap
};
1101 my $snp = $self->{snp_type
}->{useful_snp
};
1102 my $deg_snp = $self->{snp_type
}->{deg_snp
};
1104 my $rem = []; # take the position of the array to be removed
1106 # first we work on the columns we have void values
1107 $rem = _find_indet
($hap,$rem); # find degenerated columns
1111 # remove column on haplotype
1112 _remove_col
($hap,$rem); # remove list
1114 # now remove the values from SNP id
1115 _remove_snp_id
($snp,$deg_snp,$rem); # remove list
1120 =head2 _rem_silent_snp
1123 Title : _rem_silent_snp
1124 Usage : _rem_silent_snp()
1125 Function: there is the remote possibilty that one SNP won't be a
1126 real SNP on this situation we have to remove this SNP,
1127 otherwise the program won't find any tag
1129 Args : ref to an AoA and a ref to an array
1134 #------------------------
1135 sub _rem_silent_snp
{
1136 #------------------------
1139 my $hap = $self->{w_hap
};
1140 my $snp = $self->{snp_type
}->{useful_snp
};
1141 my $silent_snp = $self->{snp_type
}->{silent_snp
};
1143 my $rem = []; # store the positions to be removed
1145 #find columns with no variation on the SNP, Real snp?
1146 $rem = _find_silent_snps
($hap);
1150 # remove column on haplotype
1151 _remove_col
($hap,$rem);
1153 # remove the values from SNP id
1154 _remove_snp_id
($snp,$silent_snp,$rem);
1159 =head2 _find_silent_snps
1162 Title : _find_silent_snps
1164 Function: list of snps that are not SNPs. All values for that
1165 SNPs on the set is the same one. Look stupid but can
1166 happend and if this happend you will not find any tag
1173 #------------------------
1174 sub _find_silent_snps
{
1175 #------------------------
1178 my $list =[]; # no snp list;
1180 # determine the number of snp by the length of the first row.
1181 # we assume that the matrix is squared.
1182 my $colsn= @
{$arr->[0]};
1184 for (my $i=0;$i<$colsn;$i++){
1185 my $different =0; # check degeneration
1187 for my $r (1..$#$arr){
1188 if($arr->[0][$i] ne $arr->[$r][$i]){
1208 Function: find column (SNP) with invalid or degenerated values
1209 and store this values into the second parameter suplied.
1211 Args : ref to AoA and ref to an array
1216 #------------------------
1218 #------------------------
1219 my ($arr, $list)=@_;
1221 foreach my $i(0..$#$arr){
1222 foreach my $j(0..$#{$arr->[$i]}){
1223 unless ($arr->[$i][$j] =~ /[ACTG-]/){
1228 my $found =0; # check if already exist the value
1229 foreach my $k(0..$#$list){
1230 $found =1 if ($list->[$k] eq $j);
1241 @
$list = sort { $a <=> $b} @
$list;
1250 Function: remove columns contained on the second array from
1253 Args : array of array reference and array reference
1258 #------------------------
1260 #------------------------
1263 foreach my $col (reverse @
$rem){
1264 splice @
$_, $col, 1 for @
$arr;
1269 =head2 _remove_snp_id
1271 Title : _remove_snp_id
1273 Function: remove columns contained on the second array from
1276 Args : array of array reference and array reference
1281 #------------------------
1283 #------------------------
1284 my ($arr,$removed,$rem_list)=@_;
1286 push @
$removed, splice @
$arr, $_, 1 foreach reverse @
$rem_list;
1290 =head2 _find_deg_pattern
1292 Title : _find_deg_pattern
1294 Function: create a list with the degenerated patterns
1301 #------------------------
1302 sub _find_deg_pattern
{
1303 #------------------------
1306 my $arr = $self ->{w_hap
}; # the working haplotype
1307 my $list = $self ->{'deg_pattern'}; # degenerated patterns
1309 # we have to check all elements
1310 foreach my $i(0..$#$arr){
1311 # is the element has not been used create a key
1312 unless ( _is_on_hash
($list,\
$i) ) {
1316 foreach my $j($i+1..$#$arr){
1317 my $comp = compare_arrays
($arr->[$i],$arr->[$j]);
1320 # as we have no elements we push this into the list
1321 # check for the first element
1322 my $key = _key_for_value
($list,\
$i);
1324 push (@
{$list->{$key}},$j);
1333 #------------------------
1335 #------------------------
1336 my($hash,$value)=@_;
1338 foreach my $key (keys %$hash){
1339 if( _is_there
(\@
{$hash->{$key}},$value)){
1345 #------------------------
1347 #------------------------
1348 my($hash,$value)=@_;
1350 foreach my $key (keys %$hash){
1351 if( _is_there
(\@
{$hash->{$key}},$value)){
1357 #------------------------
1359 #------------------------
1363 foreach my $el (@
$arr){
1364 if ($el eq $$value){
1371 =head2 _keep_these_patterns
1374 Title : _keep_these_patterns
1376 Function: this is a basic approach, take a LoL and a list,
1377 keep just the columns included on the list
1379 Args : an AoA and an array
1384 #------------------------
1385 sub _keep_these_patterns
{
1386 #------------------------
1389 # by now we just take one of the repetitions but you can weight
1390 # the values by frequency
1394 foreach my $k (@
$list){
1395 push @outValues, $arr->[$k];
1398 #make arr to hold the new values
1399 @
$arr= @
{dclone
(\
@outValues)};
1404 =head2 compare_arrays
1407 Title : compare_arrays
1409 Function: take two arrays and compare their values
1410 Returns : 1 if the two values are the same
1411 0 if the values are different
1412 Args : an AoA and an array
1417 #------------------------
1418 sub compare_arrays
{
1419 #------------------------
1420 my ($first, $second) = @_;
1421 return 0 unless @
$first == @
$second;
1422 for (my $i = 0; $i < @
$first; $i++) {
1423 return 0 if $first->[$i] ne $second->[$i];
1429 =head2 _convert_to_numbers
1432 Title : _convert_to_numbers
1433 Usage : _convert_to_numbers()
1434 Function: tranform the haplotype into numbers. before to do that
1435 we have to consider the variation on the set.
1437 Args : ref to an AoA and a ref to an array
1442 #------------------------
1443 sub _convert_to_numbers
{
1444 #------------------------
1447 my $hap_ref = $self->{w_hap
};
1448 my $mm = $self->{alleles_number
};
1450 # the first element is considered as zero. The first modification
1451 # is consider as one and so on.
1453 my $length = @
{ @
$hap_ref[0]}; #length of the haplotype
1455 for (my $c = 0; $c<$length;$c++){
1459 for my $r (0..$#$hap_ref){
1461 push @al,$hap_ref->[$r][$c]
1462 unless _is_there
(\
@al,\
$hap_ref->[$r][$c]);
1464 $hap_ref->[$r][$c] = get_position
(\
@al,\
$hap_ref->[$r][$c]);
1470 =head2 _snp_type_code
1473 Title : _snp_type_code
1476 we have to create the snp type code for each version.
1477 The way the snp type is created is the following:
1479 we take the number value for every SNP and do the
1480 following calculation
1482 let be a SNP set as follow:
1489 on this case the situation is:
1491 sum (value * multiplicity ^ position) for each SNP
1493 0 * 3 ^ 0 + 1 * 3 ^ 1 + 1 * 3 ^ 2 = 12
1494 0 * 3 ^ 0 + 1 * 3 ^ 1 + 2 * 3 ^ 2 = 21
1501 #------------------------
1503 #------------------------
1506 my $hap = $self->{w_hap
};
1507 my $arr = $self->{snp_type_code
};
1508 my $al = $self->{alleles_number
};
1510 my $length = @
{ $hap->[0]}; #length of the haplotype
1512 for (my $c=0; $c<$length; $c++){
1513 for my $r (0..$#$hap){
1514 $arr->[$c] += $hap->[$r][$c] * $al ** $r;
1519 #################################################
1520 # return the position of an element in one array
1521 # The element is always present on the array
1522 #################################################
1524 #------------------------
1526 #------------------------
1528 my($array, $value)=@_;
1530 for my $i(0..$#$array) {
1531 if ($array->[$i] eq $$value){
1539 =head2 _alleles_number
1542 Title : _alleles_number
1544 Function: calculate the max number of alleles for a haplotype and
1545 if the number. For each SNP the number is stored and the
1546 max number of alleles for a SNP on the set is returned
1547 Returns : max number of alleles (a scalar storing a number)
1553 #------------------------
1554 sub _alleles_number
{
1555 #------------------------
1559 my $hap_ref = $self ->{w_hap
}; # working haplotype
1561 my $length = @
{ @
$hap_ref[0]}; # length of the haplotype
1563 for (my $c = 0; $c<$length;$c++){
1567 for my $r (0..$#$hap_ref){
1568 $alleles{ $hap_ref->[$r][$c] } =1; # new key for every new snp
1571 # if the number of alleles for this column is
1572 # greater than before set $m value as allele number
1573 if ($self->{alleles_number
} < keys %alleles) {
1574 $self->{alleles_number
} = keys %alleles;
1585 Function: calculate the minimal set that contains all information of the
1588 Args : ref to an AoA and a ref to an array
1593 #------------------------
1595 #------------------------
1598 my $hap = $self->{'w_hap'};
1599 my $type = $self->{'snp_type_code'};
1600 my $set = $self->{'ht_type'};
1601 my $out = []; # store the minimal set
1603 my $nc=0; # new column for the output values
1605 # pass for every value of the snp_type_code
1606 for my $c (0..$#$type){
1610 # every new value (not present) is pushed into set
1611 if ( ! _is_there
( $set,\
$type->[$c] ) ){
1612 push @
$set, $type->[$c];
1616 for my $r(0..$#$hap){
1617 #save value of the snp for every SNP
1618 $out->[$r][$nc]= $hap->[$r][$c];
1622 if ($exist){ $nc++ };
1625 @
$hap = @
{dclone
$out};
1628 =head2 _snp_and_code_summary
1630 Title : _snp_and_code_summary
1631 Usage : _snp_and_code_summary()
1632 Function: compile on a list all SNP and the code for each. This
1633 information can be also obtained combining snp_type and
1634 snp_type_code but on these results the information about
1635 the rest of SNP's are not compiled as table.
1637 0 will be silent SNPs
1638 -1 are degenerated SNPs
1639 and the rest of positive values are the code for useful SNP
1642 Args : ref to an AoA and a ref to an array
1647 #------------------------
1648 sub _snp_and_code_summary
{
1649 #------------------------
1652 my $snp_type_code = $self->{'snp_type_code'};
1653 my $useful_snp = $self->{'snp_type'}->{'useful_snp'};
1654 my $silent_snp = $self->{'snp_type'}->{'silent_snp'};
1655 my $deg_snp = $self->{'snp_type'}->{'deg_snp'};
1656 my $snp_ids = $self->snp_ids();
1657 my $snp_and_code = $self->{'snp_and_code'};
1659 # walk all SNP's and generate code for each
1661 # do a practical thing. Consider all snp silent
1662 foreach my $i (0..$#$snp_ids){
1664 # assign zero to silent
1668 foreach my $j (0..$#$useful_snp){
1669 if ($snp_ids->[$i] eq $useful_snp->[$j]){
1670 $value = $snp_type_code->[$j];
1675 # assign -1 to degenerated
1676 foreach my $j (0..$#$deg_snp){
1677 if ($snp_ids->[$i] eq $deg_snp->[$j]){
1683 push @
$snp_and_code, [$snp_ids->[$i], $value];