1 # module Bio::PopGen::HtSNP.pm
2 # cared by Pedro M. Gomez-Fabre <pgf18872-at-gsk-dot-com>
8 Bio::PopGen::HtSNP.pm- Select htSNP from a haplotype set
12 use Bio::PopGen::HtSNP;
14 my $obj = Bio::PopGen::HtSNP->new($hap,$snp,$pop);
18 Select the minimal set of SNP that contains the full information about
19 the haplotype without redundancies.
21 Take as input the followin values:
25 =item - the haplotype block (array of array).
27 =item - the snp id (array).
29 =item - family information and frequency (array of array).
33 The final haplotype is generated in a numerical format and the SNP's
34 sets can be retrieve from the module.
39 - If you force to include a family with indetermination, the SNP's
40 with indetermination will be removed from the analysis, so consider
41 before to place your data set what do you really want to do.
43 - If two families have the same information (identical haplotype), one
44 of them will be removed and the removed files will be stored classify
47 - Only are accepted for calculation A, C, G, T and - (as deletion) and
48 their combinations. Any other value as n or ? will be considered as
49 degenerations due to lack of information.
53 On a haplotype set is expected that some of the SNP and their
54 variations contribute in the same way to the haplotype. Eliminating
55 redundancies will produce a minimal set of SNP's that can be used as
56 input for a taging selection process. On the process SNP's with the
57 same variation are clustered on the same group.
59 The idea is that because the tagging haplotype process is
60 exponential. All redundant information we could eliminate on the
61 tagging process will help to find a quick result.
65 my $obj = Bio::PopGen::HtSNP->new
66 (-haplotype_block => \@haplotype_patterns,
67 -snp_ids => \@snp_ids,
68 -pattern_freq => \@pattern_name_and_freq);
70 where $hap, $snp and $pop are in the format:
76 ]; # haplotype patterns' id
78 my $snp = [qw/s1 s2 s3 s4/]; # snps' Id's
84 ]; # haplotype_pattern_id Frequency
88 See Below for more detailed summaries.
93 =head2 How the process is working with one example
95 Let's begin with one general example of the code.
104 The first thing to to is to B<split the haplotype> into characters.
111 Now we have to B<convert> the haplotype to B<Upercase>. This
112 will produce the same SNP if we have input a or A.
119 The program admit as values any combination of ACTG and - (deletions).
120 The haplotype is B<converted to number>, considering the first variation
121 as zero and the alternate value as 1 (see expanded description below).
128 Once we have the haplotype converted to numbers we have to generate the
129 snp type information for the haplotype.
132 B<SNP code = SUM ( value * multiplicity ^ position );>
135 SUM is the sum of the values for the SNP
136 value is the SNP number code (0 [generally for the mayor allele],
137 1 [for the minor allele].
138 position is the position on the block.
140 For this example the code is:
146 ------------------------------------------------------------------
147 14 10 12 4 2 14 14 14 14
149 14 = 0*2^0 + 1*2^1 + 1*2^2 + 1*2^3
150 12 = 0*2^0 + 1*2^1 + 0*2^2 + 1*2^3
153 Once we have the families classify. We will B<take> just the SNP's B<not
158 This information will be B<passed to the tag module> is you want to tag
161 Whatever it happens to one SNPs of a class will happen to a SNP of
162 the same class. Therefore you don't need to scan redundancies
164 =head2 Working with fuzzy data.
166 This module is designed to work with fuzzy data. As the source of the
167 haplotype is diverse. The program assume that some haplotypes can be
168 generated using different values. If there is any indetermination (? or n)
169 or any other degenerated value or invalid. The program will take away
170 This SNP and will leave that for a further analysis.
172 On a complex situation:
182 On this haplotype everything is happening. We have a multialelic variance.
183 We have indeterminations. We have deletions and we have even one SNP
184 which is not a real SNP.
186 The buiding process will be the same on this situation.
188 Convert the haplotype to uppercase.
198 All columns that present indeterminations will be removed from the analysis
201 hapotype after remove columns:
211 All changes made on the haplotype matrix, will be also made on the SNP list.
213 snp_id_1 snp_id_2 snp_id_4 snp_id_6 snp_id_8 snp_id_9
215 now the SNP that is not one SNP will be removed from the analysis.
216 SNP with Id snp_id_4 (the one with all T's).
219 because of the removing. Some of the families will become the same and will
220 be clustered. A posteriori analysis will diference these families.
221 but because of the indetermination can not be distinguish.
231 The result of the mergering will go like:
238 Once again the changes made on the families and we merge the frequency (I<to be
241 Before to convert the haplotype into numbers we consider how many variations
242 we have on the set. On this case the variations are 3.
244 The control code will use on this situation base three as mutiplicity
250 -----------------------------------
253 And the minimal set for this combination is
260 B<NOTE:> this second example is a remote example an on normal conditions. This
261 conditions makes no sense, but as the haplotypes, can come from many sources
262 we have to be ready for all kind of combinations.
269 User feedback is an integral part of the evolution of this and other
270 Bioperl modules. Send your comments and suggestions preferably to
271 the Bioperl mailing list. Your participation is much appreciated.
273 bioperl-l@bioperl.org - General discussion
274 http://bioperl.org/wiki/Mailing_lists - About the mailing lists
278 Please direct usage questions or support issues to the mailing list:
280 I<bioperl-l@bioperl.org>
282 rather than to the module maintainer directly. Many experienced and
283 reponsive experts will be able look at the problem and quickly
284 address it. Please include a thorough description of the problem
285 with code and data examples if at all possible.
287 =head2 Reporting Bugs
289 Report bugs to the Bioperl bug tracking system to help us keep track
290 of the bugs and their resolution. Bug reports can be submitted via
293 https://github.com/bioperl/bioperl-live/issues
295 =head1 AUTHOR - Pedro M. Gomez-Fabre
297 Email pgf18872-at-gsk-dot-com
302 The rest of the documentation details each of the object methods.
303 Internal methods are usually preceded with a _
307 # Let the code begin...
309 package Bio
::PopGen
::HtSNP
;
311 use Storable
qw(dclone);
317 use base
qw(Bio::Root::Root);
321 Bio::PopGen::HtSNP->new(-haplotype_block -ids -pattern_freq)
328 Function: constructor of the class.
329 Usage : $obj-> Bio::PopGen::HtSNP->new(-haplotype_block
333 Args : input haplotype (array of array)
335 pop_freq (array of array)
341 my($class, @args) = @_;
343 my $self = $class->SUPER::new
(@args);
344 my ($haplotype_block,
346 $pattern_freq ) = $self->_rearrange([qw(HAPLOTYPE_BLOCK
348 PATTERN_FREQ)],@args);
350 if ($haplotype_block){
351 $self->haplotype_block($haplotype_block);
354 $self->throw("Haplotype block has not been defined.
358 $self->snp_ids($snp_ids);
361 $self->throw("Array with ids has not been defined.
365 $self->pattern_freq($pattern_freq);
368 $self->throw("Array with pattern id and frequency has not been defined.
372 # if the input values are not well formed complained and exit.
380 =head2 haplotype_block
382 Title : haplotype_block
383 Usage : my $haplotype_block = $HtSNP->haplotype_block();
384 Function: Get the haplotype block for a haplotype tagging selection
385 Returns : reference of array
386 Args : reference of array with haplotype pattern
393 return $self->{'_haplotype_block'} = shift if @_;
394 return $self->{'_haplotype_block'};
400 Usage : my $snp_ids = $HtSNP->$snp_ids();
401 Function: Get the ids for a haplotype tagging selection
402 Returns : reference of array
403 Args : reference of array with SNP ids
410 return $self->{'_snp_ids'} = shift if @_;
411 return $self->{'_snp_ids'};
418 Usage : my $pattern_freq = $HtSNP->pattern_freq();
419 Function: Get the pattern id and frequency for a haplotype
421 Returns : reference of array
422 Args : reference of array with SNP ids
428 return $self->{'_pattern_freq'} = shift if @_;
429 return $self->{'_pattern_freq'};
435 Usage : _check_input($self)
436 Function: check for errors on the input
443 #------------------------
445 #------------------------
449 _haplotype_length_error
($self);
450 _population_error
($self);
454 =head2 _haplotype_length_error
456 Title : _haplotype_length_error
457 Usage : _haplotype_length_error($self)
458 Function: check if the haplotype length is the same that the one on the
459 SNP id list. If not break and exit
467 #------------------------
468 sub _haplotype_length_error
{
469 #------------------------
473 my $input_block = $self->haplotype_block();
474 my $snp_ids = $self->snp_ids();
477 #############################
479 #############################
480 my $different_haplotype_length = 0;
482 ##############################
483 # get parameters used to find
485 ##############################
487 my $snp_number = scalar @
$snp_ids;
488 my $number_of_families = scalar @
$input_block;
489 my $h = 0; # haplotype position
492 ############################
495 # if the length differs from the number of ids
496 ############################
498 for ($h=0; $h<$#$input_block+1 ; $h++){
499 if (length $input_block->[$h] != $snp_number){
500 $different_haplotype_length = 1;
505 # haploytypes does not have the same length
506 if ($different_haplotype_length){
507 $self->throw("The number of snp ids is $snp_number and ".
508 "the length of the family (". ($h+1) .") [".
509 $input_block->[$h]."] is ".
510 length $input_block->[$h], "\n");
514 =head2 _population_error
517 Title : _population_error
518 Usage : _population_error($self)
519 Function: use input_block and pop_freq test if the number of elements
520 match. If doesn't break and quit.
528 #------------------------
529 sub _population_error
{
530 #------------------------
534 my $input_block = $self->haplotype_block();
535 my $pop_freq = $self->pattern_freq();
537 #############################
539 #############################
540 my $pop_freq_elements_error = 0; # matrix bad formed
542 ##############################
543 # get parameters used to find
545 ##############################
546 my $number_of_families = scalar @
$input_block;
548 my $pf = 0; # number of elements on population frequency
549 my $frequency = 0; # population frequency
552 # check if the pop_freq array is well formed and if the number
553 # of elements fit with the number of families
555 #############################
556 # check population frequency
558 # - population frequency matrix need to be well formed
559 # - get the frequency
560 # - calculate number of families on pop_freq
561 #############################
563 for ($pf=0; $pf<$#$pop_freq+1; $pf++){
564 $frequency += $pop_freq->[$pf]->[1];
566 if ( scalar @
{$pop_freq->[$pf]} !=2){
567 $p_f_length = scalar @
{$pop_freq->[$pf]};
568 $pop_freq_elements_error = 1;
573 ###########################
575 ###########################
578 # The frequency shouldn't be greater than 1
580 $self->warn("The frequency for this set is $frequency (greater than 1)\n");
583 # the haplotype matix is not well formed
584 if ($pop_freq_elements_error){
585 $self->throw("the frequency matrix is not well formed\n".
586 "\nThe number of elements for pattern ".($pf+1)." is ".
588 "It should be 2 for pattern \"@{$pop_freq->[$pf]}\"\n".
589 "\nFormat should be:\n".
590 "haplotype_id\t frequency\n"
594 # the size does not fit on pop_freq array
595 # with the one in haplotype (input_block)
596 if ($pf != $number_of_families) {
597 $self->throw("The number of patterns on frequency array ($pf)\n".
598 "does not fit with the number of haplotype patterns on \n".
599 "haplotype array ($number_of_families)\n");
607 Usage : _do_it($self)
608 Function: Process the input generating the results.
615 #------------------------
617 #------------------------
621 # first we are goinf to define here all variables we are going to use
622 $self -> {'w_hap'} = [];
623 $self -> {'w_pop_freq'} = dclone
( $self ->pattern_freq() );
624 $self -> {'deg_pattern'} = {};
625 $self -> {'snp_type'} = {}; # type of snp on the set. see below
626 $self -> {'alleles_number'} = 0; # number of variations (biallelic,...)
627 $self -> {'snp_type_code'} = [];
628 $self -> {'ht_type'} = []; # store the snp type used on the htSet
629 $self -> {'split_hap'} = [];
630 $self -> {'snp_and_code'} = [];
633 # we classify the SNP under snp_type
634 $self->{snp_type
}->{useful_snp
} = dclone
( $self ->snp_ids() );
635 $self->{snp_type
}->{deg_snp
} = []; # deg snp
636 $self->{snp_type
}->{silent_snp
} = []; # not a real snp
638 # split the haplotype
639 _split_haplo
($self);
641 # first we convert to upper case the haplotype
642 # to make A the same as a for comparison
643 _to_upper_case
( $self -> {w_hap
} );
645 #######################################################
646 # check if any SNP has indetermination. If any SNP has
647 # indetermination this value will be removed.
648 #######################################################
649 _remove_deg
( $self );
651 #######################################################
652 # depending of the families you use some SNPs can be
653 # silent. This silent SNP's are not used on the
654 # creation of tags and has to be skipped from the
656 #######################################################
657 _rem_silent_snp
( $self );
659 #######################################################
660 # for the remaining SNP's we have to check if two
661 # families have the same value. If this is true, the families
662 # will produce the same result and therefore we will not find
663 # any pattern. So, the redundant families need to be take
664 # away from the analysis. But also considered for a further
667 # When we talk about a normal haplotype blocks this situation
668 # makes no sense but if we remove one of the snp because the
669 # degeneration two families can became the same.
670 # these families may be analised on a second round
671 #######################################################
673 _find_deg_pattern
( $self );
675 #################################################################
676 # if the pattern list length is different to the lenght of the w_hap
677 # we can tell that tow columns have been considered as the same one
678 # and therefore we have to start to remove the values.
679 # remove all columns with degeneration
681 # For this calculation we don't use the pattern frequency.
682 # All patterns are the same, This selection makes
683 # sense when you have different frequency.
685 # Note: on this version we don't classify the haplotype by frequency
686 # but if you need to do it. This is the place to do it!!!!
688 # In reality you don't need to sort the values because you will remove
689 # the values according to their values.
691 # But as comes from a hash, the order could be different and as a
692 # consequence the code generate on every run of the same set could
693 # differ. That is not important. In fact, does not matter but could
695 #################################################################
697 my @tmp =sort { $a <=> $b}
698 keys %{$self -> {deg_pattern
}}; # just count the families
700 # if the size of the list is different to the size of the degenerated
701 # family. There is degeneration. And the redundancies will be
703 if($#tmp != $#{$self -> { w_hap } } ){
704 _keep_these_patterns
($self->{w_hap
}, \
@tmp);
705 _keep_these_patterns
($self->{w_pop_freq
}, \
@tmp);
708 #################################################################
709 # the steps made before about removing snp and cluster families
710 # are just needed pre-process the haplotype before.
712 # Now is when the fun starts.
715 # once we have the this minimal matrix, we have to calculate the
716 # max multipliticy for the values. The max number of alleles found
717 # on the set. A normal haplotype is biallelic but we can not
718 # reject multiple variations.
719 ##################################################################
721 _alleles_number
( $self );
723 ##################################################################
724 # Now we have to convert the haplotype into number
731 # one haplotype like this transformed into number produce this result
738 ##################################################################
740 _convert_to_numbers
( $self );
742 ###################################################################
743 # The next step is to calculate the type of the SNP.
744 # This process is made based on the position of the SNP, the value
745 # and its multiplicity.
746 ###################################################################
748 _snp_type_code
( $self );
750 ###################################################################
751 # now we have all information we need to calculate the haplotype
753 ###################################################################
757 ###################################################################
760 # all SNP have a code. but if the SNP is not used this code must
761 # be zero in case of silent SNP. This looks not to informative
762 # because all the information is already there. But this method
763 # compile the full set.
764 ###################################################################
766 _snp_and_code_summary
( $self );
772 Usage : $obj->input_block()
773 Function: returns input block
774 Returns : reference to array of array
780 #------------------------
782 #------------------------
785 return $self -> {input_block
};
791 Usage : $obj->hap_length()
792 Function: get numbers of SNP on the haplotype
799 #------------------------
801 #------------------------
804 return scalar @
{$self -> {'_snp_ids'}};
811 Usage : $obj->pop_freq()
812 Function: returns population frequency
813 Returns : reference to array
819 #------------------------
821 #------------------------
824 return $self -> {pop_freq
}
832 Usage : $obj->deg_snp()
833 Function: returns snp_removes due to indetermination on their values
834 Returns : reference to array
840 #------------------------
842 #------------------------
844 return $self -> {snp_type
} ->{deg_snp
};
852 Usage : $obj->snp_type()
853 Function: returns hash with SNP type
854 Returns : reference to hash
860 #------------------------
862 #------------------------
864 return $self -> {snp_type
};
872 Usage : $obj->silent_snp()
873 Function: some SNP's are silent (not contibuting to the haplotype)
874 and are not considering for this analysis
875 Returns : reference to a array
881 #------------------------
883 #------------------------
885 return $self -> {snp_type
} ->{silent_snp
};
893 Usage : $obj->useful_snp()
894 Function: returns list of SNP's that are can be used as htSNP. Some
895 of them can produce the same information. But this is
897 Returns : reference to a array
903 #------------------------
905 #------------------------
907 return $self -> {snp_type
} ->{useful_snp
};
915 Usage : $obj->ht_type()
916 Function: every useful SNP has a numeric code dependending of its
917 value and position. For a better description see
918 description of the module.
919 Returns : reference to a array
925 #------------------------
927 #------------------------
929 return $self -> {ht_type
};
935 Usage : $obj->ht_set()
936 Function: returns the minimal haplotype in numerical format. This
937 haplotype contains the maximal information about the
938 haplotype variations but with no redundancies. It's the
939 minimal set that describes the haplotype.
940 Returns : reference to an array of arrays
946 #------------------------
948 #------------------------
950 return $self -> {w_hap
};
956 Title : snp_type_code
957 Usage : $obj->snp_type_code()
958 Function: returns the numeric code of the SNPs that need to be
959 tagged that correspond to the SNP's considered in ht_set.
960 Returns : reference to an array
966 #------------------------
968 #------------------------
970 return $self -> {snp_type_code
};
977 Usage : $obj->snp_and_code()
978 Function: Returns the full list of SNP's and the code associate to
979 them. If the SNP belongs to the group useful_snp it keep
980 this code. If the SNP is silent the code is 0. And if the
981 SNP is degenerated the code is -1.
982 Returns : reference to an array of array
988 #------------------------
990 #------------------------
992 return $self -> {'snp_and_code'};
999 Usage : $obj->deg_pattern()
1000 Function: Returns the a list with the degenerated haplotype.
1001 Sometimes due to degeneration some haplotypes looks
1002 the same and if we don't remove them it won't find
1004 Returns : reference to a hash of array
1010 #------------------------
1012 #------------------------
1015 return $self -> {'deg_pattern'};
1022 Usage : $obj->split_hap()
1023 Function: simple representation of the haplotype base by base
1024 Same information that input haplotype but base based.
1025 Returns : reference to an array of array
1031 #------------------------
1033 #------------------------
1035 return $self -> {'split_hap'};
1040 Title : _split_haplo
1041 Usage : _split_haplo($self)
1042 Function: Take a haplotype and split it into bases
1049 #------------------------
1051 #------------------------
1054 my $in = $self ->{'_haplotype_block'};
1055 my $out = $self ->{'w_hap'};
1057 # split every haplotype and store the result into $out
1059 push @
$out, [split (//,$_)];
1062 $self -> {'split_hap'} = dclone
($out);
1065 # internal method to convert the haplotype to uppercase
1068 =head2 _to_upper_case
1071 Title : _to_upper_case
1072 Usage : _to_upper_case()
1073 Function: make SNP or in-dels Upper case
1080 #------------------------
1081 sub _to_upper_case
{
1082 #------------------------
1085 foreach my $aref (@
$arr){
1086 foreach my $value (@
{$aref} ){
1097 Usage : _remove_deg()
1098 Function: when have a indetermination or strange value this SNP
1100 Returns : haplotype family set and degeneration list
1101 Args : ref to an AoA and a ref to an array
1106 #------------------------
1108 #------------------------
1111 my $hap = $self->{w_hap
};
1112 my $snp = $self->{snp_type
}->{useful_snp
};
1113 my $deg_snp = $self->{snp_type
}->{deg_snp
};
1115 my $rem = []; # take the position of the array to be removed
1117 # first we work on the columns we have void values
1118 $rem = _find_indet
($hap,$rem); # find degenerated columns
1122 # remove column on haplotype
1123 _remove_col
($hap,$rem); # remove list
1125 # now remove the values from SNP id
1126 _remove_snp_id
($snp,$deg_snp,$rem); # remove list
1131 =head2 _rem_silent_snp
1134 Title : _rem_silent_snp
1135 Usage : _rem_silent_snp()
1136 Function: there is the remote possibilty that one SNP won't be a
1137 real SNP on this situation we have to remove this SNP,
1138 otherwise the program won't find any tag
1140 Args : ref to an AoA and a ref to an array
1145 #------------------------
1146 sub _rem_silent_snp
{
1147 #------------------------
1150 my $hap = $self->{w_hap
};
1151 my $snp = $self->{snp_type
}->{useful_snp
};
1152 my $silent_snp = $self->{snp_type
}->{silent_snp
};
1154 my $rem = []; # store the positions to be removed
1156 #find columns with no variation on the SNP, Real snp?
1157 $rem = _find_silent_snps
($hap);
1161 # remove column on haplotype
1162 _remove_col
($hap,$rem);
1164 # remove the values from SNP id
1165 _remove_snp_id
($snp,$silent_snp,$rem);
1170 =head2 _find_silent_snps
1173 Title : _find_silent_snps
1175 Function: list of snps that are not SNPs. All values for that
1176 SNPs on the set is the same one. Look stupid but can
1177 happend and if this happend you will not find any tag
1184 #------------------------
1185 sub _find_silent_snps
{
1186 #------------------------
1189 my $list =[]; # no snp list;
1191 # determine the number of snp by the length of the first row.
1192 # we assume that the matrix is squared.
1193 my $colsn= @
{$arr->[0]};
1195 for (my $i=0;$i<$colsn;$i++){
1196 my $different =0; # check degeneration
1198 for my $r (1..$#$arr){
1199 if($arr->[0][$i] ne $arr->[$r][$i]){
1219 Function: find column (SNP) with invalid or degenerated values
1220 and store this values into the second parameter supplied.
1222 Args : ref to AoA and ref to an array
1227 #------------------------
1229 #------------------------
1230 my ($arr, $list)=@_;
1232 foreach my $i(0..$#$arr){
1233 foreach my $j(0..$#{$arr->[$i]}){
1234 unless ($arr->[$i][$j] =~ /[ACTG-]/){
1239 my $found =0; # check if already exist the value
1240 foreach my $k(0..$#$list){
1241 $found =1 if ($list->[$k] eq $j);
1252 @
$list = sort { $a <=> $b} @
$list;
1261 Function: remove columns contained on the second array from
1264 Args : array of array reference and array reference
1269 #------------------------
1271 #------------------------
1274 foreach my $col (reverse @
$rem){
1275 splice @
$_, $col, 1 for @
$arr;
1280 =head2 _remove_snp_id
1282 Title : _remove_snp_id
1284 Function: remove columns contained on the second array from
1287 Args : array of array reference and array reference
1292 #------------------------
1294 #------------------------
1295 my ($arr,$removed,$rem_list)=@_;
1297 push @
$removed, splice @
$arr, $_, 1 foreach reverse @
$rem_list;
1301 =head2 _find_deg_pattern
1303 Title : _find_deg_pattern
1305 Function: create a list with the degenerated patterns
1312 #------------------------
1313 sub _find_deg_pattern
{
1314 #------------------------
1317 my $arr = $self ->{w_hap
}; # the working haplotype
1318 my $list = $self ->{'deg_pattern'}; # degenerated patterns
1320 # we have to check all elements
1321 foreach my $i(0..$#$arr){
1322 # is the element has not been used create a key
1323 unless ( _is_on_hash
($list,\
$i) ) {
1327 foreach my $j($i+1..$#$arr){
1328 my $comp = compare_arrays
($arr->[$i],$arr->[$j]);
1331 # as we have no elements we push this into the list
1332 # check for the first element
1333 my $key = _key_for_value
($list,\
$i);
1335 push (@
{$list->{$key}},$j);
1344 #------------------------
1346 #------------------------
1347 my($hash,$value)=@_;
1349 foreach my $key (keys %$hash){
1350 if( _is_there
(\@
{$hash->{$key}},$value)){
1356 #------------------------
1358 #------------------------
1359 my($hash,$value)=@_;
1361 foreach my $key (keys %$hash){
1362 if( _is_there
(\@
{$hash->{$key}},$value)){
1368 #------------------------
1370 #------------------------
1374 foreach my $el (@
$arr){
1375 if ($el eq $$value){
1382 =head2 _keep_these_patterns
1385 Title : _keep_these_patterns
1387 Function: this is a basic approach, take a LoL and a list,
1388 keep just the columns included on the list
1390 Args : an AoA and an array
1395 #------------------------
1396 sub _keep_these_patterns
{
1397 #------------------------
1400 # by now we just take one of the repetitions but you can weight
1401 # the values by frequency
1405 foreach my $k (@
$list){
1406 push @outValues, $arr->[$k];
1409 #make arr to hold the new values
1410 @
$arr= @
{dclone
(\
@outValues)};
1415 =head2 compare_arrays
1418 Title : compare_arrays
1420 Function: take two arrays and compare their values
1421 Returns : 1 if the two values are the same
1422 0 if the values are different
1423 Args : an AoA and an array
1428 #------------------------
1429 sub compare_arrays
{
1430 #------------------------
1431 my ($first, $second) = @_;
1432 return 0 unless @
$first == @
$second;
1433 for (my $i = 0; $i < @
$first; $i++) {
1434 return 0 if $first->[$i] ne $second->[$i];
1440 =head2 _convert_to_numbers
1443 Title : _convert_to_numbers
1444 Usage : _convert_to_numbers()
1445 Function: tranform the haplotype into numbers. before to do that
1446 we have to consider the variation on the set.
1448 Args : ref to an AoA and a ref to an array
1453 #------------------------
1454 sub _convert_to_numbers
{
1455 #------------------------
1458 my $hap_ref = $self->{w_hap
};
1459 my $mm = $self->{alleles_number
};
1461 # the first element is considered as zero. The first modification
1462 # is consider as one and so on.
1464 my $length = @
{ @
$hap_ref[0]}; #length of the haplotype
1466 for (my $c = 0; $c<$length;$c++){
1470 for my $r (0..$#$hap_ref){
1472 push @al,$hap_ref->[$r][$c]
1473 unless _is_there
(\
@al,\
$hap_ref->[$r][$c]);
1475 $hap_ref->[$r][$c] = get_position
(\
@al,\
$hap_ref->[$r][$c]);
1481 =head2 _snp_type_code
1484 Title : _snp_type_code
1487 we have to create the snp type code for each version.
1488 The way the snp type is created is the following:
1490 we take the number value for every SNP and do the
1491 following calculation
1493 let be a SNP set as follow:
1500 on this case the situation is:
1502 sum (value * multiplicity ^ position) for each SNP
1504 0 * 3 ^ 0 + 1 * 3 ^ 1 + 1 * 3 ^ 2 = 12
1505 0 * 3 ^ 0 + 1 * 3 ^ 1 + 2 * 3 ^ 2 = 21
1512 #------------------------
1514 #------------------------
1517 my $hap = $self->{w_hap
};
1518 my $arr = $self->{snp_type_code
};
1519 my $al = $self->{alleles_number
};
1521 my $length = @
{ $hap->[0]}; #length of the haplotype
1523 for (my $c=0; $c<$length; $c++){
1524 for my $r (0..$#$hap){
1525 $arr->[$c] += $hap->[$r][$c] * $al ** $r;
1530 #################################################
1531 # return the position of an element in one array
1532 # The element is always present on the array
1533 #################################################
1535 #------------------------
1537 #------------------------
1539 my($array, $value)=@_;
1541 for my $i(0..$#$array) {
1542 if ($array->[$i] eq $$value){
1550 =head2 _alleles_number
1553 Title : _alleles_number
1555 Function: calculate the max number of alleles for a haplotype and
1556 if the number. For each SNP the number is stored and the
1557 max number of alleles for a SNP on the set is returned
1558 Returns : max number of alleles (a scalar storing a number)
1564 #------------------------
1565 sub _alleles_number
{
1566 #------------------------
1570 my $hap_ref = $self ->{w_hap
}; # working haplotype
1572 my $length = @
{ @
$hap_ref[0]}; # length of the haplotype
1574 for (my $c = 0; $c<$length;$c++){
1578 for my $r (0..$#$hap_ref){
1579 $alleles{ $hap_ref->[$r][$c] } =1; # new key for every new snp
1582 # if the number of alleles for this column is
1583 # greater than before set $m value as allele number
1584 if ($self->{alleles_number
} < keys %alleles) {
1585 $self->{alleles_number
} = keys %alleles;
1596 Function: calculate the minimal set that contains all information of the
1599 Args : ref to an AoA and a ref to an array
1604 #------------------------
1606 #------------------------
1609 my $hap = $self->{'w_hap'};
1610 my $type = $self->{'snp_type_code'};
1611 my $set = $self->{'ht_type'};
1612 my $out = []; # store the minimal set
1614 my $nc=0; # new column for the output values
1616 # pass for every value of the snp_type_code
1617 for my $c (0..$#$type){
1621 # every new value (not present) is pushed into set
1622 if ( ! _is_there
( $set,\
$type->[$c] ) ){
1623 push @
$set, $type->[$c];
1627 for my $r(0..$#$hap){
1628 #save value of the snp for every SNP
1629 $out->[$r][$nc]= $hap->[$r][$c];
1633 if ($exist){ $nc++ };
1636 @
$hap = @
{dclone
$out};
1639 =head2 _snp_and_code_summary
1641 Title : _snp_and_code_summary
1642 Usage : _snp_and_code_summary()
1643 Function: compile on a list all SNP and the code for each. This
1644 information can be also obtained combining snp_type and
1645 snp_type_code but on these results the information about
1646 the rest of SNP's are not compiled as table.
1648 0 will be silent SNPs
1649 -1 are degenerated SNPs
1650 and the rest of positive values are the code for useful SNP
1653 Args : ref to an AoA and a ref to an array
1658 #------------------------
1659 sub _snp_and_code_summary
{
1660 #------------------------
1663 my $snp_type_code = $self->{'snp_type_code'};
1664 my $useful_snp = $self->{'snp_type'}->{'useful_snp'};
1665 my $silent_snp = $self->{'snp_type'}->{'silent_snp'};
1666 my $deg_snp = $self->{'snp_type'}->{'deg_snp'};
1667 my $snp_ids = $self->snp_ids();
1668 my $snp_and_code = $self->{'snp_and_code'};
1670 # walk all SNP's and generate code for each
1672 # do a practical thing. Consider all snp silent
1673 foreach my $i (0..$#$snp_ids){
1675 # assign zero to silent
1679 foreach my $j (0..$#$useful_snp){
1680 if ($snp_ids->[$i] eq $useful_snp->[$j]){
1681 $value = $snp_type_code->[$j];
1686 # assign -1 to degenerated
1687 foreach my $j (0..$#$deg_snp){
1688 if ($snp_ids->[$i] eq $deg_snp->[$j]){
1694 push @
$snp_and_code, [$snp_ids->[$i], $value];