=-------------------------------------22/07/09-------------------------------------=
To get the total number of reads for each sample, I created a bash script called reads.
Usage: reads [options] [file]
It outputs the total number of reads.
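The core of the counting can be sketched in Ruby (the actual script is bash; the "tag, then read count" column layout of the clustered_tags files is an assumption, not confirmed):

```ruby
# A minimal sketch of what the `reads` script computes, assuming each line
# of a clustered_tags file holds a tag followed by its read count
# (that column layout is a guess).
def total_reads(lines)
  # Sum the second whitespace-separated column across all lines.
  lines.sum { |line| (line.split[1] || "0").to_i }
end
```

For example, `total_reads(File.readlines("clustered_tags_DB2.txt"))` would give the per-sample total.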
-> For total reads per sample
clustered_tags_db2h_me_ja_2.txt
clustered_tags_DB2.txt
clustered_tags_SC3.txt

-> For single reads per sample
clustered_tags_db2h_me_ja_2.txt
clustered_tags_DB2.txt
clustered_tags_SC3.txt
=-------------------------------------23/07/09-------------------------------------=
To compare the different samples I needed to install R.
Installed R (statistical language) on my Mac.
Installed an external library called diffGeneAnalysis from bioconductor.org
link: http://www.bioconductor.org/packages/release/bioc/html/diffGeneAnalysis.html
Downloaded the Mac OS X binary version.
To install the package I ran the following command:
R CMD INSTALL diffGeneAnalysis
Installed a dependency of diffGeneAnalysis ~ minpack.lm
link: http://cran.r-project.org/web/packages/minpack.lm/index.html
Downloaded the source version, as the binary version did not work on 10.5 (Leopard).
To install the package I ran the following command:
R CMD INSTALL -l /opt/local/lib/R/library minpack.lm
=-------------------------------------27/07/09-------------------------------------=
Met with Urmi to talk about how to find the differentially expressed genes.
Was told to first compare two data files:
- 1. Compare the tags in both data files
- 2. If they match, output their number of reads as such:

But Urmi didn't tell me how to normalize the data to be able to find the
differentially expressed genes. She did give me a research paper to look through;
apparently the paper uses the same methods I need.
=-------------------------------------28/07/09-------------------------------------=
The previous R packages are useless, because they are meant for microarray analysis.
Creating a bash script to read through and compare two data files.
The bash script was too slow.
=-------------------------------------29/07/09-------------------------------------=
Writing a C program to replace the bash script "common_tags".
=-------------------------------------30/07/09-------------------------------------=
The C program's algorithm is too slow; it took 5 hours to sort through 2 data files.
Writing a Perl script to replace the C version of "common_tags".
=-------------------------------------31/07/09-------------------------------------=
The Perl version of "common_tags" sorts and outputs the data in 2~3 seconds.
Getting the programming right took me a while, though.
The algorithm used in the C program was too expensive (in computing resources);
I should have used a hash table to sort the tags. But because C doesn't have
a built-in hash table, I chose Perl instead.
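The hash-based idea behind "common_tags" looks roughly like this (a Ruby sketch rather than the actual Perl script; the "tag, reads" line layout is an assumption):

```ruby
# Load a "tag  reads" file into a hash: O(1) lookups per tag replace
# the pairwise scan the C version was doing.
def load_tags(lines)
  lines.each_with_object({}) do |line, h|
    tag, reads = line.split
    h[tag] = reads.to_i if tag && reads
  end
end

# Return [tag, reads_in_a, reads_in_b] for every tag present in both files.
def common_tags(lines_a, lines_b)
  a = load_tags(lines_a)
  b = load_tags(lines_b)
  (a.keys & b.keys).map { |tag| [tag, a[tag], b[tag]] }
end
```

With hashes the comparison is linear in the total number of tags, which is why the Perl version finished in seconds where the C scan took hours.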
Still thinking of how to normalize the data.
=-------------------------------------01/08/09-------------------------------------=
Decided to do some Student's t-tests on the samples using R, and I got the following
For clustered_tags_DB2.txt:
--------------------------------------------
> DB2_data<-read.table("clustered_tags_DB2.txt",header=F);
t = 24.5161, df = 117080, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:

For clustered_tags_SC3.txt:
--------------------------------------------
> SC3_data<-read.table("clustered_tags_SC3.txt",header=F);
t = 46.7748, df = 395165, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:

For clustered_tags_db2h_me_ja_2.txt:
--------------------------------------------
> MeJa_data<-read.table("clustered_tags_db2h_me_ja_2.txt", header=F);
t = 36.8054, df = 225738, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
=-------------------------------------13/08/09-------------------------------------=
Disregarding all previous data; Urmi uploaded a new set of data, placed in /data.
Created an analysis.rb script that has the functionality of common_tags; it can
also normalize the data. According to Urmi the algorithm is simply:
Number of tags / Total number of tags (per million)
The aim is to extend analysis.rb to do Student's t-tests, and much more...
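Reading Urmi's formula as the usual reads-per-million normalization (that interpretation is mine, not stated in the notes), the step is tiny:

```ruby
# Reads-per-million normalization: a tag's read count divided by the
# sample's total reads, scaled to one million. Interpreting
# "Number of tags / Total number of tags (per million)" as RPM is an
# assumption on my part.
def normalize(reads_for_tag, total_reads)
  reads_for_tag.to_f / total_reads * 1_000_000
end
```

This makes read counts comparable across samples that were sequenced to different depths.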
Meeting with Urmi tomorrow @ 11AM.
=-------------------------------------13/08/09-------------------------------------=
Meeting notes with Urmi.
Talked about the possibility of finding the differentially expressed genes by
printing out a giant table.
=-------------------------------------25/08/09-------------------------------------=
Managed to code analysis.rb to print out the normalized data in one giant table,
as suggested by Urmi.

Made the Differentially Expressed Genes table.

All Results - One Giant Table
=-------------------------------------26/08/09-------------------------------------=
Urmi uploaded revisions for:

Previous versions of the above-mentioned data are deprecated...

Revised all data; printed out a giant table of all the normalized data to find the
differentially expressed genes.
=-------------------------------------31/08/09-------------------------------------=
The giant table I created was completely wrong; the whole approach was flawed.
I used the file with the most tags, assuming that all the other files' tags would
be covered by it. However, as Thomas pointed out, the results in DB2 were not
right, because they were all zero, which means there were tags that the "largest
tag file" didn't have. This is very dangerous and could potentially hide
data crucial to Thomas's work.

Attempting to fix the problem mentioned above, and also automate the code so it's
=-------------------------------------05/09/09-------------------------------------=
Managed to automate comparing the data and printing it into a giant table. It
works by first reading through all the files to collect every tag into a hash,
tag_list; then storing each data file's tags and reads in hashes; comparing them
against tag_list; and printing everything out in one giant table. The order in
which the reads are displayed is determined by the "order" file.
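The steps above can be sketched as follows (a simplified Ruby version, not the actual analysis.rb; taking the union of tags is the fix for the "largest file" bug, and the zero-fill for absent tags plus the sorted ordering are assumptions in place of the "order" file):

```ruby
# Build the giant table from samples: a hash of sample name => { tag => reads }.
# Every tag seen in ANY file goes into tag_list, so no sample's tags are
# silently dropped; a tag absent from a sample is reported as 0.
def giant_table(samples)
  tag_list = samples.values.flat_map(&:keys).uniq.sort
  header = ["tag"] + samples.keys
  rows = tag_list.map do |tag|
    [tag] + samples.keys.map { |name| samples[name].fetch(tag, 0) }
  end
  [header] + rows
end
```

Each returned row can then be joined with tabs and printed to produce the table.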
Got something to work at last! :)
=------------------------------------12/09/09-------------------------------------=
Optimized the code a little; nearly lost some of the work because of vim playing up...
Added some more comments to explain what the code is doing. Anticipating the
upcoming results for the Solexa data, so I can do Student's t-tests on them.