bio-intern-09/notes.txt

   1 =-------------------------------------22/07/09-------------------------------------=
   2 To get the total number of reads for each sample I created a bash script called reads.
   3 Usage: reads [options] [file]
   4 It outputs the total number of reads in the file.
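
The reads script itself is not recorded in these notes. A minimal Ruby sketch of the same counting, assuming each line of a clustered_tags file is "TAG COUNT" separated by whitespace, and assuming "single reads" means tags that were seen exactly once. The name read_totals is made up for illustration.

```ruby
# Hypothetical helper, not the original bash "reads" script.
# Assumes each line looks like "TAG<whitespace>COUNT".
def read_totals(lines)
  total = 0
  singles = 0
  lines.each do |line|
    _tag, count = line.split
    next if count.nil?        # skip blank or one-column lines
    n = Integer(count)
    total += n
    singles += 1 if n == 1    # assumption: "single reads" = tags seen once
  end
  { total: total, singles: singles }
end

# Toy input, not data from the real files.
p read_totals(["ACGT 3\n", "TTTT 1\n", "\n", "GGGG 1\n"])
```

On a real file this could be called as read_totals(File.foreach("clustered_tags_DB2.txt")).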
   5
   6 -> For total reads per sample
   7 clustered_tags_db2h_me_ja_2.txt
   8 4311502
   9
  10 clustered_tags_DB2.txt
  11 3104929
  12
  13 clustered_tags_SC3.txt
  14 6470221
  15
  16
  17 -> For single reads per sample
  18 clustered_tags_db2h_me_ja_2.txt
  19 451478
  20
  21 clustered_tags_DB2.txt
  22 234162
  23
  24 clustered_tags_SC3.txt
  25 790332
  26
  27
  28 =-------------------------------------23/07/09-------------------------------------=
  29 To compare the different samples I needed to install R.
  30 Installed R - Statistical Language on my Mac
  31
  32 Installed an external library called diffGeneAnalysis from bioconductor.org
  33 link: http://www.bioconductor.org/packages/release/bioc/html/diffGeneAnalysis.html
  34 Downloaded the MacOSX Binary version
  35 To install the package I ran the following command:
  36 R CMD INSTALL diffGeneAnalysis
  37
  38 Installed a dependency of diffGeneAnalysis ~ minpack.lm
  39 link: http://cran.r-project.org/web/packages/minpack.lm/index.html
  40 Downloaded the source version as the binary version did not work on Mac OS X 10.5 Leopard
  41 To install the package I ran the following command:
  42 R CMD INSTALL -l /opt/local/lib/R/library minpack.lm
  43
  44 =-------------------------------------27/07/09-------------------------------------=
  45 Met with Urmi to talk about how to find the differentially expressed genes in
  46 the data.
  47
  48 Was told to first compare two data files:
  49 - 1. Compare the tags on both data files
  50 - 2. If match, then output their number of reads as such:
  51 Tag           Reads_1   Reads_2
  52 ACGTACGTGAC   1331      2144
  53
  54 Urmi didn't tell me how to normalize the data in order to find the
  55 differentially expressed genes, but she did give me a research paper to look
  56 through; apparently it describes the methods I need.
  57
  58
  59 =-------------------------------------28/07/09-------------------------------------=
  60 The R packages installed earlier are useless here because they are for microarray analysis.
  61
  62 Creating Bash script to read through and compare two data files.
  63
  64 The bash script was too slow.
  65
  66 =-------------------------------------29/07/09-------------------------------------=
  67 Writing a C program to replace the bash script "common_tags"
  68
  69
  70 =-------------------------------------30/07/09-------------------------------------=
  71 The C program's algorithm is too slow: it took 5 hours to sort through 2 data files.
  72 Writing Perl script to replace the C version of "common_tags"
  73
  74 =-------------------------------------31/07/09-------------------------------------=
  75 The Perl version of "common_tags" sorts and outputs the data in 2~3 seconds.
  76 Getting the program right took me a while.
  77 The algorithm used in the C program was too expensive (in computer resources);
  78 I should have used a hash table to match the tags. But because C doesn't have
  79 a built-in hash table, I chose Perl instead.
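
The Perl script is not included in these notes. A rough Ruby sketch of the hash-table idea, assuming "TAG COUNT" lines in each file: load each file into a hash, then look up every tag of the first file in the second. The names load_tags and common_tags are illustrative only.

```ruby
# Hypothetical re-creation of the common_tags idea, not the original Perl.
# Parse "TAG COUNT" lines into a hash of tag => reads (assumed file layout).
def load_tags(lines)
  lines.each_with_object({}) do |line, table|
    tag, count = line.split
    table[tag] = Integer(count) if count
  end
end

# Return [tag, reads_1, reads_2] for every tag present in both samples.
def common_tags(reads_1, reads_2)
  rows = []
  reads_1.each do |tag, count_1|
    count_2 = reads_2[tag]    # one hash lookup instead of rescanning file 2
    rows << [tag, count_1, count_2] if count_2
  end
  rows
end

# Toy input; only the first tag appears in both samples.
a = load_tags(["ACGTACGTGAC 1331", "TTTTTTTTTTT 5"])
b = load_tags(["ACGTACGTGAC 2144", "GGGGGGGGGGG 9"])
common_tags(a, b).each { |row| puts row.join("\t") }
```

Each tag costs one lookup on average, so the work grows with the number of tags rather than with the product of the two file sizes, which is the likely reason a nested-loop comparison took hours.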
  80
  81 Still thinking of how to normalize the data.
  82
  83 =-------------------------------------01/08/09-------------------------------------=
  84 Decided to do some Student's t-tests on the samples using R, and I got the following
  85 results:
  86
  87 For clustered_tags_DB2.txt:
  88 --------------------------------------------
  89 > DB2_data<-read.table("clustered_tags_DB2.txt",header=F);
  90 > attach(DB2_data);
  91 > names(DB2_data);
  92 [1] "V1" "V2"
  93 > t.test(V2,y=NULL);
  94
  95         One Sample t-test
  96
  97 data:  V2
  98 t = 24.5161, df = 117080, p-value < 2.2e-16
  99 alternative hypothesis: true mean is not equal to 0
 100 95 percent confidence interval:
 101  24.39935 28.63964
 102 sample estimates:
 103 mean of x
 104  26.51950
 105
 106
 107 For clustered_tags_SC3.txt:
 108 --------------------------------------------
 109 > SC3_data<-read.table("clustered_tags_SC3.txt",header=F);
 110 > attach(SC3_data);
 111 > names(SC3_data);
 112 [1] "V1" "V2"
 113 > t.test(V2,y=NULL);
 114
 115         One Sample t-test
 116
 117 data:  V2
 118 t = 46.7748, df = 395165, p-value < 2.2e-16
 119 alternative hypothesis: true mean is not equal to 0
 120 95 percent confidence interval:
 121  15.68734 17.05951
 122 sample estimates:
 123 mean of x
 124  16.37343
 125
 126
 127 For clustered_tags_db2h_me_ja_2.txt:
 128 --------------------------------------------
 129 > MeJa_data<-read.table("clustered_tags_db2h_me_ja_2.txt", header=F);
 130 > attach(MeJa_data);
 131 > names(MeJa_data);
 132 [1] "V1" "V2"
 133 > t.test(V2,y=NULL);
 134
 135         One Sample t-test
 136
 137 data:  V2
 138 t = 36.8054, df = 225738, p-value < 2.2e-16
 139 alternative hypothesis: true mean is not equal to 0
 140 95 percent confidence interval:
 141  18.08241 20.11659
 142 sample estimates:
 143 mean of x
 144   19.0995
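
Note on what these calls compute: t.test(V2, y=NULL) is a one-sample test of the mean read count against 0, with t = mean / (sd / sqrt(n)) and df = n - 1. It does not compare one sample with another. A small Ruby sketch of that statistic on toy numbers (not data from these files); the function name is made up.

```ruby
# One-sample t statistic against a hypothesised mean mu
# (0 by default, matching R's t.test(x) with no second sample).
def one_sample_t(values, mu = 0.0)
  n = values.size
  mean = values.sum.to_f / n
  variance = values.sum { |v| (v - mean)**2 } / (n - 1)   # sample variance
  t = (mean - mu) / Math.sqrt(variance / n)
  { t: t, df: n - 1, mean: mean }
end

result = one_sample_t([1, 2, 3, 4, 5])   # toy counts only
puts format("t = %.4f, df = %d, mean = %.2f", result[:t], result[:df], result[:mean])
```

Since read counts are always positive, a tiny p-value here only says the mean count is non-zero.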
 145
 146
 147 =-------------------------------------13/08/09-------------------------------------=
 148 Disregarding all previous data; Urmi uploaded a new set of data. It is in the /data
 149 folder.
 150
 151 Created an analysis.rb script that has the functionality of common_tags and can
 152 also normalize the data. According to Urmi the algorithm is simply:
 153 Number of tags / Total number of tags (per million)
 154 The plan is to extend analysis.rb so that it can also do Student's
 155 t-tests, and much more...
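
A minimal Ruby sketch of that normalization, assuming "total number of tags" means the sum of all read counts in the sample and "per million" means multiplying the ratio by 1,000,000. This is an illustration, not the code in analysis.rb.

```ruby
# Scale each tag's read count to reads per million for one sample.
# Assumption: the total is the sum of all read counts in that sample.
def normalize_per_million(reads)
  total = reads.values.sum
  return {} if total.zero?
  reads.transform_values { |count| count * 1_000_000.0 / total }
end

sample = { "ACGTACGTGAC" => 1331, "TTTTTTTTTTT" => 669 }   # toy counts
normalize_per_million(sample).each { |tag, rpm| puts "#{tag}\t#{rpm.round(1)}" }
```

Scaling by each sample's own total makes counts comparable between samples that were sequenced to different depths.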
 156
 157 Meeting with Urmi tomorrow @ 11AM.
 158
 159
 160 =-------------------------------------13/08/09-------------------------------------=
 161 Meeting notes with Urmi.
 162 Talked about the possibility of finding the differentially expressed genes by
 163 printing out a giant table.
 164
 165 =-------------------------------------25/08/09-------------------------------------=
 166 Managed to get analysis.rb to print out the normalized data in one giant table
 167 as suggested by Urmi.
 168
 169 Made Differentially Expressed Genes table.
 170
 171 DB&SC
 172 DB_MeJa@0_5h~2h
 173 DB_MeJa2@2h~12h
 174 DB_MeJa3@0_5h~12h
 175 All Results - One Giant Table
 176
 177 =-------------------------------------26/08/09-------------------------------------=
 178 Urmi uploaded revisions for:
 179 DB_0_5h_MeJa
 180 DB_2h_MeJa
 181 SC1
 182 SC3
 183 Previous versions of the above data are deprecated.
 184
 185 Revised all data, printed out a Giant table of all the normalized data to find the
 186 differentially expressed genes.
 187
 188
 189
 190 =-------------------------------------31/08/09-------------------------------------=
 191 The giant table I created was completely wrong because the approach was wrong.
 192 I used the file with the most tags, assuming that it would cover every tag in the
 193 other files. However, as Thomas pointed out, the results in DB2 were not
 194 right because they were all zero, which means there were tags that the "largest
 195 tag file" didn't have. This is very dangerous and can potentially hide
 196 data crucial to Thomas's work.
 197
 198 Attempting to fix the problem mentioned above, and also automate the code so it's
 199 smarter!
 200
 201 =-------------------------------------05/09/09-------------------------------------=
 202 Managed to automate comparing the data and printing it in a giant table. It
 203 first reads through all the data files to collect every tag into a hash called
 204 tag_list, then stores each data file's tags and reads in hashes, compares them
 205 with the tag_list, and prints the result as a giant table. The order in which the
 206 reads are displayed is determined by the "order" file.
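
A rough Ruby sketch of that approach, not the actual analysis.rb. It assumes each sample is already loaded as a hash of tag => reads, uses a Set where the notes use a hash for tag_list, and takes the column order as a plain list standing in for the "order" file. Tags missing from a sample are printed as 0 instead of being dropped, which is the fix for the problem found on 31/08/09.

```ruby
require "set"

# samples: { sample_name => { tag => reads } }
# order:   sample names in column order (stands in for the "order" file)
def giant_table(samples, order)
  tag_list = Set.new
  samples.each_value { |reads| tag_list.merge(reads.keys) }   # union of all tags

  rows = [["Tag", *order]]
  tag_list.sort.each do |tag|
    # a tag absent from a sample is reported as 0 rather than left out
    rows << [tag, *order.map { |name| samples.fetch(name).fetch(tag, 0) }]
  end
  rows
end

# Toy data; sample names borrowed from the notes, counts invented.
samples = {
  "DB2" => { "AAAA" => 10, "CCCC" => 3 },
  "SC3" => { "CCCC" => 7, "GGGG" => 2 }
}
giant_table(samples, %w[DB2 SC3]).each { |row| puts row.join("\t") }
```

Building the tag list from every file means no single file has to contain all the tags.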
 207
 208 Got something to work at last! :)
 209
 210
 211 =-------------------------------------12/09/09-------------------------------------=
 212 Optimized the code a little; nearly lost some of the work because vim was playing up...
 213 Added some more comments to explain what the code is doing. Am anticipating the
 214 upcoming results for the Solexa data, and will do Student's t-tests on them.
 228