GREYLAG TODO LIST                                               -*-outline-*-


==============================================================================

OVERALL GOALS

1.  Replace SEQUEST at SIMR with something at least as good.
2.  Do better than SEQUEST for things that SIMR cares about.
3.  Showcase the strategy of implementing in Python with a C-ish inner loop.
4.  Try to take the best ideas from other similar programs.
5.  Make greylag a pedagogical artifact and foundation for further
    experimentation.

==============================================================================


MILESTONE M1:

* Good first impression
* Basic correctness
* Handles at least LCQ input
* Nonspecific cleavage only?
* Generates SQT output, usable in our pipeline, at least for non-N15 runs
* Decent performance/efficiency on our clusters

MILESTONE M2:

* Basic public website/source release/git archive
* Documentation (asciidoc/man pages)



TASK QUEUE

* MINI-GOAL: Get something working that can be tested against MM/SEQUEST
* MINI-GOAL: basic greylag-process usable on our cluster (no mods)


* Demo runs! (vs SEQUEST)
** observe memory use!
** no mod
** one mod
** ASFP (multiple mods)
** vs MyriMatch, if run completes

* Write release email
** What it does and doesn't yet do
** How to use it

* Change the way mod limit works?  ({1,4} feature?)

* Why aren't scores a little more stable?  (fast-math?)

* Add A1S2D1C3A1S2D1F4 marking form?

* Update docstrings

* Evaluate performance differences vs SEQUEST/MM/X!Tandem?

* Design and implement greylag master process (work manifests?)

* Compare greylag/SEQUEST/MM on test-myrimatch example (non-specific)
** look at MM (whole file)
** Note: SEQUEST parent tolerance differs

* Basic optimization
** look at callgrind output

* Further test case updates/adds
** enzymatic cleavage


* Verify that memory usage is no longer a problem
** maybe avoid spectrum name copies
** maybe avoid locus name copies
** instead of copying db sequences, use Python's?


= M1 =========================================================================



* do a cg-admin-rewritehist before publishing git archive?

* calculate Ion%?


= M2 =========================================================================

* Implement MM smart +3 model?
** Is it better?

* Try to generate a valid MyriMatch (bombs on boost random assertion)

* Try the MM precursor mass adjustment--much improvement?  even a good idea?

* Test tolerance monotonicity


* Examine DBValidate
** Design similar statistical evaluation
** Look at what we do here (paper)


* Add isotope jitter feature, for Orbitrap.

  X!Tandem considers one C13 if MH>1000, and one/two C13 if MH>1500.  Should we
  try to predict this based on the peptide sequence?  MH is probably close
  enough.  What does MyriMatch do?
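
  The MH-based rule quoted above could be sketched roughly as follows.  This is
  only an illustration: the function name c13_offsets is hypothetical (not
  existing greylag code), and "one/two C13" is read here as "up to two".

```python
# Hypothetical sketch; names are illustrative and not part of greylag.
C13_C12_DELTA = 1.0033548  # mass difference between C13 and C12, in Da


def c13_offsets(mh):
    """Return the parent-mass offsets (in Da) to consider for a given MH.

    Thresholds follow the rule quoted in the note: one C13 if MH > 1000,
    one or two C13 if MH > 1500.  Offset 0.0 (no C13) is always included.
    """
    if mh > 1500:
        max_c13 = 2
    elif mh > 1000:
        max_c13 = 1
    else:
        max_c13 = 0
    return [n * C13_C12_DELTA for n in range(max_c13 + 1)]
```

  A search could then try each offset when matching a spectrum's parent mass
  against candidate peptides.  Whether the thresholds should instead depend on
  the peptide sequence is the open question in the note.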


* Implement MyriMatch charge-calling algorithm?

* Implement MyriMatch deisotoping?


* Look more closely at specific MM vs GL match diffs.


* Figure out how to handle multiple residue mods (delta, isotope, etc)

* Clean up [,] (N/C-terminal) mass regime calculations

* Pass through the C++ code looking for counts that could conceivably overflow
** Fix or add assertions

* Add duplicate peptide masking optimization
** This will obviate the need to detect identical best matches at search time?
** Fix redundant peptide reporting


* Time a real mod search vs SEQUEST (and X!Tandem?).  Is the time reasonable?
  MAKE SURE PARAMS ARE COMPARABLE!  Ballpark correctness?

* Create direct DTASelect.txt output?
  (This seems to be sufficient to support most or all DTASelect output.)

* Make a tool to compare greylag vs SEQUEST results by spectrum.  Want gross
  statistics--how many id's are the same, different, missing, etc.  For each
  spectrum, want to see what each program did, and how many times the assigned
  locus was otherwise id'ed.
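
  The gross-statistics part of such a tool might look like the sketch below.
  It assumes each program's results have already been reduced to a dict
  mapping spectrum name to best peptide; that input shape, the function name
  compare_ids, and the category labels are all assumptions for illustration.

```python
from collections import Counter


def compare_ids(greylag_ids, sequest_ids):
    """Tally agreement between two {spectrum name: peptide} mappings.

    A spectrum absent from one mapping counts as missing for that program.
    """
    counts = Counter()
    for spectrum in set(greylag_ids) | set(sequest_ids):
        g = greylag_ids.get(spectrum)
        s = sequest_ids.get(spectrum)
        if g is None:
            counts["missing in greylag"] += 1
        elif s is None:
            counts["missing in SEQUEST"] += 1
        elif g == s:
            counts["same"] += 1
        else:
            counts["different"] += 1
    return counts
```

  The per-spectrum view and the per-locus counts asked for in the note would
  need the full result records rather than just the best peptide.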

* Investigate identification differences between greylag and SEQUEST.

* Careful timing and correctness check for
  /n/proteomics/mkc/HsProA-Control_S100_Ti_1_H_2006-03-03_wSHUFFLED-greylag

* Design and implement tracing of mass regime/PCA/fixed and non-fixed
  deltas/etc into output file.  Try to stay compatible with X!Tandem.

* PPM error tolerances (MyriMatch doesn't implement this?)
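
  For reference, a PPM tolerance scales with the mass being matched instead
  of being a fixed number of daltons.  A minimal sketch, with hypothetical
  helper names (ppm_window, within_ppm) that are not part of greylag:

```python
def ppm_window(mass, ppm):
    """Return (low, high) bounds around mass for a symmetric ppm tolerance."""
    delta = mass * ppm * 1e-6
    return mass - delta, mass + delta


def within_ppm(observed, theoretical, ppm):
    """True if observed lies within ppm parts-per-million of theoretical."""
    return abs(observed - theoretical) <= theoretical * ppm * 1e-6
```

  For example, 10 ppm at mass 1000 is a window of +/- 0.01 Da, while the same
  setting at mass 3000 is +/- 0.03 Da.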

* Make --estimate work correctly over cluster.  (Currently takes 6 hours to
  estimate 60--is this worthwhile?  Could we simply estimate one bag and
  multiply by the number of bags?)

* Better shuffling than current model.

* Useful to scale fragment tolerance by charge, too?

* Have --estimate generate a spectrum work count file (*.est?) that can be
  used by --part-split to generate evenly sized parts.  (Check that file is
  newer than params file and ms2 file arguments, and that all ms2 file
  arguments were estimated.)

* Maybe --part-split should generate a downramp of sizes?  It definitely
  should take into account spectra filtered out (== no work), but this
  requires reading all spectra before splitting (which takes more time).

* Fix "cleavage C-terminal mass change" issue.  Should this be interpreted as
  MONO, ! (first fragment regime), or by regime?  Look for similar problems
  elsewhere.

* Make static '[' mod exclude PCA mods.

* Code cleanup, especially in new Python code.  Maybe put some stuff in
  classes.  Could split into multiple source files.

* Mine OMSSA and MyriMatch for ideas.  Look again at X!Tandem and SEQUEST
  papers.

* Need tool to compare two runs, for regression testing purposes.

* Add refinement.  (like X!Tandem?)

* Advanced refinement ideas.  For example, only search a locus for a hit with N
  mods if we got a hit for it with 0..N-1 mods (or maybe 0..N-2?).  Or, only
  search a locus non-tryptically (or semi-tryptically) if we got a tryptic hit
  for it.

** Investigate current SEQUEST search results to see if this looks feasible.

* Think about ways to get more id's per hour of processing time.

* Try to adapt to instrument accuracy.  Maybe start with a narrow parent mass
  range and adaptively widen it.

* Profiling to find slow spots, and for correctness?

* Heavy optimization on inner loop.  Try running from both ends simultaneously.

* Should we try to guarantee that searching is equivalent for all mass regimes,
  to make comparisons more valid?

* Eval speedup: make intensity information integer, or otherwise store it in
  log form so we can add instead of multiplying?
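
  The log-form idea rests on log(a*b) = log(a) + log(b).  The toy sketch below
  only demonstrates that identity on a plain product of positive values; it
  does not reproduce greylag's actual evaluation, and both function names are
  made up for the example.

```python
import math


def product_score(intensities):
    """Multiply the (positive) intensities together."""
    score = 1.0
    for x in intensities:
        score *= x
    return score


def log_sum_score(log_intensities):
    """Add intensities that were stored as logs ahead of time."""
    return sum(log_intensities)


# Logs are taken once, when the spectrum is loaded, not in the inner loop.
intensities = [2.0, 4.0, 0.5]
logs = [math.log(x) for x in intensities]
```

  Here math.exp(log_sum_score(logs)) agrees with product_score(intensities) up
  to rounding.  Zero intensities would need special handling, since log(0) is
  undefined.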

* Try to switch FP code to use integers instead?

* Rigorously check all values coming in from Python (at least by assert).

* We can now pre-build a peptide index if we want to.  The main utility of this
  is that it would allow us to avoid searching a spectrum against the same
  peptide multiple times (saving perhaps 30% runtime for one real database).
  Alternatively, maybe we could just generate a description of peptides to be
  masked out.

* Look at moving C++ code to C+ctypes, or maybe Pyrex?

* Incrementalize the whole program.  Want to be able to take an existing run
  and spend more time on it to get more results, possibly concentrating on a
  particular kind of modification.

* Try to figure out whether SEQUEST is really searching everything, or whether
  it gives up in certain cases like X!Tandem does.

* Isotopes S34 and C13 are common (4%, 1%).  Is there a good way to look for
  them?  We could look for singleton occurrences pretty cheaply using a delta
  mod type procedure.

* Splitting idea: Rather than having all parts be equal, maybe it's better for
  the parts processed first to be bigger, with smallest parts processed last,
  so that they can fill in the final gaps (leading all processors to finish at
  about the same time).  What should the split curve look like?  Linear, but
  what slope?  No split should be smaller than one spectrum (after
  filtering).
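
  One way to experiment with a linear downramp is sketched below.  The slope
  is an arbitrary assumption (the first part is weighted n_parts times the
  last), the name downramp_sizes is hypothetical, and n_spectra is taken to be
  the count after filtering.

```python
def downramp_sizes(n_spectra, n_parts):
    """Split n_spectra into n_parts sizes that shrink roughly linearly.

    Every part gets at least one spectrum and the sizes sum to n_spectra.
    """
    if not 1 <= n_parts <= n_spectra:
        raise ValueError("need 1 <= n_parts <= n_spectra")
    weights = [n_parts - i for i in range(n_parts)]  # n_parts, ..., 2, 1
    total = sum(weights)
    spare = n_spectra - n_parts  # left over after one spectrum per part
    sizes = []
    assigned = 0
    running = 0
    for w in weights:
        running += w
        # Cumulative share of the spare spectra, in integer arithmetic so
        # the final total is exact.
        target = spare * running // total
        sizes.append(1 + target - assigned)
        assigned = target
    sizes.sort(reverse=True)  # largest parts first
    return sizes
```

  For 100 spectra in 4 parts this gives [39, 30, 20, 11].  Other slopes can be
  tried by changing how the weights are generated.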

* Possible generation optimization: Figure out the maximum number of mods,
  which would be the number that could be added to the smallest, lightest
  peptide without exceeding the mass of the largest spectrum parent mass.
  Probably not worth doing?  Similarly, if we know all bags of size N are too
  large, and all deltas are positive, we can skip larger bags.
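
  The bound described above can be written down directly when all deltas are
  positive.  A rough sketch with a hypothetical function name; it ignores
  parent mass tolerance and the number of modifiable sites on the peptide:

```python
import math


def max_mod_count(lightest_peptide_mass, largest_parent_mass, deltas):
    """Upper bound on how many mods can fit under the largest parent mass.

    Assumes every delta is positive, so the smallest delta gives the
    loosest (largest) bound.
    """
    if not deltas or min(deltas) <= 0:
        raise ValueError("bound only holds when all deltas are positive")
    headroom = largest_parent_mass - lightest_peptide_mass
    if headroom < 0:
        return 0
    return math.floor(headroom / min(deltas))
```

  With a lightest peptide of 500, a largest parent mass of 600, and deltas of
  15.995 and 79.966, the bound is 6 mods.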

* Can we make the parts restartable?  If so, maybe this could be used to load
  balance, recover from crashes, etc.

* Is NOTHROW faster or slower?

* If we see a good hit for a spectrum, we could try to see if there's an
  identifiable tag.  If so, could restrict further searching to peptides with
  that tag.


TO FILE:

* What is this C+57 mod called?  Carboxyamidomethyl?  +C2H3ON!

* Is there anything we can do with neutral losses?

* test spectrum synthesis

* test semi-tryptic cleavage

* double-check handling of FP arithmetic using epsilons (no ==, no strict <)
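
  As a reminder of what the epsilon check is looking for, here is a sketch in
  Python of comparisons that tolerate rounding error.  The helper names and
  the EPSILON value are assumptions for illustration; the real code may need a
  different tolerance.

```python
EPSILON = 1e-9  # assumed absolute tolerance; tune to the data


def fp_equal(a, b, eps=EPSILON):
    """Use instead of a == b."""
    return abs(a - b) <= eps


def fp_less_equal(a, b, eps=EPSILON):
    """Use instead of a strict comparison at a boundary: a may exceed b
    by up to eps and still count as within the limit."""
    return a <= b + eps
```

  For instance, 0.1 + 0.2 == 0.3 is False in floating point, but
  fp_equal(0.1 + 0.2, 0.3) is True.  Python's math.isclose covers the equality
  case with a relative tolerance.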