GREYLAG TODO LIST                                               -*-outline-*-


==============================================================================

OVERALL GOALS

1.  Replace SEQUEST at SIMR with something at least as good.
2.  Do better than SEQUEST for things that SIMR cares about.
3.  Showcase the strategy of implementing in Python with a C-ish inner loop.
4.  Try to take the best ideas from other similar programs.
5.  Make greylag a pedagogical artifact and a foundation for further
    experimentation.

==============================================================================


MILESTONE M1:

* Good first impression
* Basic correctness
* Handles at least LCQ input
* Nonspecific cleavage only?
* Generates SQT output, usable in our pipeline, at least for non-N15 runs
* Decent performance/efficiency on our clusters

MILESTONE M2:

* Basic public website/source release/git archive
* Documentation (asciidoc/man pages)



TASK QUEUE

* MINI-GOAL: Get something working that can be tested against MM/SEQUEST
* MINI-GOAL: basic greylag-process usable on our cluster (no mods)


* Basic optimization
** look at callgrind output

* update estimate factor

* Update docstrings
* Update test cases

* Compare greylag/SEQUEST/MM on test-myrimatch example (non-specific)
** whole file
** Note: SEQUEST parent tolerance differs

* Evaluate performance differences vs SEQUEST/MM/Xtandem?

* Design and implement greylag master process (work manifests?)

* Look for more dead code to remove

* Redo cleavage point code (enzyme and non-specific)

* Mod/regime info in SQT output program (greylag-sqt)
** R lines to document mass regimes
** A lines for each M line to document regime (index of R) and residue mods
** Try to keep back compatibility by grepping out A/R lines

* Look at memory usage
** maybe avoid spectrum name copies
** maybe avoid locus name copies
** instead of copying db sequences, use Python's?
** kill off cleavage point lists (4M?)


= M1 =========================================================================

* greylag-solo

* Implement MM smart +3 model?
** Is it better?

* Try to generate a valid MyriMatch (bombs on boost random assertion)

* Try the MM precursor mass adjustment.  Much improvement?  Even a good idea?

* Test tolerance monotonicity


* Examine DBValidate
** Design similar statistical evaluation
** Look at what we do here (paper)


* Add isotope jitter feature, for Orbitrap.

  xtandem considers one C13 if MH>1000, and one or two C13 if MH>1500.  Should
  we try to predict this based on the peptide sequence?  MH is probably close
  enough.  What does MyriMatch do?
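
  A minimal sketch of the xtandem-style rule quoted above, assuming the
  thresholds are applied to MH directly.  The function name and the constant
  are illustrative and not part of greylag.

```python
# Hypothetical helper (not greylag code): parent-mass offsets to try,
# following the "one C13 above 1000, up to two above 1500" rule.
C13_C12_DELTA = 1.0033548  # mass difference between C13 and C12, in Da


def isotope_jitter_offsets(mh):
    """Return the parent-mass offsets (Da) to consider for a given MH."""
    if mh > 1500:
        max_c13 = 2
    elif mh > 1000:
        max_c13 = 1
    else:
        max_c13 = 0
    # Offset 0 (no C13) is always included.
    return [n * C13_C12_DELTA for n in range(max_c13 + 1)]
```

  Each offset would be subtracted from the observed parent mass before the
  usual tolerance check, so the cost grows with the number of offsets.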


* Implement MyriMatch charge-calling algorithm?

* Implement MyriMatch deisotoping?


* Look more closely at specific MM vs GL match diffs.


* Figure out how to handle multiple residue mods (delta, isotope, etc)

* Clean up [,] (N/C-terminal) mass regime calculations

* Pass through the C++ code looking for counts that could conceivably overflow
** Fix or add assertions

* Add duplicate peptide masking optimization
** This will obviate the need to detect identical best matches at search time?
** Fix redundant peptide reporting

* Make sure greylag is 64-bit clean.


* Time a real mod search vs SEQUEST (and xtandem?).  Is the time reasonable?
  MAKE SURE PARAMS ARE COMPARABLE!  Ballpark correctness?

* Create direct DTASelect.txt output?
  (This seems to be sufficient to support most or all DTASelect output.)

* Make a tool to compare greylag vs SEQUEST results by spectrum.  Want gross
  statistics--how many IDs are the same, different, missing, etc.  For each
  spectrum, want to see what each program did, and how many times the assigned
  locus was otherwise identified.
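
  One possible shape for the gross statistics, as a sketch only.  It assumes
  each program's results have already been parsed into a dict mapping spectrum
  name to assigned locus; that input form and both function names are
  hypothetical.

```python
from collections import Counter


def compare_ids(greylag_ids, sequest_ids):
    """Tally agreement per spectrum.

    Each argument is a (hypothetical) dict of spectrum name -> assigned
    locus; spectra with no identification are simply absent.
    """
    stats = Counter()
    for spectrum in set(greylag_ids) | set(sequest_ids):
        g = greylag_ids.get(spectrum)
        s = sequest_ids.get(spectrum)
        if g == s:
            stats["same"] += 1
        elif g is None:
            stats["greylag_missing"] += 1
        elif s is None:
            stats["sequest_missing"] += 1
        else:
            stats["different"] += 1
    return stats


def locus_hit_counts(ids):
    """Count how many spectra were assigned to each locus."""
    return Counter(ids.values())
```

  For a spectrum where the two programs disagree, looking up each assigned
  locus in locus_hit_counts() shows how often it was identified elsewhere.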

* Investigate identification differences between greylag and SEQUEST.

* Careful timing and correctness check for
  /n/proteomics/mkc/HsProA-Control_S100_Ti_1_H_2006-03-03_wSHUFFLED-greylag

* Design and implement tracing of mass regime/PCA/fixed and non-fixed
  deltas/etc into output file.  Try to stay compatible with xtandem.

* PPM error tolerances (MyriMatch doesn't implement this?)
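
  For reference, a PPM tolerance scales with the mass being matched instead of
  being a fixed number of daltons.  A minimal sketch (the function name is
  illustrative, not an existing greylag function):

```python
def ppm_window(mass, ppm):
    """Return (low, high) mass bounds for a tolerance in parts per million."""
    delta = mass * ppm * 1e-6
    return mass - delta, mass + delta
```

  At 10 ppm a 1000 Da parent gets a window of +/- 0.01 Da, and a 3000 Da
  parent gets +/- 0.03 Da.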

* Make --estimate work correctly over cluster.  (Currently takes 6 hours to
  estimate 60--is this worthwhile?  Could we simply estimate one bag and
  multiply by the number of bags?)

* Better shuffling than current model.

* Useful to scale fragment tolerance by charge, too?

* Have --estimate generate a spectrum work count file (*.est?) that can be
  used by --part-split to generate evenly sized parts.  (Check that file is
  newer than params file and ms2 file arguments, and that all ms2 file
  arguments were estimated.)

* Maybe --part-split should generate a downramp of sizes?  It definitely
  should take into account spectra filtered out (== no work), but this
  requires reading all spectra before splitting (which takes more time).

* Fix "cleavage C-terminal mass change" issue.  Should this be interpreted as
  MONO, ! (first fragment regime), or by regime?  Look for similar problems
  elsewhere.

* Make static '[' mod exclude PCA mods.

* Code cleanup, especially in new Python code.  Maybe put some stuff in
  classes.  Could split into multiple source files.

* Mine OMSSA and MyriMatch for ideas.  Look again at X!Tandem and SEQUEST
  papers.

* Need tool to compare two runs, for regression testing purposes.

* Add refinement.  (like xtandem?)

* Advanced refinement ideas.  For example, only search a locus for a hit with N
  mods if we got a hit for it with 0..N-1 mods (or maybe 0..N-2?).  Or, only
  search a locus non-tryptically (or semi-tryptically) if we got a tryptic hit
  for it.

** Investigate current SEQUEST search results to see if this looks feasible.

* Think about ways to get more IDs per hour of processing time.

* Try to adapt to instrument accuracy.  Maybe start with a narrow parent mass
  range and adaptively widen it.

* Profiling to find slow spots, and for correctness?

* Heavy optimization on inner loop.  Try running from both ends simultaneously.

* Should we try to guarantee that searching is equivalent for all mass regimes,
  to make comparisons more valid?

* Eval speedup: make intensity information integer, or otherwise store it in
  log form so we can add instead of multiplying?
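
  The log-form idea rests on log(a*b) = log(a) + log(b).  A toy sketch,
  assuming a score that is a plain product of per-peak intensities (the real
  scoring function may differ, and both function names are made up here):

```python
import math


def product_score(intensities):
    """Multiply intensities together (the multiply-based form)."""
    score = 1.0
    for x in intensities:
        score *= x
    return score


def log_sum_score(log_intensities):
    """Add pre-computed log intensities; this equals log(product_score)."""
    return sum(log_intensities)
```

  The logs would be taken once, when the spectrum is loaded, so the inner loop
  only adds.  Ranking by the log score gives the same order as ranking by the
  product, since log is monotonic.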

* Try to switch FP code to use integers instead?

* Rigorously check all values coming in from Python (at least by assert).

* We can now pre-build a peptide index if we want to.  The main utility of this
  is that it would allow us to avoid searching a spectrum against the same
  peptide multiple times (saving perhaps 30% runtime for one real database).
  Alternatively, maybe we could just generate a description of peptides to be
  masked out.

* Look at moving C++ code to C+ctypes, or maybe Pyrex?

* Incrementalize the whole program.  Want to be able to take an existing run
  and spend more time on it to get more results, possibly concentrating on a
  particular kind of modification.

* Try to figure out whether SEQUEST is really searching everything, or whether
  it gives up in certain cases like X!Tandem does.

* Isotopes S34 and C13 are common (4%, 1%).  Is there a good way to look for
  them?  We could look for singleton occurrences pretty cheaply using a delta
  mod type procedure.

* Splitting idea: Rather than having all parts be equal, maybe it's better for
  the parts processed first to be bigger, with smallest parts processed last,
  so that they can fill in the final gaps (leading all processors to finish at
  about the same time).  What should the split curve look like?  Linear, but
  what slope?  No split should be smaller than one spectrum (after
  filtering).
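
  A sketch of one linear downramp, to make the question concrete.  The slope
  is expressed as an assumed `ratio` between the first and last part sizes;
  the default of 3 is arbitrary, and the function is not part of greylag.

```python
def downramp_sizes(n_spectra, n_parts, ratio=3.0):
    """Split n_spectra into part sizes that shrink linearly.

    ratio is the (assumed) target size of the first part relative to the
    last.  Every part gets at least one spectrum, so the number of parts
    is capped at n_spectra.
    """
    n_parts = max(1, min(n_parts, n_spectra))
    if n_parts == 1:
        return [n_spectra]
    # Linear weights running from `ratio` down to 1.
    weights = [ratio - (ratio - 1.0) * i / (n_parts - 1)
               for i in range(n_parts)]
    spare = n_spectra - n_parts  # what is left after one spectrum per part
    total = sum(weights)
    sizes = [1 + int(spare * w / total) for w in weights]
    # Hand any rounding leftovers to the earliest (largest) parts.
    leftover = n_spectra - sum(sizes)
    for i in range(leftover):
        sizes[i % n_parts] += 1
    return sizes
```

  n_spectra here is the count after filtering, so the "no part smaller than
  one spectrum" constraint holds by construction.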

* Look carefully at the statistics code.  Problems?

* Implement "cyclic permutation" of xtandem.

* Possible generation optimization: Figure out the maximum number of mods,
  which would be the number that could be added to the smallest, lightest
  peptide without exceeding the mass of the largest spectrum parent mass.
  Probably not worth doing?  Similarly, if we know all bags of size N are too
  large, and all deltas are positive, we can skip larger bags.
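
  The bound described above can be sketched as follows, assuming all deltas
  are positive and ignoring both the parent mass tolerance and the number of
  modifiable residues.  The function and parameter names are illustrative.

```python
def max_mod_count(lightest_peptide_mass, smallest_delta, largest_parent_mass):
    """Upper bound on mods per peptide, assuming every delta is positive."""
    if smallest_delta <= 0:
        raise ValueError("bound only holds when all deltas are positive")
    headroom = largest_parent_mass - lightest_peptide_mass
    if headroom < 0:
        return 0
    # The smallest delta gives the loosest (largest) count.
    return int(headroom // smallest_delta)
```

  A real version would add the tolerance to the headroom before dividing.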

* Can we make the parts restartable?  If so, maybe this could be used to load
  balance, recover from crashes, etc.

* Is NOTHROW faster or slower?

* If we see a good hit for a spectrum, we could try to see if there's an
  identifiable tag.  If so, could restrict further searching to peptides with
  that tag.

* Motif-based differential deltas (like xtandem).



TO FILE:

* What is this C+57 mod called?  Carboxyamidomethyl?  +C2H3ON!

* Is there anything we can do with neutral losses?

* test spectrum synthesis

* test semi-tryptic cleavage

* double-check handling of FP arithmetic using epsilons (no ==, no strict <)
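
  What the epsilon rule might look like in practice, as a sketch.  The
  tolerance value and helper names are assumptions; the right epsilon depends
  on the mass scale in use.

```python
EPSILON = 1e-9  # assumed absolute tolerance, not a greylag constant


def fp_equal(a, b, eps=EPSILON):
    """Use in place of a == b."""
    return abs(a - b) <= eps


def fp_less_or_equal(a, b, eps=EPSILON):
    """Use in place of a strict a < b (or a plain a <= b) at a boundary."""
    return a <= b + eps
```

  For example, 0.1 + 0.2 == 0.3 is False in binary floating point, while
  fp_equal(0.1 + 0.2, 0.3) is True.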