GREYLAG TODO LIST                                               -*-outline-*-


==============================================================================

OVERALL GOALS

1.  Replace SEQUEST at SIMR with something at least as good.
2.  Do better than SEQUEST for things that SIMR cares about.
3.  Showcase Python w/C-ish inner loop code implementation strategy.
4.  Try to take the best ideas from other similar programs.
5.  Greylag as a pedagogical artifact and foundation for further
    experimentation.

==============================================================================


MILESTONE M1:

* Good first impression
* Basic correctness
* Handles at least LCQ input
* Nonspecific cleavage only?
* Generates SQT output, usable in our pipeline, at least for non-N15 runs
* Decent performance/efficiency on our clusters

MILESTONE M2:

* Basic public website/source release/git archive
* Documentation (asciidoc/man pages)



TASK QUEUE

* MINI-GOAL: Get something working that can be tested against MM/SEQUEST
* MINI-GOAL: basic greylag-process usable on our cluster (no mods)


* Redo cleavage point code (enzyme and non-specific)
** speed/space?

* Basic optimization
** look at callgrind output
*** search_run
** maybe de-STL things
*** use an mz list instead of a theoretical spectrum

* Check compile/results on 64-bit host

* Update test cases

* Mod/regime info in SQT output program (greylag-sqt)
** R lines to document mass regimes
** A lines for each M line to document regime (index of R) and residue mods
** Try to keep backward compatibility by grepping out A/R lines

* Update estimate factor

* Update docstrings

* Evaluate performance differences vs SEQUEST/MM/X!Tandem?

* Design and implement greylag master process (work manifests?)

* Look for more dead code to remove

* Look at memory usage
** maybe avoid spectrum name copies
** maybe avoid locus name copies
** instead of copying db sequences, use Python's?
** kill off cleavage point lists (4M?)

* Compare greylag/SEQUEST/MM on test-myrimatch example (non-specific)
** look at MM (whole file)
** Note: SEQUEST parent tolerance differs


= M1 =========================================================================

* greylag-solo

* Implement MM smart +3 model?
** Is it better?

* Try to generate a valid MyriMatch (bombs on boost random assertion)

* Try the MM precursor mass adjustment--much improvement?  even a good idea?

* Test tolerance monotonicity


* Examine DBValidate
** Design similar statistical evaluation
** Look at what we do here (paper)


* Add isotope jitter feature, for Orbitrap.

  xtandem considers one C13 if MH>1000, and one/two C13 if MH>1500.  Should we
  try to predict this based on the peptide sequence?  MH probably close
  enough.  What does MyriMatch do?
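
  A minimal sketch of the MH-threshold rule described above, for discussion
  only.  The thresholds are taken from this note and have not been checked
  against the xtandem source.  The names C13_DELTA, c13_counts and
  jittered_parent_masses are hypothetical; the only outside fact is the
  C13-C12 mass difference of roughly 1.00335 Da.

```python
# Approximate mass difference between C13 and C12, in Da.
C13_DELTA = 1.0033548


def c13_counts(mh):
    """Return the C13 substitution counts to consider for parent mass MH.

    Thresholds follow the note above (an assumption, not verified).
    """
    counts = [0]
    if mh > 1000:
        counts.append(1)
    if mh > 1500:
        counts.append(2)
    return counts


def jittered_parent_masses(mh):
    """Candidate monoisotopic parent masses for an observed MH.

    Assumes the observed mass may include 0, 1 or 2 C13 atoms, so each
    candidate subtracts that many C13-C12 differences.
    """
    return [mh - n * C13_DELTA for n in c13_counts(mh)]
```

  Each candidate mass would then be searched with the usual parent mass
  tolerance, so the cost grows with the number of candidates per spectrum.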


* Implement MyriMatch charge-calling algorithm?

* Implement MyriMatch deisotoping?


* Look more closely at specific MM vs GL match diffs.


* Figure out how to handle multiple residue mods (delta, isotope, etc)

* Clean up [,] (N/C-terminal) mass regime calculations

* Pass through the C++ code looking for counts that could conceivably overflow
** Fix or add assertions

* Add duplicate peptide masking optimization
** This will obviate the need to detect identical best matches at search time?
** Fix redundant peptide reporting

* Make sure greylag is 64-bit clean.


* Time a real mod search vs SEQUEST (and xtandem?), is time reasonable?
  MAKE SURE PARAMS ARE COMPARABLE!  Ballpark correctness?

* Create direct DTASelect.txt output?
  (This seems to be sufficient to support most or all DTASelect output.)

* Make a tool to compare greylag vs SEQUEST results by spectrum.  Want gross
  statistics--how many id's are the same, different, missing, etc.  For each
  spectrum, want to see what each program did, and how many times the assigned
  locus was otherwise id'ed.

* Investigate identification differences between greylag and SEQUEST.

* Careful timing and correctness check for
  /n/proteomics/mkc/HsProA-Control_S100_Ti_1_H_2006-03-03_wSHUFFLED-greylag

* Design and implement tracing of mass regime/PCA/fixed and non-fixed
  deltas/etc into output file.  Try to stay compatible with xtandem.

* PPM error tolerances (MyriMatch doesn't implement this?)
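
  A sketch of what a PPM tolerance check could look like, assuming the
  tolerance is taken relative to the theoretical mass.  The function names
  are hypothetical and not part of greylag.

```python
def ppm_to_da(mass, ppm):
    """Convert a parts-per-million tolerance into Da at the given mass."""
    return mass * ppm / 1e6


def within_ppm(observed, theoretical, ppm):
    """True if observed is within ppm of theoretical (relative to the latter)."""
    return abs(observed - theoretical) <= ppm_to_da(theoretical, ppm)
```

  Unlike a fixed Da window, the allowed error here grows linearly with mass,
  which is the point of expressing the tolerance in PPM.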

* Make --estimate work correctly over cluster.  (Currently takes 6 hours to
  estimate 60--is this worthwhile?  Could we simply estimate one bag and
  multiply by the number of bags??)

* Better shuffling than current model.

* Useful to scale fragment tolerance by charge, too?

* Have --estimate generate a spectrum work count file (*.est?) that can be
  used by --part-split to generate evenly sized parts.  (Check that file is
  newer than params file and ms2 file arguments, and that all ms2 file
  arguments were estimated.)

* Maybe --part-split should generate a downramp of sizes?  It definitely
  should take into account spectra filtered out (== no work), but this
  requires reading all spectra before splitting (which takes more time).

* Fix "cleavage C-terminal mass change" issue.  Should this be interpreted as
  MONO, ! (first fragment regime), or by regime?  Look for similar problems
  elsewhere.

* Make static '[' mod exclude PCA mods.

* Code cleanup, especially in new Python code.  Maybe put some stuff in
  classes.  Could split into multiple source files.

* Mine OMSSA and MyriMatch for ideas.  Look again at X!Tandem and SEQUEST
  papers.

* Need tool to compare two runs, for regression testing purposes.

* Add refinement.  (like xtandem?)

* Advanced refinement ideas.  For example, only search a locus for a hit with N
  mods if we got a hit for it with 0..N-1 mods (or maybe 0..N-2?).  Or, only
  search a locus non-tryptically (or semi-tryptically) if we got a tryptic hit
  for it.

** Investigate current SEQUEST search results to see if this looks feasible.

* Think about ways to get more id's per hour of processing time.

* Try to adapt to instrument accuracy.  Maybe start with a narrow parent mass
  range and adaptively widen it.

* Profiling to find slow spots, and for correctness?

* Heavy optimization on inner loop.  Try running from both ends simultaneously.

* Should we try to guarantee that searching is equivalent for all mass regimes,
  to make comparisons more valid?

* Eval speedup: make intensity information integer, or otherwise store it in
  log form so we can add instead of multiplying?
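
  A toy illustration of the log-form idea, assuming the score in question is
  a product of per-peak intensities (an assumption made only for this sketch;
  the real scoring code may differ).  All names are hypothetical.

```python
import math


def product_score(intensities):
    """Score as a running product of intensities (one multiply per peak)."""
    score = 1.0
    for x in intensities:
        score *= x
    return score


def log_sum_score(log_intensities):
    """Same score in log space (one add per peak)."""
    return sum(log_intensities)


intensities = [2.0, 8.0, 4.0]
# The logs would be computed once when the spectrum is loaded.
logs = [math.log(x) for x in intensities]
```

  Because log is monotonic, ranking candidates by log_sum_score gives the same
  order as ranking by product_score, so the exp() is only needed if the raw
  value must be reported.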

* Try to switch FP code to use integers instead?

* Rigorously check all values coming in from Python (at least by assert).

* We can now pre-build a peptide index if we want to.  The main utility of this
  is that it would allow us to avoid searching a spectrum against the same
  peptide multiple times (saving perhaps 30% runtime for one real database).
  Alternatively, maybe we could just generate a description of peptides to be
  masked out.

* Look at moving C++ code to C+ctypes, or maybe pyrex?

* Incrementalize the whole program.  Want to be able to take an existing run
  and spend more time on it to get more results, possibly concentrating on a
  particular kind of modification.

* Try to figure out whether SEQUEST is really searching everything, or whether
  it gives up in certain cases like X!Tandem does.

* Isotopes S34 and C13 are common (4%, 1%).  Is there a good way to look for
  them?  We could look for singleton occurrences pretty cheaply using a delta
  mod type procedure.

* Splitting idea: Rather than having all parts be equal, maybe it's better for
  the parts processed first to be bigger, with smallest parts processed last,
  so that they can fill in the final gaps (leading all processors to finish at
  about the same time).  What should the split curve look like?  Linear, but
  what slope?  No split should be smaller than one spectrum (after
  filtering).
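
  One way the linear downramp could be sketched, assuming the spectrum count
  is taken after filtering.  The function name and the first/last weight
  parameters are hypothetical; they only stand in for the open question of
  what slope to use.

```python
def downramp_sizes(n_spectra, n_parts, first=3, last=1):
    """Split n_spectra into n_parts sizes that decrease linearly.

    first and last are integer relative weights for the first and last
    parts.  Every part gets at least one spectrum and the sizes sum to
    n_spectra.
    """
    if n_parts < 1 or n_spectra < n_parts:
        raise ValueError("need at least one spectrum per part")
    if last < 1 or first < last:
        raise ValueError("need first >= last >= 1")
    if n_parts == 1:
        return [n_spectra]
    # Integer weights falling linearly from first to last (scaled by n_parts-1).
    weights = [first * (n_parts - 1 - i) + last * i for i in range(n_parts)]
    total = sum(weights)
    # Reserve one spectrum per part, then share the rest by weight.
    spare = n_spectra - n_parts
    sizes = [1 + spare * w // total for w in weights]
    # Integer division leaves fewer than n_parts spectra unassigned; give
    # them to the earliest (largest) parts.
    for i in range(n_spectra - sum(sizes)):
        sizes[i] += 1
    return sizes
```

  Integer arithmetic is used throughout so that the sizes always sum exactly
  to the number of spectra.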

* Look carefully at the statistics code.  Problems?

* Implement "cyclic permutation" of xtandem.

* Possible generation optimization: Figure out the maximum number of mods,
  which would be the number that could be added to the smallest, lightest
  peptide without exceeding the mass of the largest spectrum parent mass.
  Probably not worth doing?  Similarly, if we know all bags of size N are too
  large, and all deltas are positive, we can skip larger bags.
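
  The bound described above can be sketched as follows, assuming all deltas
  are positive (as the note requires) and ignoring everything except mass.
  The function name and the optional tolerance parameter are hypothetical.

```python
import math


def max_mod_count(lightest_peptide_mass, largest_parent_mass, deltas,
                  tolerance=0.0):
    """Upper bound on the number of mods any peptide could carry.

    Uses the smallest delta on the lightest peptide, so no heavier peptide
    or larger delta can fit more mods under the largest parent mass.
    """
    if not deltas or min(deltas) <= 0:
        raise ValueError("bound only holds when all deltas are positive")
    headroom = largest_parent_mass + tolerance - lightest_peptide_mass
    if headroom < 0:
        return 0
    return int(math.floor(headroom / min(deltas)))
```

  For example, max_mod_count(500.0, 600.0, [16.0, 80.0]) leaves 100 Da of
  headroom and a smallest delta of 16 Da, so bags larger than six mods could
  be skipped.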

* Can we make the parts restartable?  If so, maybe this could be used to load
  balance, recover from crashes, etc.

* Is NOTHROW faster or slower?

* If we see a good hit for a spectrum, we could try to see if there's an
  identifiable tag.  If so, could restrict further searching to peptides with
  that tag.

* Motif-based differential deltas (like xtandem).



TO FILE:

* What is this C+57 mod called?  Carboxyamidomethyl?  +C2H3ON!
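
  As far as we know, Unimod lists this modification as Carbamidomethyl, with
  carboxyamidomethylation given as an alternative name.  A quick check that
  the C2H3NO composition matches the +57 delta, using standard monoisotopic
  element masses rounded to seven or eight decimals.  The names below are
  hypothetical.

```python
# Monoisotopic element masses in Da (rounded).
MONO_MASS = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}


def formula_mass(composition):
    """Monoisotopic mass of a composition given as {element: count}."""
    return sum(MONO_MASS[el] * n for el, n in composition.items())


# Composition from the note above: +C2H3ON.
CARBAMIDOMETHYL = {"C": 2, "H": 3, "N": 1, "O": 1}
```

  formula_mass(CARBAMIDOMETHYL) comes to about 57.0215 Da, which matches the
  usual monoisotopic value for the C+57 fixed mod.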

* Is there anything we can do with neutral losses?

* test spectrum synthesis

* test semi-tryptic cleavage

* double-check handling of FP arithmetic using epsilons (no ==, no strict <)
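
  A sketch of epsilon-based comparisons of the kind this item asks for,
  written in Python for illustration (the C++ code would need equivalents).
  EPSILON and the helper names are hypothetical, and an absolute tolerance is
  assumed; Python's math.isclose is an alternative when a relative tolerance
  is wanted.

```python
EPSILON = 1e-9  # hypothetical absolute tolerance


def fp_equal(a, b, eps=EPSILON):
    """Replacement for a == b."""
    return abs(a - b) <= eps


def fp_less(a, b, eps=EPSILON):
    """Replacement for strict a < b: a must be below b by more than eps."""
    return a < b - eps


def fp_less_equal(a, b, eps=EPSILON):
    """Replacement for a <= b: a may exceed b by at most eps."""
    return a <= b + eps
```

  With these, 0.1 + 0.2 compares equal to 0.3 and is not reported as greater,
  which a bare == or < would get wrong.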