GREYLAG TODO LIST                                               -*-outline-*-


==============================================================================

OVERALL GOALS

1.  Replace SEQUEST at SIMR with something at least as good.
2.  Do better than SEQUEST for things that SIMR cares about.
3.  Showcase Python w/C-ish inner loop code implementation strategy.
4.  Try to take the best ideas from other similar programs.
5.  Greylag as a pedagogical artifact and foundation for further
    experimentation.

==============================================================================


MILESTONE M1:

* Good first impression
* Basic correctness
* Handles at least LCQ input
* Nonspecific cleavage only?
* Generates SQT output, usable in our pipeline, at least for non-N15 runs
* Decent performance/efficiency on our clusters

MILESTONE M2:

* Basic public website/source release/git archive
* Documentation (asciidoc/man pages)



TASK QUEUE

* MINI-GOAL: Get something working that can be tested against MM/SEQUEST
* MINI-GOAL: basic greylag-process usable on our cluster (no mods)


* Mod/regime info in SQT output program (greylag-sqt)
** R lines to document mass regimes

  R     0       A       71.037114
  R     0       C       103.009184
  ...

  regime + fixed mods + PCA mod + diff mod + terminal mod + isotope mods (N15 or
    natural)

1 for backward compatibility, support ASDC*ASDF form, and maybe A1S2D1C3A1S2D1F4
       form
1 need a means to specify mapping preference?

** A lines for each M line to document regime (index of R) and residue mods

3 AR    0
4 APCA  +17.0
2 AM    12      +32.34  phosphorylation
3 AM    5       +57.0
4 AT    ]       +18.0   name
4 AI    14      +1

** are these required?
  H       StaticMod       C=160.1388
  H       DiffMod TNA*=+2.0



** Try to keep backward compatibility by grepping out A/R lines
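
  A minimal sketch of the "grepping out" idea, assuming the proposed R and A
  lines are whitespace-delimited and that their first field is one of the tags
  used in the examples above (R, AR, APCA, AM, AT, AI).  The function name
  strip_regime_lines and the sample S/M/L lines are hypothetical, not part of
  greylag or a faithful SQT record.

```python
# Hypothetical tag set, taken from the example lines in this list; the final
# format may use different tags.
NEW_LINE_TAGS = {"R", "AR", "APCA", "AM", "AT", "AI"}


def strip_regime_lines(lines):
    """Yield only the lines that an older SQT reader would expect to see."""
    for line in lines:
        fields = line.split()
        if fields and fields[0] in NEW_LINE_TAGS:
            continue  # drop the proposed R/A annotation lines
        yield line


# Made-up sample input, for illustration only.
sqt = [
    "H\tStaticMod\tC=160.1388",
    "R\t0\tA\t71.037114",
    "S\t1\t1\t2",
    "M\t1\t1\t1234.5",
    "AM\t12\t+32.34\tphosphorylation",
    "L\tlocus1",
]
legacy = list(strip_regime_lines(sqt))
```

  Filtering on the first field rather than the first character keeps any
  existing line type that merely begins with "A" or "R" from being dropped.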

* Why aren't scores a little more stable?

* Update estimate factor

* Update docstrings

* Evaluate performance differences vs SEQUEST/MM/X!Tandem?

* Design and implement greylag master process (work manifests?)

* Look for more dead code to remove

* Look at memory usage
** maybe avoid spectrum name copies
** maybe avoid locus name copies
** instead of copying db sequences, use Python's?

* Compare greylag/SEQUEST/MM on test-myrimatch example (non-specific)
** look at MM (whole file)
** Note: SEQUEST parent tolerance differs

* Basic optimization
** look at callgrind output

* Further test case updates/adds


= M1 =========================================================================



* Do a cg-admin-rewritehist before publishing git archive?

= M2 =========================================================================

* Implement MM smart +3 model?
** Is it better?

* Try to generate a valid MyriMatch (bombs on boost random assertion)

* Try the MM precursor mass adjustment--much improvement?  Even a good idea?

* Test tolerance monotonicity


* Examine DBValidate
** Design similar statistical evaluation
** Look at what we do here (paper)


* Add isotope jitter feature, for Orbitrap.

  X!Tandem considers one C13 if MH>1000, and one/two C13 if MH>1500.  Should we
  try to predict this based on the peptide sequence?  MH is probably close
  enough.  What does MyriMatch do?
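
  A minimal sketch of the MH-based rule described above, using the thresholds
  exactly as quoted in this note (they have not been checked against the
  X!Tandem source here).  The function name isotope_jitter_offsets is
  hypothetical.  Each returned offset would be subtracted from the observed
  parent mass to get a candidate monoisotopic mass.

```python
C13_DELTA = 1.0033548  # mass difference between C13 and C12, in Da


def isotope_jitter_offsets(mh):
    """Return the parent-mass offsets to try for a spectrum with this MH."""
    if mh > 1500:
        max_c13 = 2  # consider zero, one, or two C13
    elif mh > 1000:
        max_c13 = 1  # consider zero or one C13
    else:
        max_c13 = 0  # monoisotopic only
    return [n * C13_DELTA for n in range(max_c13 + 1)]


offsets_small = isotope_jitter_offsets(900.0)
offsets_medium = isotope_jitter_offsets(1200.0)
offsets_large = isotope_jitter_offsets(1600.0)
```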


* Implement MyriMatch charge-calling algorithm?

* Implement MyriMatch deisotoping?


* Look more closely at specific MM vs GL match diffs.


* Figure out how to handle multiple residue mods (delta, isotope, etc.)

* Clean up [,] (N/C-terminal) mass regime calculations

* Pass through the C++ code looking for counts that could conceivably overflow
** Fix or add assertions

* Add duplicate peptide masking optimization
** This will obviate the need to detect identical best matches at search time?
** Fix redundant peptide reporting


* Time a real mod search vs SEQUEST (and X!Tandem?).  Is the time reasonable?
  MAKE SURE PARAMS ARE COMPARABLE!  Ballpark correctness?

* Create direct DTASelect.txt output?
  (This seems to be sufficient to support most or all DTASelect output.)

* Make a tool to compare greylag vs SEQUEST results by spectrum.  Want gross
  statistics--how many id's are the same, different, missing, etc.  For each
  spectrum, want to see what each program did, and how many times the assigned
  locus was otherwise id'ed.
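
  A minimal sketch of the gross-statistics part of such a tool, assuming each
  program's results have already been reduced to a mapping from spectrum name
  to best peptide.  That input shape and the name compare_ids are hypothetical;
  parsing real SQT output and the per-locus counts are left out.

```python
from collections import Counter


def compare_ids(greylag_ids, sequest_ids):
    """Count how the two programs agree, spectrum by spectrum."""
    stats = Counter()
    for spectrum in set(greylag_ids) | set(sequest_ids):
        g = greylag_ids.get(spectrum)
        s = sequest_ids.get(spectrum)
        if g is None:
            stats["missing in greylag"] += 1
        elif s is None:
            stats["missing in SEQUEST"] += 1
        elif g == s:
            stats["same"] += 1
        else:
            stats["different"] += 1
    return stats


# Made-up identifications, for illustration only.
greylag_ids = {"s1": "PEPK", "s2": "AAAK", "s3": "CCCR"}
sequest_ids = {"s1": "PEPK", "s2": "AAGK", "s4": "DDDR"}
stats = compare_ids(greylag_ids, sequest_ids)
```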

* Investigate identification differences between greylag and SEQUEST.

* Careful timing and correctness check for
  /n/proteomics/mkc/HsProA-Control_S100_Ti_1_H_2006-03-03_wSHUFFLED-greylag

* Design and implement tracing of mass regime/PCA/fixed and non-fixed
  deltas/etc. into the output file.  Try to stay compatible with X!Tandem.

* PPM error tolerances (MyriMatch doesn't implement this?)
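
  A minimal sketch of what a PPM tolerance means in practice: the allowed
  absolute error scales with the mass being matched.  The function names are
  hypothetical, and measuring the tolerance relative to the theoretical mass
  (rather than the observed one) is an assumption.

```python
def ppm_to_daltons(mass, ppm):
    """Convert a parts-per-million tolerance to an absolute one at this mass."""
    return mass * ppm / 1e6


def within_ppm(observed, theoretical, ppm):
    """True if observed lies within ppm of theoretical (inclusive)."""
    return abs(observed - theoretical) <= ppm_to_daltons(theoretical, ppm)


tolerance_da = ppm_to_daltons(1000.0, 10.0)  # 10 ppm at 1000 Da
close_hit = within_ppm(1000.005, 1000.0, 10.0)
far_miss = within_ppm(1000.02, 1000.0, 10.0)
```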

* Make --estimate work correctly over the cluster.  (Currently takes 6 hours to
  estimate 60--is this worthwhile?  Could we simply estimate one bag and
  multiply by the number of bags?)

* Better shuffling than current model.

* Useful to scale fragment tolerance by charge, too?

* Have --estimate generate a spectrum work count file (*.est?) that can be
  used by --part-split to generate evenly sized parts.  (Check that file is
  newer than params file and ms2 file arguments, and that all ms2 file
  arguments were estimated.)
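
  A minimal sketch of the "newer than" check, based on file modification
  times.  The function name estimate_is_fresh and the file names are
  hypothetical.  Checking that every ms2 argument was actually estimated
  depends on the contents of the work count file, which is not designed yet,
  so that part is omitted.

```python
import os
import tempfile


def estimate_is_fresh(est_path, params_path, ms2_paths):
    """Return True if est_path exists and is newer than every input file."""
    if not os.path.exists(est_path):
        return False
    est_mtime = os.path.getmtime(est_path)
    inputs = [params_path, *ms2_paths]
    return all(est_mtime > os.path.getmtime(p) for p in inputs)


# Demonstration with throwaway files and explicit timestamps.
with tempfile.TemporaryDirectory() as tmp:
    params = os.path.join(tmp, "run.params")
    ms2 = os.path.join(tmp, "run.ms2")
    est = os.path.join(tmp, "run.est")
    for stamp, path in enumerate([params, ms2, est], start=1000):
        open(path, "w").close()
        os.utime(path, (stamp, stamp))
    fresh_before = estimate_is_fresh(est, params, [ms2])
    os.utime(ms2, (2000, 2000))  # pretend the ms2 file was edited later
    fresh_after = estimate_is_fresh(est, params, [ms2])
```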

* Maybe --part-split should generate a downramp of sizes?  It definitely
  should take into account spectra filtered out (== no work), but this
  requires reading all spectra before splitting (which takes more time).

* Fix "cleavage C-terminal mass change" issue.  Should this be interpreted as
  MONO, ! (first fragment regime), or by regime?  Look for similar problems
  elsewhere.

* Make static '[' mod exclude PCA mods.

* Code cleanup, especially in new Python code.  Maybe put some stuff in
  classes.  Could split into multiple source files.

* Mine OMSSA and MyriMatch for ideas.  Look again at X!Tandem and SEQUEST
  papers.

* Need tool to compare two runs, for regression testing purposes.

* Add refinement.  (Like X!Tandem?)

* Advanced refinement ideas.  For example, only search a locus for a hit with N
  mods if we got a hit for it with 0..N-1 mods (or maybe 0..N-2?).  Or, only
  search a locus non-tryptically (or semi-tryptically) if we got a tryptic hit
  for it.

** Investigate current SEQUEST search results to see if this looks feasible.

* Think about ways to get more id's per hour of processing time.

* Try to adapt to instrument accuracy.  Maybe start with a narrow parent mass
  range and adaptively widen it.

* Profiling to find slow spots, and for correctness?

* Heavy optimization on inner loop.  Try running from both ends simultaneously.

* Should we try to guarantee that searching is equivalent for all mass regimes,
  to make comparisons more valid?

* Eval speedup: make intensity information integer, or otherwise store it in
  log form so we can add instead of multiplying?
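
  A minimal sketch of the log-form idea, assuming the part of the score in
  question is a plain product of positive intensities (the actual greylag
  scoring function is not shown in this list).  The logs are taken once when
  the spectrum is loaded, so the inner loop only adds.

```python
import math


def product_score(intensities):
    """Reference form: multiply the matched intensities together."""
    score = 1.0
    for value in intensities:
        score *= value
    return score


def log_sum_score(log_intensities):
    """Log form: add precomputed log intensities instead of multiplying."""
    return sum(log_intensities)


# Made-up intensities, for illustration only.
intensities = [12.0, 3.5, 40.0, 7.25]
log_intensities = [math.log(value) for value in intensities]  # done once

direct = product_score(intensities)
via_logs = math.exp(log_sum_score(log_intensities))
```

  The two forms rank candidates identically because log is monotonic, so the
  final exp is only needed if the raw product must be reported.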

* Try to switch FP code to use integers instead?

* Rigorously check all values coming in from Python (at least by assert).

* We can now pre-build a peptide index if we want to.  The main utility of this
  is that it would allow us to avoid searching a spectrum against the same
  peptide multiple times (saving perhaps 30% runtime for one real database).
  Alternatively, maybe we could just generate a description of peptides to be
  masked out.
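
  A minimal sketch of such an index, assuming the candidate peptides are
  available as a mapping from locus name to a list of peptide sequences.  That
  input shape and the name build_peptide_index are hypothetical.  Each distinct
  peptide appears once, with every locus it came from, so it would only need to
  be scored once per spectrum.

```python
from collections import defaultdict


def build_peptide_index(peptides_by_locus):
    """Map each distinct peptide sequence to the loci that contain it."""
    index = defaultdict(list)
    for locus, peptides in peptides_by_locus.items():
        for peptide in peptides:
            if locus not in index[peptide]:
                index[peptide].append(locus)
    return index


# Made-up loci and peptides, for illustration only.
peptides_by_locus = {
    "locusA": ["PEPTIDEK", "AAAR"],
    "locusB": ["PEPTIDEK", "CCCK", "CCCK"],
}
index = build_peptide_index(peptides_by_locus)
total_peptides = sum(len(p) for p in peptides_by_locus.values())
unique_peptides = len(index)
```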

* Look at moving C++ code to C+ctypes, or maybe Pyrex?

* Incrementalize the whole program.  Want to be able to take an existing run
  and spend more time on it to get more results, possibly concentrating on a
  particular kind of modification.

* Try to figure out whether SEQUEST is really searching everything, or whether
  it gives up in certain cases like X!Tandem does.

* Isotopes S34 and C13 are common (4%, 1%).  Is there a good way to look for
  them?  We could look for singleton occurrences pretty cheaply using a delta
  mod type procedure.

* Splitting idea: Rather than having all parts be equal, maybe it's better for
  the parts processed first to be bigger, with smallest parts processed last,
  so that they can fill in the final gaps (leading all processors to finish at
  about the same time).  What should the split curve look like?  Linear, but
  what slope?  No split should be smaller than one spectrum (after
  filtering).
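
  A minimal sketch of a linear downramp, assuming the spectra have already
  been filtered and counted.  The function name downramp_sizes and the ratio
  parameter (roughly how much more weight the first part gets than the last)
  are hypothetical; the right slope is exactly the open question above.

```python
def downramp_sizes(n_spectra, n_parts, ratio=3.0):
    """Split n_spectra into at most n_parts sizes that shrink linearly."""
    n_parts = min(n_parts, n_spectra)  # no part smaller than one spectrum
    if n_parts <= 0:
        return []
    if n_parts == 1:
        return [n_spectra]
    # Weights fall linearly from `ratio` for the first part to 1 for the last.
    weights = [ratio - (ratio - 1.0) * i / (n_parts - 1) for i in range(n_parts)]
    total_weight = sum(weights)
    spare = n_spectra - n_parts  # every part gets one spectrum up front
    sizes = []
    cumulative = 0.0
    previous_bound = 0
    for weight in weights:
        cumulative += weight
        # Rounding cumulative boundaries keeps the sizes summing to n_spectra.
        bound = round(spare * cumulative / total_weight)
        sizes.append(1 + bound - previous_bound)
        previous_bound = bound
    return sizes


sizes = downramp_sizes(100, 4)
tiny = downramp_sizes(3, 10)
```

  Handing out one spectrum per part before distributing the rest by weight is
  what enforces the one-spectrum minimum.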

* Possible generation optimization: Figure out the maximum number of mods,
  which would be the number that could be added to the smallest, lightest
  peptide without exceeding the mass of the largest spectrum parent mass.
  Probably not worth doing?  Similarly, if we know all bags of size N are too
  large, and all deltas are positive, we can skip larger bags.
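
  A minimal sketch of that bound, assuming all deltas are positive so that the
  smallest delta gives the largest possible mod count.  The function name
  max_mod_count is hypothetical, and the parent mass tolerance is ignored
  here; a real bound would need to allow for it.

```python
import math


def max_mod_count(min_peptide_mass, max_parent_mass, min_delta):
    """Largest number of mods that could fit under the heaviest parent mass."""
    if min_delta <= 0:
        raise ValueError("bound only holds when all deltas are positive")
    room = max_parent_mass - min_peptide_mass
    if room < 0:
        return 0  # even the unmodified lightest peptide is too heavy
    return math.floor(room / min_delta)


# Made-up masses, for illustration only.
limit = max_mod_count(500.0, 600.0, 15.99)
no_room = max_mod_count(700.0, 600.0, 15.99)
```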

* Can we make the parts restartable?  If so, maybe this could be used to load
  balance, recover from crashes, etc.

* Is NOTHROW faster or slower?

* If we see a good hit for a spectrum, we could try to see if there's an
  identifiable tag.  If so, could restrict further searching to peptides with
  that tag.


TO FILE:

* What is this C+57 mod called?  Carboxyamidomethyl?  +C2H3ON!

* Is there anything we can do with neutral losses?

* Test spectrum synthesis

* Test semi-tryptic cleavage

* Double-check handling of FP arithmetic using epsilons (no ==, no strict <)
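
  A minimal sketch of epsilon-based comparisons that could stand in for == and
  strict <.  The helper names and the EPSILON value are hypothetical, and it is
  written in Python for brevity even though the checks in question may live in
  the C++ code.

```python
EPSILON = 1e-9  # hypothetical value; should suit the mass scale in use


def fp_eq(a, b, eps=EPSILON):
    """Approximate equality: use instead of a == b."""
    return abs(a - b) <= eps


def fp_le(a, b, eps=EPSILON):
    """Tolerant less-or-equal: use instead of a strict a < b boundary test."""
    return a <= b + eps


total = 0.1 + 0.2
exact_match = total == 0.3  # False because of binary rounding
tolerant_match = fp_eq(total, 0.3)
inside_bound = fp_le(total, 0.3)
```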