TODO

   1 GREYLAG TODO LIST                                               -*-outline-*-
   2
   3
   4 ==============================================================================
   5
   6 OVERALL GOALS
   7
   8 1.  Replace SEQUEST at SIMR with something at least as good.
   9 2.  Do better than SEQUEST for things that SIMR cares about.
  10 3.  Showcase Python w/C-ish inner loop code implementation strategy.
  11 4.  Try to take the best ideas from other similar programs.
  12 5.  Greylag as a pedagogical artifact and foundation for further
  13     experimentation.
  14
  15 ==============================================================================
  16
  17
  18 MILESTONE M1:
  19
  20 * Good first impression
  21 * Basic correctness
  22 * Handles at least LCQ input
  23 * Nonspecific cleavage only?
  24 * Generates SQT output, usable in our pipeline, at least for non-N15 runs
  25 * Decent performance/efficiency on our clusters
  26
  27 MILESTONE M2:
  28
  29 * Basic public website/source release/git archive
  30 * Documentation (asciidoc/man pages)
  31
  32
  33
  34 TASK QUEUE
  35
  36 * MINI-GOAL: Get something working that can be tested against MM/SEQUEST
  37 * MINI-GOAL: basic greylag-process usable on our cluster (no mods)
  38
  39
  40 * Need to figure out how to properly merge duplicate peptides within grind
  41
  42   There always needs to be a second-best score in order to have a DeltCN!
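
A minimal sketch of why merging matters for DeltCN, assuming the usual SEQUEST-style definition (best score minus second-best, divided by best). The function name and the (peptide, score) tuple layout are hypothetical, not greylag's actual data structures.

```python
def delta_cn(matches):
    """Return (best - second_best) / best over distinct peptides.

    `matches` is an iterable of (peptide_sequence, score) pairs.  Duplicate
    peptides are merged first (keeping the highest score), so that a peptide
    cannot serve as its own runner-up.  Returns None when there is no
    second-best peptide or the best score is not positive.
    """
    best_by_peptide = {}
    for peptide, score in matches:
        if score > best_by_peptide.get(peptide, float("-inf")):
            best_by_peptide[peptide] = score
    ranked = sorted(best_by_peptide.values(), reverse=True)
    if len(ranked) < 2 or ranked[0] <= 0:
        return None
    return (ranked[0] - ranked[1]) / ranked[0]
```

Without the merge step, two copies of the same top peptide would yield a DeltCN of zero; with it, a spectrum that matched only one distinct peptide has no DeltCN at all, which is the case the note above warns about.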
  43
  44 * Implement SQT output program (greylag-sqt)
  45 ** R lines to document mass regimes
  46 ** A lines for each M line to document regime (index of R) and residue mods
  47 ** Try to keep back compatibility by grepping out A/R lines
  48
  49 * Compare greylag output to SEQUEST on test-myrimatch example
  50 ** also compare against MyriMatch
  51 *** do a non-specific run with the existing binary
  52
  53 * Try to cut unneeded SWIG types
  54
  55 * Compare greylag vs MyriMatch on test-2 case
  56
  57 * Design and implement greylag master process (work manifests?)
  58
  59 * Evaluate performance differences
  60
  61 * Basic test of DTASelect compatibility
  62
  63 * Design and implement SQT fileset merger
  64
  65 * Update docstrings
  66 * Update test cases
  67
  68 * Look for more dead code to remove
  69
  70 * Redo cleavage point code (enzyme and non-specific)
  71
  72 = M1 =========================================================================
  73
  74 * Implement MM smart +3 model?
  75 ** Is it better?
  76
  77 * Try to generate a valid MyriMatch run (bombs on boost random assertion)
  78
  79 * Try the MM precursor mass adjustment--much improvement?  even a good idea?
  80
  81 * Test tolerance monotonicity
  82
  83
  84 * Examine DBValidate
  85 ** Design similar statistical evaluation
  86 ** Look at what we do here (paper)
  87
  88
  89 * Add isotope jitter feature, for Orbitrap.
  90
  91   xtandem considers one C13 if MH>1000, and one/two C13 if MH>1500.  Should we
  92   try to predict this based on the peptide sequence?  MH probably close
  93   enough.  What does MyriMatch do?
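
One way to read the xtandem rule above, as a sketch only: allow up to one C13 above MH 1000 and up to two above MH 1500, and search the parent mass at each resulting offset. The function name is hypothetical, and the reading of "one/two" as "up to two" is an assumption.

```python
C13_MASS_SHIFT = 1.0033548  # mass difference between C13 and C12, in Da

def isotope_offsets(mh):
    """Return the parent-mass offsets to try for a precursor of mass `mh`."""
    if mh > 1500:
        max_c13 = 2
    elif mh > 1000:
        max_c13 = 1
    else:
        max_c13 = 0
    # Offset 0 is always included: the monoisotopic peak itself.
    return [n * C13_MASS_SHIFT for n in range(max_c13 + 1)]
```

Each extra offset multiplies the candidate set for that spectrum, so the thresholds trade search time against missed identifications.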
  94
  95
  96 * Implement MyriMatch charge-calling algorithm?
  97
  98 * Implement MyriMatch deisotoping?
  99
 100
 101 * Figure out how to handle multiple residue mods (delta, isotope, etc)
 102
 103 * Clean up [,] (N/C-terminal) mass regime calculations
 104
 105 * Pass through the C++ code looking for counts that could conceivably overflow
 106 ** Fix or add assertions
 107
 108 * Add duplicate peptide masking optimization
 109 ** This will obviate the need to detect identical best matches at search time?
 110 ** Fix redundant peptide reporting
 111
 112
 113 * Time a real mod search vs SEQUEST (and xtandem?), is time reasonable?
 114   MAKE SURE PARAMS ARE COMPARABLE!  Ballpark correctness?
 115
 116 * Create direct DTASelect.txt output?
 117   (This seems to be sufficient to support most or all DTASelect output.)
 118
 119 * Make a tool to compare greylag vs SEQUEST results by spectrum.  Want gross
 120   statistics--how many id's are the same, different, missing, etc.  For each
 121   spectrum, want to see what each program did, and how many times the assigned
 122   locus was otherwise id'ed.
 123
 124 * Investigate identification differences between greylag and SEQUEST.
 125
 126 * Careful timing and correctness check for
 127   /n/proteomics/mkc/HsProA-Control_S100_Ti_1_H_2006-03-03_wSHUFFLED-greylag
 128
 129 * Design and implement tracing of mass regime/PCA/fixed and non-fixed
 130   deltas/etc into output file.  Try to stay compatible with xtandem.
 131
 132 * PPM error tolerances (MyriMatch doesn't implement this?)
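
A PPM tolerance scales with the mass being matched rather than being a fixed number of daltons. A minimal sketch of the conversion (the function name is hypothetical):

```python
def ppm_window(mass, ppm):
    """Return the (low, high) mass range for `mass` at a +/- `ppm` tolerance."""
    delta = mass * ppm / 1e6
    return (mass - delta, mass + delta)
```

For example, 10 ppm at a mass of 1000 is a window of plus or minus 0.01.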
 133
 134 * Make --estimate work correctly over cluster.  (Currently takes 6 hours to
 135   estimate 60--is this worthwhile?  Could we simply estimate one bag and
 136   multiply by the number of bags??)
 137
 138 * Better shuffling than current model.
 139
 140 * Useful to scale fragment tolerance by charge, too?
 141
 142 * Have --estimate generate a spectrum work count file (*.est?) that can be
 143   used by --part-split to generate evenly sized parts.  (Check that file is
 144   newer than params file and ms2 file arguments, and that all ms2 file
 145   arguments were estimated.)
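
The freshness check described above could look roughly like this. The function name and the idea of passing the params and ms2 paths as one list are assumptions; the check that every ms2 argument was actually estimated would need the file's contents and is not covered here.

```python
import os

def estimate_is_fresh(est_path, input_paths):
    """True if the estimate file exists and is newer than every input file."""
    if not os.path.exists(est_path):
        return False
    est_mtime = os.path.getmtime(est_path)
    # "Newer than" is taken strictly: an input modified at the same instant
    # as the estimate file is treated as possibly changed.
    return all(os.path.getmtime(p) < est_mtime for p in input_paths)
```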
 146
 147 * Maybe --part-split should generate a downramp of sizes?  It definitely
 148   should take into account spectra filtered out (== no work), but this
 149   requires reading all spectra before splitting (which takes more time).
 150
 151 * Fix "cleavage C-terminal mass change" issue.  Should this be interpreted as
 152   MONO, ! (first fragment regime), or by regime?  Look for similar problems
 153   elsewhere.
 154
 155 * Make static '[' mod exclude PCA mods.
 156
 157 * Code cleanup, especially in new Python code.  Maybe put some stuff in
 158   classes.  Could split into multiple source files.
 159
 160 * Mine OMSSA and MyriMatch for ideas.  Look again at X!Tandem and SEQUEST
 161   papers.
 162
 163 * Need tool to compare two runs, for regression testing purposes.
 164
 165 * Add refinement.  (like xtandem?)
 166
 167 * Advanced refinement ideas.  For example, only search a locus for a hit with N
 168   mods if we got a hit for it with 0..N-1 mods (or maybe 0..N-2?).  Or, only
 169   search a locus non-tryptically (or semi-tryptically) if we got a tryptic hit
 170   for it.
 171
 172 ** Investigate current SEQUEST search results to see if this looks feasible.
 173
 174 * Think about ways to get more id's per hour of processing time.
 175
 176 * Try to adapt to instrument accuracy.  Maybe start with a narrow parent mass
 177   range and adaptively widen it.
 178
 179 * Profiling to find slow spots, and for correctness?
 180
 181 * Heavy optimization on inner loop.  Try running from both ends simultaneously.
 182
 183 * Should we try to guarantee that searching is equivalent for all mass regimes,
 184   to make comparisons more valid?
 185
 186 * Eval speedup: make intensity information integer, or otherwise store it in
 187   log form so we can add instead of multiplying?
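
A toy illustration of the log-form idea, assuming the score is a plain product of intensities (which may not match the real evaluation code): taking logs once at load time turns the inner-loop multiply into an add.

```python
import math

def product_score(intensities):
    """Multiply intensities together (the straightforward form)."""
    score = 1.0
    for x in intensities:
        score *= x
    return score

def log_sum_score(log_intensities):
    """Add pre-computed log intensities; equals the log of the product."""
    return sum(log_intensities)
```

The two forms rank candidates identically because the logarithm is monotonic, so the final exponentiation can be skipped when only the ordering matters.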
 188
 189 * Try to switch FP code to use integers instead?
 190
 191 * Rigorously check all values coming in from Python (at least by assert).
 192
 193 * We can now pre-build a peptide index if we want to.  The main utility of this
 194   is that it would allow us to avoid searching a spectrum against the same
 195   peptide multiple times (saving perhaps 30% runtime for one real database).
 196   Alternatively, maybe we could just generate a description of peptides to be
 197   masked out.
 198
 199 * Look at moving C++ code to C+ctypes, or maybe pyrex?
 200
 201 * Incrementalize the whole program.  Want to be able to take an existing run
 202   and spend more time on it to get more results, possibly concentrating on a
 203   particular kind of modification.
 204
 205 * Try to figure out whether SEQUEST is really searching everything, or whether
 206   it gives up in certain cases like X!Tandem does.
 207
 208 * Isotopes S34 and C13 are common (4%, 1%).  Is there a good way to look for
 209   them?  We could look for singleton occurrences pretty cheaply using a delta
 210   mod type procedure.
 211
 212 * Splitting idea: Rather than having all parts be equal, maybe it's better for
 213   the parts processed first to be bigger, with smallest parts processed last,
 214   so that they can fill in the final gaps (leading all processors to finish at
 215   about the same time).  What should the split curve look like?  Linear, but
 216   what slope?  No split should be smaller than one spectrum (after
 217   filtering).
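
A sketch of a linear downramp, to make the "what slope?" question concrete. The `ratio` parameter (first part size over last part size) is a hypothetical knob, and spectrum counts are assumed to be taken after filtering.

```python
def downramp_sizes(n_spectra, n_parts, ratio=3.0):
    """Split n_spectra into n_parts sizes that shrink roughly linearly.

    Every part gets at least one spectrum; the sizes always sum to n_spectra.
    """
    if n_parts < 1 or n_spectra < n_parts:
        raise ValueError("need at least one spectrum per part")
    if n_parts == 1:
        return [n_spectra]
    # Linear weights running from `ratio` (first part) down to 1 (last part).
    step = (ratio - 1.0) / (n_parts - 1)
    weights = [ratio - i * step for i in range(n_parts)]
    total_weight = sum(weights)
    # Reserve one spectrum per part, then share the rest by cumulative weight
    # so that rounding can never lose or invent a spectrum.
    spare = n_spectra - n_parts
    sizes = []
    assigned = 0
    cumulative = 0.0
    for i, w in enumerate(weights):
        cumulative += w
        if i == n_parts - 1:
            boundary = spare
        else:
            boundary = min(spare, int(round(spare * cumulative / total_weight)))
        sizes.append(1 + boundary - assigned)
        assigned = boundary
    return sizes
```

Splitting by estimated work per spectrum rather than by spectrum count would follow the same shape, with the work estimates replacing the unit counts.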
 218
 219 * Look carefully at the statistics code.  Problems?
 220
 221 * Implement "cyclic permutation" of xtandem.
 222
 223 * Possible generation optimization: Figure out the maximum number of mods,
 224   which would be the number that could be added to the smallest, lightest
 225   peptide without exceeding the mass of the largest spectrum parent mass.
 226   Probably not worth doing?  Similarly, if we know all bags of size N are too
 227   large, and all deltas are positive, we can skip larger bags.
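
The bound described above is a one-line calculation. This sketch assumes all deltas are positive and ignores the parent mass tolerance; the function and parameter names are hypothetical.

```python
def max_mod_count(lightest_peptide_mass, largest_parent_mass, smallest_delta):
    """Upper bound on the number of mods any peptide could carry."""
    if smallest_delta <= 0:
        raise ValueError("bound only holds when all deltas are positive")
    headroom = largest_parent_mass - lightest_peptide_mass
    if headroom < 0:
        return 0
    # The smallest delta packs the most mods into the available headroom.
    return int(headroom // smallest_delta)
```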
 228
 229 * Can we make the parts restartable?  If so, maybe this could be used to load
 230   balance, recover from crashes, etc.
 231
 232 * Is NOTHROW faster or slower?
 233
 234 * If we see a good hit for a spectrum, we could try to see if there's an
 235   identifiable tag.  If so, could restrict further searching to peptides with
 236   that tag.
 237
 238 * Motif-based differential deltas (like xtandem).
 239
 240
 241
 242 TO FILE:
 243
 244 * What is this C+57 mod called?  Carboxyamidomethyl?  +C2H3ON!
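
A quick arithmetic check of the composition in the note above: summing standard monoisotopic masses for C2H3NO gives about 57.0215, consistent with the nominal +57. This mod is usually listed as carbamidomethyl. The table and function below are illustrative only.

```python
# Monoisotopic atomic masses in Da.
MONOISOTOPIC_MASS = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.0030740,
    "O": 15.9949146,
}

def composition_mass(composition):
    """Sum monoisotopic masses for a {element: count} composition."""
    return sum(MONOISOTOPIC_MASS[el] * n for el, n in composition.items())

C57_MOD_MASS = composition_mass({"C": 2, "H": 3, "N": 1, "O": 1})
```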
 245
 246 * Is there anything we can do with neutral losses?
 247
 248 * test spectrum synthesis
 249
 250 * test semi-tryptic cleavage
 251
 252 * double-check handling of FP arithmetic using epsilons (no ==, no strict <)
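
What "no ==, no strict <" might look like in practice, as a sketch: every floating-point comparison goes through a helper that allows a small tolerance. The epsilon value and helper names are assumptions; the right tolerance depends on the magnitudes being compared.

```python
EPSILON = 1e-9  # hypothetical absolute tolerance

def approx_equal(a, b, eps=EPSILON):
    """Use instead of a == b."""
    return abs(a - b) <= eps

def definitely_less(a, b, eps=EPSILON):
    """Use instead of a < b: true only if a is below b by more than eps."""
    return a < b - eps

def less_or_equal(a, b, eps=EPSILON):
    """Use instead of a <= b: tolerates a exceeding b by up to eps."""
    return a <= b + eps
```

For instance, 0.1 + 0.2 is not exactly 0.3 in binary floating point, but `approx_equal(0.1 + 0.2, 0.3)` holds.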