GREYLAG TODO LIST                                               -*-outline-*-


==============================================================================

OVERALL GOALS

1.  Replace SEQUEST at SIMR with something at least as good.
2.  Do better than SEQUEST for things that SIMR cares about.
3.  Showcase Python w/C-ish inner loop code implementation strategy.
4.  Try to take the best ideas from other similar programs.
5.  Greylag as a pedagogical artifact and foundation for further
    experimentation.

==============================================================================


MILESTONE M1:

* Good first impression
* Basic correctness
* Handles at least LCQ input
* Handles nonspecific cleavage
* Generates SQT output, usable in our pipeline, at least for non-N15 runs
* Decent performance/efficiency on our clusters
* Basic how-to-use-it documentation

MILESTONE M2:

* Handles tryptic/etc cleavage
* Basic N15 handling
* Basic public website/source release/git archive
* Documentation (asciidoc/man pages)



TASK QUEUE

* Clean up and check in working part of test suite

* Check that greylag-* errors if no input files specified

* Look for (user-visible) changes that are easy to make now but harder later...

* Add missing-feature matrix (wrt other programs, etc)

* Create basic documentation in reST format
** Need HTML (web and standalone), PDF?
*** index page (what is it?  why should you care?  links)
*** user guide
**** installation
**** cluster use
*** theory of operation document
**** basics of id search
**** explanation of greylag algorithm
**** greylag usage scenarios (solo, cluster)
**** how does MudPIT work?
**** design choices, technology rationale (upsides/downsides)
** Makefile, install-to-web targets


* Generate a set of sample .conf files
** One long, commented template file?

* Write release email
** What it does and doesn't yet do
** How to use it


* Try DTASelect on greylag SQT files
** plus mod case
** sqt-index works?
** spectrum viewing works?
** astoria?


= M1 =========================================================================

* Rework test suite
** check in for complete release



* Try greylag-validate refinement

* Implement semi-tryptic cleavage
** Reputed to be almost as good as tryptic (valid peptide count)?

* Debian/Ubuntu package?

* greylag-validate parent tolerance analysis

* Add docstring for every function

* Need some greylag-merge test cases
** Test RAM requirements on large files

* Basic optimization (just a quick further look for easy speedups)
** callgrind
** cachegrind

* Change the way mod limit works?  ({1,4} feature?)


* Update docstrings

* Calculate Ion%?

* Further test case updates/adds
** enzymatic cleavage

* Use SHA1 digests to keep files in sync?
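
  One way the digest idea might look, as a minimal sketch: record a SHA-1
  digest per file, then later report the files whose contents have changed.
  The function names and the dict-of-digests representation are hypothetical,
  not existing greylag code.

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large ms2/SQT files need not fit in RAM.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stale_files(recorded, paths):
    """Return the paths whose current digest differs from the recorded one.

    `recorded` maps path -> hex digest (a hypothetical bookkeeping format).
    """
    return [p for p in paths if recorded.get(p) != sha1_of_file(p)]
```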

* Check on score stability

* Add A1S2D1C3A1S2D1F4 marking form for isotope regimes?

* Register copyright?


* More testing of SIMR cases
** no mods
** single mods
** multiple mods
** multiple regimes

* Think about how to implement LAFs PTM pipeline


* Try to test against MyriMatch (results should be similar)


* How should greylag-validate handle I/L, P/V, etc?


= M2 =========================================================================


* Evaluate performance differences vs X!Tandem?

* Try the MM smart +3 model--much improvement?

* Try the MM precursor mass adjustment--much improvement?  even a good idea?

* Add isotope jitter feature, for Orbitrap.

  X!Tandem considers one C13 if MH>1000, and one or two C13 if MH>1500.
  Should we try to predict this based on the peptide sequence?  MH probably
  close enough.  What does MyriMatch do?
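
  A minimal sketch of the MH-threshold rule described above.  The thresholds
  are taken from this note as stated and have not been checked against the
  X!Tandem source; the function names are hypothetical.

```python
C13_MASS_SHIFT = 1.0033548  # mass difference between C13 and C12, in Da


def c13_counts_to_try(mh):
    """Numbers of C13 substitutions to consider for a parent mass MH.

    Thresholds follow the note above (assumed, not verified).
    """
    if mh > 1500:
        return [0, 1, 2]
    if mh > 1000:
        return [0, 1]
    return [0]


def jittered_parent_masses(mh):
    """Candidate monoisotopic MH values, assuming the observed MH may
    actually be an isotope peak containing 0..n C13 atoms."""
    return [mh - n * C13_MASS_SHIFT for n in c13_counts_to_try(mh)]
```

  Each candidate mass would then be searched as if it were the measured
  parent mass, at roughly proportional extra cost.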


* Implement MyriMatch charge-calling algorithm?

* Implement MyriMatch deisotoping?


* Investigate identification differences between greylag and SEQUEST/MM.

* Clean up [,] (N/C-terminal) mass regime calculations

* Pass through the C++ code looking for counts that could conceivably overflow
** Fix or add assertions

* Add duplicate peptide masking optimization
** Problem: shuffled versions generally not identical.
*** Limits potential speedup to 25-30%
** This will obviate the need to detect identical best matches at search time?
** Fix redundant peptide reporting

* Make a tool to compare greylag vs SEQUEST results by spectrum.

  Want gross statistics--how many id's are the same, different, missing, etc.
  For each spectrum, want to see what each program did, and how many times the
  assigned locus was otherwise id'ed.
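
  The gross-statistics part could start from something like the sketch below.
  It assumes each program's results have already been reduced to a dict
  mapping spectrum name to top-ranked peptide; that representation and the
  function name are hypothetical, and SQT parsing is not shown.

```python
from collections import Counter


def compare_ids(greylag_ids, sequest_ids):
    """Tally how the two programs' top identifications relate, per spectrum.

    Both arguments map spectrum name -> peptide sequence (assumed format).
    """
    stats = Counter()
    for spectrum in set(greylag_ids) | set(sequest_ids):
        g = greylag_ids.get(spectrum)
        s = sequest_ids.get(spectrum)
        if g is None:
            stats["missing in greylag"] += 1
        elif s is None:
            stats["missing in SEQUEST"] += 1
        elif g == s:
            stats["same"] += 1
        else:
            stats["different"] += 1
    return stats
```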

* PPM error tolerances (MyriMatch doesn't implement this?)
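
  For reference, a PPM tolerance is just an absolute tolerance that scales
  with mass.  A minimal sketch, with hypothetical function names:

```python
def ppm_window(mass, ppm):
    """Return the (low, high) mass bounds for a tolerance given in PPM."""
    delta = mass * ppm / 1e6
    return mass - delta, mass + delta


def within_ppm(observed, theoretical, ppm):
    """True if the observed mass is within `ppm` of the theoretical mass."""
    return abs(observed - theoretical) <= theoretical * ppm / 1e6
```

  For example, 10 PPM at a mass of 1000 corresponds to a window of +/- 0.01.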

* Make --estimate work correctly over cluster.  (Currently takes 6 hours to
  estimate 60--is this worthwhile?  Could we simply estimate one bag and
  multiply by the number of bags?)

* Better shuffling than current model.

* Useful to scale fragment tolerance by charge, too?

* Have --estimate generate a spectrum work count file (*.est?) that can be
  used by --part-split to generate evenly sized parts.  (Check that file is
  newer than params file and ms2 file arguments, and that all ms2 file
  arguments were estimated.)
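
  A sketch of the two pieces this item needs: a freshness check on the
  estimate file, and a greedy split of spectra into parts of roughly equal
  estimated work.  The function names and the work-count representation are
  hypothetical; the greedy largest-first heuristic is one reasonable choice,
  not necessarily what --part-split should end up using.

```python
import heapq
import os


def estimate_is_current(est_path, input_paths):
    """True if the estimate file exists and is newer than every input file
    (params file and ms2 files)."""
    if not os.path.exists(est_path):
        return False
    est_mtime = os.path.getmtime(est_path)
    return all(os.path.getmtime(p) < est_mtime for p in input_paths)


def split_evenly(work_counts, part_count):
    """Greedily assign spectra to parts, largest work first, always adding
    to the currently lightest part.

    `work_counts` maps spectrum id -> estimated work (assumed format).
    Returns a list of `part_count` lists of spectrum ids.
    """
    heap = [(0, i) for i in range(part_count)]  # (current load, part index)
    parts = [[] for _ in range(part_count)]
    for spectrum, work in sorted(work_counts.items(), key=lambda kv: -kv[1]):
        load, i = heapq.heappop(heap)
        parts[i].append(spectrum)
        heapq.heappush(heap, (load + work, i))
    return parts
```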

* Fix "cleavage C-terminal mass change" issue.  Should this be interpreted as
  MONO, ! (first fragment regime), or by regime?  Look for similar problems
  elsewhere.

* Make static '[' mod exclude PCA mods.

* Mine OMSSA and MyriMatch for ideas.  Look again at X!Tandem and SEQUEST
  papers.

* Add refinement.  (like X!Tandem?)

* Advanced refinement ideas.  For example, only search a locus for a hit with N
  mods if we got a hit for it with 0..N-1 mods (or maybe 0..N-2?).  Or, only
  search a locus non-tryptically (or semi-tryptically) if we got a tryptic hit
  for it.

** Investigate current SEQUEST search results to see if this looks feasible.

* Think about ways to get more id's per hour of processing time.

* Try to adapt to instrument accuracy.  Maybe start with a narrow parent mass
  range and adaptively widen it.

* Profiling to find slow spots, and for correctness?

* Heavy optimization on inner loop.
** Try running from both ends simultaneously.
** Watch cache usage.

* Rigorously check all values coming in from Python (at least by assert).

* Incrementalize the whole program.  Want to be able to take an existing run
  and spend more time on it to get more results, possibly concentrating on a
  particular kind of modification.

* Try to figure out whether SEQUEST is really searching everything, or whether
  it gives up in certain cases like X!Tandem does.

* Isotopes S34 and C13 are common (4%, 1%).  Is there a good way to look for
  them?  We could look for singleton occurrences pretty cheaply using a delta
  mod type procedure.

* Could try switching FP code to use integers instead (but ugh)

* Is there anything we can do with neutral losses?

* Need a way to test spectrum synthesis?