GREYLAG TODO LIST                                               -*-outline-*-


==============================================================================

OVERALL GOALS

1.  Replace SEQUEST at SIMR with something at least as good.
2.  Do better than SEQUEST for things that SIMR cares about.
3.  Showcase the implementation strategy of Python with C-ish inner-loop code.
4.  Try to take the best ideas from other similar programs.
5.  Make greylag a pedagogical artifact and a foundation for further
    experimentation.

==============================================================================


MILESTONE M1:

* Good first impression
* Basic correctness
* Handles at least LCQ input
* Handles nonspecific cleavage
* Generates SQT output, usable in our pipeline, at least for non-N15 runs
* Decent performance/efficiency on our clusters
* Basic how-to-use-it documentation

MILESTONE M2:

* Handles tryptic/etc cleavage
* Basic N15 handling
* Basic public website/source release/git archive
* Documentation (asciidoc/man pages)


TASK QUEUE

* Clean up and check in working part of test suite

* Check that greylag-* errors if no input files specified

* Look for (user-visible) changes that are easy to make now but harder later...

* Add missing-feature matrix (wrt other programs, etc)

* Create basic documentation in reST format
** Need HTML (web and standalone), PDF?
*** index page (what is it?  why should you care?  links)
*** user guide
**** installation
**** cluster use
*** theory of operation document
**** basics of id search
**** explanation of greylag algorithm
**** greylag usage scenarios (solo, cluster)
**** how does MudPIT work?
**** design choices, technology rationale (upsides/downsides)
** Makefile, install-to-web targets


* Generate a set of sample .conf files
** One long, commented template file?

* Write release email
** What it does and doesn't yet do
** How to use it



= M1 =========================================================================

* Rework test suite
** check in for complete release



* Implement semi-tryptic cleavage
** Reputed to be almost as good as tryptic (valid peptide count)?

* Debian/Ubuntu package?

* Add docstring for every function

* Need some greylag-merge test cases
** Test RAM requirements on large files

* Basic optimization (just a quick further look for easy speedups)
** callgrind
** cachegrind

* Change the way mod limit works?  ({1,4} feature?)


* Update docstrings

* Calculate Ion%?

* Further test case updates/adds
** enzymatic cleavage

* Use SHA1 digests to keep files in sync?
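
  A minimal sketch of how a digest check might look, assuming the goal is to
  notice when an input file (params, database, ms2) has changed since an
  earlier step recorded its digest.  The names sha1_of_file and stale_files
  are hypothetical, not existing greylag functions.

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stale_files(recorded, paths):
    """Return the paths whose current digest differs from the recorded one.

    recorded is a dict mapping path -> hex digest saved by an earlier step
    (hypothetical bookkeeping; where it would be stored is left open).
    """
    return [p for p in paths if recorded.get(p) != sha1_of_file(p)]
```

  Reading in chunks keeps memory use flat even for large ms2 files.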

* Check on score stability

* Add A1S2D1C3A1S2D1F4 marking form for isotope regimes?

* Register copyright?


* More testing of SIMR cases
** no mods
** single mods
** multiple mods
** multiple regimes

* Try to test against MyriMatch (results should be similar)


= M2 =========================================================================


* Evaluate performance differences vs X!Tandem?

* Try the MM smart +3 model--much improvement?

* Try the MM precursor mass adjustment--much improvement?  even a good idea?

* Add isotope jitter feature, for Orbitrap.

  X!Tandem considers one C13 if MH>1000, and one/two C13 if MH>1500.  Should
  we try to predict this based on the peptide sequence?  MH probably close
  enough.  What does MyriMatch do?

  New: not clear whether this is really productive, as opposed to just
  searching a wider window (looking for 16/17 loss might be, though)
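
  The MH thresholds quoted above could be sketched roughly as follows.  This
  assumes the observed parent peak may be a C13 isotope peak, so candidate
  monoisotopic masses are obtained by subtracting whole C13-C12 mass
  differences.  The function names are hypothetical and the thresholds are
  just the ones from the note, not verified against X!Tandem itself.

```python
C13_DELTA = 1.0033548  # mass difference between C13 and C12, in Da


def c13_count_limit(mh):
    """Maximum number of C13 atoms to consider for a parent mass MH."""
    if mh > 1500:
        return 2
    if mh > 1000:
        return 1
    return 0


def candidate_parent_masses(observed_mh):
    """Candidate monoisotopic MH values for an observed parent mass.

    Assumption: the observed peak may carry 0..limit C13 atoms, so each
    candidate removes that many C13-C12 mass differences.
    """
    limit = c13_count_limit(observed_mh)
    return [observed_mh - k * C13_DELTA for k in range(limit + 1)]
```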

* Implement MyriMatch charge-calling algorithm?

* Implement MyriMatch deisotoping?


* Investigate identification differences between greylag and SEQUEST/MM.

* Pass through the C++ code looking for counts that could conceivably overflow
** Fix or add assertions

* Add duplicate peptide masking optimization
** Problem: shuffled versions generally not identical.
*** Limits potential speedup to 25-30%
** This will obviate the need to detect identical best matches at search time?
** Fix redundant peptide reporting

* Make a tool to compare greylag vs SEQUEST results by spectrum.

  Want gross statistics--how many id's are the same, different, missing, etc.
  For each spectrum, want to see what each program did, and how many times the
  assigned locus was otherwise id'ed.
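
  A rough sketch of the gross statistics, assuming each program's results
  have already been parsed into a dict mapping spectrum name to a
  (peptide, locus) pair for its best id.  The parsing step and the names
  compare_ids and locus_counts are hypothetical.

```python
from collections import Counter


def compare_ids(greylag_ids, sequest_ids):
    """Count spectra whose best id is the same, different, or missing."""
    stats = Counter()
    for spectrum in sorted(set(greylag_ids) | set(sequest_ids)):
        g = greylag_ids.get(spectrum)
        s = sequest_ids.get(spectrum)
        if g is None:
            stats["sequest_only"] += 1
        elif s is None:
            stats["greylag_only"] += 1
        elif g[0] == s[0]:
            stats["same"] += 1
        else:
            stats["different"] += 1
    return stats


def locus_counts(ids):
    """Count how many spectra were assigned to each locus."""
    return Counter(locus for _peptide, locus in ids.values())
```

  With the per-locus counts in hand, a per-spectrum report could show, next
  to each disagreement, how often each assigned locus was id'ed elsewhere.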

* PPM error tolerances (MyriMatch doesn't implement this?)

  Not obvious that this is actually helpful.
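
  For reference, a PPM tolerance is just a mass window whose width scales
  with the mass itself.  A minimal sketch, with hypothetical function names:

```python
def ppm_window(mass, ppm):
    """Return the (low, high) mass window for a tolerance given in ppm."""
    delta = mass * ppm / 1e6
    return mass - delta, mass + delta


def within_ppm(observed, theoretical, ppm):
    """True if observed lies within ppm parts-per-million of theoretical."""
    return abs(observed - theoretical) <= theoretical * ppm / 1e6
```

  At 10 ppm the half-width is 0.01 Da at mass 1000 and 0.02 Da at mass 2000,
  which is the main difference from a fixed absolute tolerance.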

* Make --estimate work correctly over cluster.  (Currently takes 6 hours to
  estimate 60--is this worthwhile?  Could we simply estimate one bag and
  multiply by the number of bags??)

* Better shuffling than current model.

* Useful to scale fragment tolerance by charge, too?

* Have --estimate generate a spectrum work count file (*.est?) that can be
  used by --part-split to generate evenly sized parts.  (Check that file is
  newer than params file and ms2 file arguments, and that all ms2 file
  arguments were estimated.)
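
  One possible shape for this, assuming the estimate file boils down to a
  mapping from spectrum id to an estimated work count, and that parts need
  not be contiguous runs of spectra.  The greedy heaviest-first assignment
  and the names split_evenly and estimate_is_fresh are illustrative choices,
  not the existing --part-split behavior.

```python
import heapq
import os


def split_evenly(work, n_parts):
    """Assign spectra to n_parts so total estimated work is roughly even.

    work maps spectrum id -> estimated work.  Heaviest spectra are placed
    first, each into whichever part currently has the least work.
    """
    heap = [(0, i) for i in range(n_parts)]  # (total work so far, part index)
    parts = [[] for _ in range(n_parts)]
    for sid, w in sorted(work.items(), key=lambda kv: (-kv[1], kv[0])):
        total, i = heapq.heappop(heap)
        parts[i].append(sid)
        heapq.heappush(heap, (total + w, i))
    return parts


def estimate_is_fresh(est_path, input_paths):
    """True if the estimate file is newer than every params/ms2 input."""
    est_mtime = os.path.getmtime(est_path)
    return all(est_mtime > os.path.getmtime(p) for p in input_paths)
```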

* Make static '[' mod exclude PCA mods.

* Mine OMSSA and MyriMatch for ideas.  Look again at X!Tandem and SEQUEST
  papers.

* Add refinement.  (like X!Tandem?)

* Advanced refinement ideas.  For example, only search a locus for a hit with N
  mods if we got a hit for it with 0..N-1 mods (or maybe 0..N-2?).  Or, only
  search a locus non-tryptically (or semi-tryptically) if we got a tryptic hit
  for it.

** Investigate current SEQUEST search results to see if this looks feasible.
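
  The mod-count idea above might be sketched like this, reading "a hit with
  0..N-1 mods" as "any hit with fewer than N mods".  The search callback is
  a hypothetical stand-in for the real search; nothing here reflects
  existing greylag code.

```python
def refine_by_mod_count(loci, max_mods, search):
    """Search each locus with n mods only if it already hit with fewer.

    search(locus, n_mods) -> bool is a hypothetical callback that reports
    whether the locus got a hit with exactly n_mods modifications.
    Returns a dict mapping locus -> list of mod counts that produced hits.
    """
    hits = {}
    for n in range(max_mods + 1):
        if n == 0:
            candidates = sorted(loci)  # every locus gets the unmodified pass
        else:
            candidates = sorted(l for l in loci if hits.get(l))
        for locus in candidates:
            if search(locus, n):
                hits.setdefault(locus, []).append(n)
    return hits
```

  The cost is that a locus whose only hit needs a mod is never found, which
  is why checking feasibility against existing results comes first.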

* Think about ways to get more id's per hour of processing time.

* Try to adapt to instrument accuracy.  Maybe start with a narrow parent mass
  range and adaptively widen it.

* Profiling to find slow spots, and for correctness?

* Heavy optimization on inner loop.
** Try running from both ends simultaneously.
** Watch cache usage.

* Rigorously check all values coming in from Python (at least by assert).

* Incrementalize the whole program.  Want to be able to take an existing run
  and spend more time on it to get more results, possibly concentrating on a
  particular kind of modification.

* Try to figure out whether SEQUEST is really searching everything, or whether
  it gives up in certain cases like X!Tandem does.

* Isotopes S34 and C13 are common (4%, 1%).  Is there a good way to look for
  them?  We could look for singleton occurrences pretty cheaply using a delta
  mod type procedure.
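
  A minimal sketch of the singleton case, assuming it amounts to trying the
  monoisotopic peptide mass plus one isotope mass difference.  Treating C and
  M as the only sulfur-containing residues and the function name are
  assumptions made for illustration.

```python
C13_DELTA = 1.0033548  # C13 - C12 mass difference, in Da
S34_DELTA = 1.9957959  # S34 - S32 mass difference, in Da


def singleton_isotope_masses(peptide, mono_mass):
    """Candidate masses for a peptide carrying at most one heavy isotope.

    Every peptide contains carbon, so the C13 variant always applies; the
    S34 variant applies only if the sequence has a sulfur-bearing residue
    (assumed here to be C or M).
    """
    masses = {"none": mono_mass, "C13": mono_mass + C13_DELTA}
    if any(residue in "CM" for residue in peptide):
        masses["S34"] = mono_mass + S34_DELTA
    return masses
```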

* Could try switching FP code to use integers instead (but ugh)

* Is there anything we can do with neutral losses?

* Need a way to test spectrum synthesis?