TODO

   1 GREYLAG TODO LIST            -*- mode: text;-*-
   2
   3         $Id$
   4
   5 OVERALL GOALS
   6
   7 1.  Replace SEQUEST at SIMR with something at least as good.
   8 2.  Do better than SEQUEST for things that SIMR cares about.
   9 3.  Showcase Python w/C-ish inner loop code strategy.
  10 4.  Do better than SEQUEST generally.
  11 5.  Greylag as a pedagogical artifact and foundation for further
  12     experimentation.
  13
  14
  15
  16 SPECIFIC TASKS
  17
  18
  19 - Move iteration over sequences into C.  (generate_cleavage_points and
  20   search_all are killing us)  Can we still generate the cleavage points (once)
  21   in Python and store them on the C side?  (store as deltas?)
  22
   23 - Time a real no-mod search vs SEQUEST (and xtandem?).  Is the time reasonable?
  24   MAKE SURE PARAMS ARE COMPARABLE!
  25
   26 - Time a real mod search vs SEQUEST (and xtandem?).  Is the time reasonable?
  27   MAKE SURE PARAMS ARE COMPARABLE!
  28
  29 - Make a tool to compare greylag vs SEQUEST results by spectrum.  Want gross
  30   statistics--how many id's are the same, different, missing, etc.  For each
  31   spectrum, want to see what each program did, and how many times the assigned
  32   locus was otherwise id'ed.
  33
  34 - Investigate identification differences between greylag and SEQUEST.
  35
  36 - Design and implement tracing of mass regime/PCA/fixed and non-fixed
  37   deltas/etc into output file.  Try to stay compatible with xtandem.
  38
  39 - Add isotope jitter feature, for Orbitrap.  (xtandem considers one C13 if
  40   MH>1000, and one/two C13 if MH>1500.)  Should we try to predict this based
  41   on the peptide sequence?  MH probably close enough.
  42
  43 - Add PPM feature.
  44
  45 - Code cleanup, especially in new Python code.  Maybe put some stuff in
  46   classes.  Could split into multiple source files.
  47
  48 - Create DTASelect.txt output, either directly or by conversion from XML
  49   output.
  50
  51 - Fix "cleavage C-terminal mass change" issue.  Should this be interpreted as
   52   MONO, ! (first fragment regime), or by regime?  Look for similar problems
  53   elsewhere.
  54
  55 - Mine OMSSA and myrimatch for ideas.  Look again at SEQUEST papers.
  56
  57 - Make static '[' mod exclude PCA mods.
  58
  59 - Need tool to compare two runs, for regression testing purposes.
  60
  61 - Look at moving C++ code to C+ctypes, or maybe pyrex?
  62
  63 - Add refinement.  (like xtandem?)
  64
  65 - Advanced refinement ideas.  For example, only search a locus for a hit with N
  66   mods if we got a hit for it with 0..N-1 mods (or maybe 0..N-2?).  Or, only
  67   search a locus non-tryptically (or semi-tryptically) if we got a tryptic hit
  68   for it.
  69
  70   - Investigate current SEQUEST search results to see if this looks feasible.
  71
  72 - Think about ways to get more id's per hour of processing time.
  73
  74 - Try to adapt to instrument accuracy.  Maybe start with a narrow parent mass
  75   range and adaptively widen it.
  76
  77 - Profiling to find slow spots, and for correctness?
  78
  79 - Try using the Intel compiler?
  80
  81 - Heavy optimization on inner loop.  Try running from both ends
  82   simultaneously.
  83
  84 - Should we try to guarantee that searching is equivalent for all mass regimes,
  85   to make comparisons more valid?
  86
  87 - Eval speedup: make intensity information integer, or otherwise store it in
  88   log form so we can add instead of multiplying?
  89
  90 - Try to switch FP code to use integers instead?
  91
  92 - Rigorously check all values coming in from Python (at least by assert).
  93
  94 - We can now pre-build a peptide index if we want to.  The main utility of this
  95   is that it would allow us to avoid searching a spectrum against the same
  96   peptide multiple times (saving perhaps 30% runtime for one real database).
  97   Alternatively, maybe we could just generate a description of peptides to be
  98   masked out.
  99
 100 - Incrementalize the whole program.  Want to be able to take an existing run
 101   and spend more time on it to get more results, possibly concentrating on a
 102   particular kind of modification.
 103
 104 - Try to figure out whether SEQUEST is really searching everything, or whether
 105   it gives up in certain cases like X!Tandem does.
 106
  107 - Isotopes S34 and C13 are common (4%, 1%).  Is there a good way to look for
 108   them?  We could look for singleton occurrences pretty cheaply using a delta
 109   mod type procedure.
 110
  111 - Splitting idea: Rather than having all parts be equal, maybe it's better for
 112   the parts processed first to be bigger, with smallest parts processed last,
 113   so that they can fill in the final gaps (leading all processors to finish at
 114   about the same time).  What should the split curve look like?  Linear, but
 115   what slope?  No split should be smaller than one spectrum (after
 116   filtering).
 117
 118 - Look carefully at the statistics code.  Problems?
 119
 120 - Implement "cyclic permutation" of xtandem.
 121
 122 - Possible generation optimization: Figure out the maximum number of mods,
 123   which would be the number that could be added to the smallest, lightest
 124   peptide without exceeding the mass of the largest spectrum parent mass.
 125   Probably not worth doing?  Similarly, if we know all bags of size N are too
 126   large, and all deltas are positive, we can skip larger bags.
 127
 128 - Can we make the parts restartable?  If so, maybe this could be used to load
 129   balance, recover from crashes, etc.
 130
 131 - Is NOTHROW faster or slower?
 132
 133 - If we see a good hit for a spectrum, we could try to see if there's an
 134   identifiable tag.  If so, could restrict further searching to peptides with
 135   that tag.
 136
 137 - Motif-based differential deltas (like xtandem).
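
Sketch for the first task in this list (generate the cleavage points once in
Python and store them as deltas for the C side).  This is only an illustration
under assumptions: it uses a trypsin-like rule (cleave after K or R unless the
next residue is P), which this list does not specify, and the names
tryptic_cleavage_points, to_deltas and from_deltas are hypothetical, not
greylag functions.  The array module is used only to get a compact buffer of
unsigned 16-bit values that could be handed across.

```python
from array import array
from itertools import accumulate


def tryptic_cleavage_points(sequence):
    """Return cleavage positions, including both ends of the sequence.

    ASSUMPTION: trypsin-like rule (after K or R, not before P).
    """
    points = [0]
    for i in range(1, len(sequence)):
        if sequence[i - 1] in "KR" and sequence[i] != "P":
            points.append(i)
    points.append(len(sequence))
    return points


def to_deltas(points):
    """Encode sorted positions as differences from the previous position.

    The first delta is the first position itself.  array('H') raises
    OverflowError if any gap exceeds the unsigned short range.
    """
    previous = [0] + points[:-1]
    return array("H", (b - a for a, b in zip(previous, points)))


def from_deltas(deltas):
    """Recover the absolute positions with a running sum."""
    return list(accumulate(deltas))


points = tryptic_cleavage_points("MKRPAKLR")
deltas = to_deltas(points)
restored = from_deltas(deltas)
```

Deltas stay small even for long sequences, so a narrow integer type is
usually enough; whether that saves anything worthwhile would need timing.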
 138
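Sketch for the isotope jitter and PPM items above, combined into one parent
mass test.  The C13 thresholds are the xtandem rule as quoted in this list
(one C13 if MH>1000, one or two if MH>1500).  Applying the rule to the
theoretical MH rather than the observed one is an assumption, as are the
function names, which are hypothetical.

```python
C13_DELTA = 1.0033548  # mass difference between C13 and C12, in Da


def isotope_offsets(mh):
    """Number of C13 atoms to consider for a given MH (rule quoted above)."""
    if mh > 1500:
        return (0, 1, 2)
    if mh > 1000:
        return (0, 1)
    return (0,)


def parent_mass_matches(observed_mh, theoretical_mh, ppm):
    """True if observed_mh is within ppm of theoretical_mh or of one of
    its allowed C13 isotope peaks.

    ASSUMPTION: the MH threshold is applied to the theoretical mass.
    """
    for k in isotope_offsets(theoretical_mh):
        expected = theoretical_mh + k * C13_DELTA
        tolerance = expected * ppm * 1e-6
        if abs(observed_mh - expected) <= tolerance:
            return True
    return False
```

A PPM tolerance scales with mass, so a 10 ppm window is about 0.01 Da at
MH 1000 and about 0.02 Da at MH 2000.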
 139
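Sketch for the splitting idea above: parts get linearly smaller, so the
parts processed last are the smallest and can fill in the final gaps.  The
slope question is left open in the note, so the ratio parameter here
(largest weight over smallest weight) is a hypothetical knob, and
linear_split_sizes is an illustrative name only.  Every part gets at least
one spectrum.

```python
def linear_split_sizes(n_spectra, n_parts, ratio=3.0):
    """Return part sizes, largest first, that sum to n_spectra.

    ASSUMPTION: a linear size curve whose first weight is `ratio` times
    the last one.
    """
    if not 1 <= n_parts <= n_spectra:
        raise ValueError("need 1 <= n_parts <= n_spectra")
    if n_parts == 1:
        return [n_spectra]
    # Weights fall linearly from `ratio` (first part) to 1.0 (last part).
    weights = [ratio - (ratio - 1.0) * i / (n_parts - 1)
               for i in range(n_parts)]
    total = sum(weights)
    # Reserve one spectrum per part, then share the rest by weight.
    spare = n_spectra - n_parts
    sizes = [1 + int(spare * w / total) for w in weights]
    # Rounding down leaves a few spectra over; give them to the first parts.
    leftover = n_spectra - sum(sizes)
    for i in range(leftover):
        sizes[i % n_parts] += 1
    return sizes


sizes = linear_split_sizes(100, 4, ratio=3.0)
```

With ratio=1.0 this reduces to the current equal-sized split, which makes
it easy to compare the two schemes on the same run.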
 140
 141 PUTATIVE INSIGHTS
 142
 143 - At least for deeper mod searches, evaluation time for real vs synthetic
 144   spectra swamps everything else.  (Generation of synthetic spectra is
 145   noticeable, at about 15%.)
 146
 147 - This may mean that ordering spectra by parent mass is pointless?!
 148
 149 - We can afford to be a little sloppy in how we generate the comparisons
 150   (as long as we're not generating duplicates, of course).
 151
 152 - The number of leaves at level N is probably about N times more than all of
 153   the previous N-1 levels put together.
 154
 155 - SEQUEST does its FFT step only for a fixed number (500?) of candidate matches
 156   for each spectrum.  If the number of matches explodes with increasing depth,
 157   does this imply that only their preliminary scoring algorithm really matters
 158   for mod searches?
 159
 160 - X!Tandem limits modification combinations searched to 2**12 or so.  So for
  161   deeper searches it just silently gives up.
 162
 163 - The way X!Tandem quantizes peaks leads to noticeable quantization error.
 164
 165 - What is this C+57 mod called?  Carboxyamidomethyl?  +C2H3ON!
 166
 167 - Is there anything we can do with neutral losses?
 168
 169 - test spectrum synthesis
 170
 171 - test semi-tryptic cleavage
 172
 173 - double-check handling of FP arithmetic using epsilons (no ==, no strict <)
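
Sketch for the last item above (FP arithmetic using epsilons, no == and no
strict <).  The helper names and the EPS value are hypothetical; a suitable
epsilon depends on the mass scale in use and is not given in these notes.

```python
EPS = 1e-9  # ASSUMPTION: absolute tolerance, chosen for illustration only


def fp_eq(a, b, eps=EPS):
    """Equal within eps (replaces a == b)."""
    return abs(a - b) <= eps


def fp_lt(a, b, eps=EPS):
    """Less than by more than eps (replaces a strict a < b)."""
    return a < b - eps


def fp_le(a, b, eps=EPS):
    """Less than, or equal within eps (replaces a <= b)."""
    return a <= b + eps


total = 0.1 + 0.2  # not exactly 0.3 in binary floating point
```

With these helpers exactly one of fp_lt(a, b), fp_eq(a, b) and fp_lt(b, a)
holds for any pair of finite values, which keeps range checks consistent.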
 174
 175
 176
 177 Current sloccount comparison (generated using David A. Wheeler's 'SLOCCount'):
 178 greylag: cpp:   898 py: 1400 (+336 sh to set up parallel jobs at SIMR)
 179 xtandem: cpp: 13058 (+ 1271 for parallel tandem -> 14329)
 180 omssa:   cpp:  7583 (plus an unknown, possibly large number from the NCBI
 181                      toolkits [33 distinct headers])
 182                     (the toolkits are 1000000 sloc, 65% cpp, 34% c)
 183 XXX:     cpp:  6534 (not counting expat code)
 184
 185
 186 my source print command:
 187
  188   enscript -E -B -3 -r -s 0 --borders -fCourier4.8 --mark-wrapped-lines=arrow \
 189            --margins=:30::