1 GREYLAG TODO LIST -*-outline-*-
4 ==============================================================================
8 1. Replace SEQUEST at SIMR with something at least as good.
9 2. Do better than SEQUEST for things that SIMR cares about.
10 3. Showcase the implementation strategy of Python with a C-ish inner loop.
11 4. Try to take the best ideas from other similar programs.
12 5. Greylag as a pedagogical artifact and foundation for further
15 ==============================================================================
20 * Good first impression
22 * Handles at least LCQ input
23 * Handles nonspecific cleavage
24 * Generates SQT output, usable in our pipeline, at least for non-N15 runs
25 * Decent performance/efficiency on our clusters
26 * Basic how-to-use-it documentation
30 * Handles tryptic/etc cleavage
32 * Basic public website/source release/git archive
33 * Documentation (asciidoc/man pages)
39 * Clean up and check in working part of test suite
41 * Check that greylag-* errors if no input files specified
43 * Look for (user-visible) changes that are easy to make now but harder later...
45 * Add missing-feature matrix (wrt other programs, etc)
47 * Create basic documentation in reST format
48 ** Need HTML (web and standalone), PDF?
49 *** index page (what is it? why should you care? links)
53 *** theory of operation document
54 **** basics of id search
55 **** explanation of greylag algorithm
56 **** greylag usage scenarios (solo, cluster)
57 **** how does MudPIT work?
58 **** design choices, technology rationale (upsides/downsides)
59 ** Makefile, install-to-web targets
62 * Generate a set of sample .conf files
63 ** One long, commented template file?
66 ** What it does and doesn't yet do
70 * Try DTASelect on greylag SQT files
73 ** spectrum viewing works?
77 = M1 =========================================================================
80 ** check in for complete release
84 * Try greylag-validate refinement
86 * Implement semi-tryptic cleavage
87 ** Reputed to be almost as good as tryptic (valid peptide count)?
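A rough sketch of what semi-tryptic enumeration could look like, assuming the common trypsin convention (cleave after K or R, but not before P) and ignoring missed-cleavage limits and mass filtering. The function names and the `min_len` parameter are hypothetical, not existing greylag code.

```python
def tryptic_sites(sequence):
    """Return the cut positions for trypsin, including both protein termini.

    Assumed rule: cleave after K or R unless the next residue is P.
    """
    sites = {0, len(sequence)}
    for i in range(1, len(sequence)):
        if sequence[i - 1] in "KR" and sequence[i] != "P":
            sites.add(i)
    return sites


def semi_tryptic_peptides(sequence, min_len=1):
    """Return all subsequences with at least one tryptic terminus."""
    sites = tryptic_sites(sequence)
    peptides = set()
    for start in range(len(sequence)):
        for end in range(start + min_len, len(sequence) + 1):
            # Semi-tryptic: either end (or both) falls on a cleavage site.
            if start in sites or end in sites:
                peptides.add(sequence[start:end])
    return peptides
```

The candidate set grows roughly linearly with peptide length per cleavage site, rather than quadratically as in a fully nonspecific search.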
89 * Debian/Ubuntu package?
91 * greylag-validate parent tolerance analysis
93 * Add docstring for every function
95 * Need some greylag-merge test cases
96 ** Test RAM requirements on large files
98 * Basic optimization (just a quick further look for easy speedups)
102 * Change the way mod limit works? ({1,4} feature?)
109 * Further test case updates/adds
110 ** enzymatic cleavage
112 * Use SHA1 digests to keep files in sync?
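One way the digest idea might work, assuming the goal is to notice when an input file has changed since its digest was recorded. `file_sha1` and `in_sync` are illustrative names only.

```python
import hashlib


def file_sha1(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def in_sync(path, recorded_digest):
    """True if the file still matches a digest recorded earlier."""
    return file_sha1(path) == recorded_digest
```

Reading in chunks keeps memory use flat even for large ms2 or database files.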
114 * Check on score stability
116 * Add A1S2D1C3A1S2D1F4 marking form for isotope regimes?
118 * Register copyright?
121 * More testing of SIMR cases
127 * Think about how to implement LAF's PTM pipeline
130 * Try to test against MyriMatch (results should be similar)
133 * How should greylag-validate handle I/L, P/V, etc?
136 = M2 =========================================================================
139 * Evaluate performance differences vs X!Tandem?
141 * Try the MM smart +3 model--much improvement?
143 * Try the MM precursor mass adjustment--much improvement? even a good idea?
145 * Add isotope jitter feature, for Orbitrap.
147 X!Tandem considers one C13 if MH>1000, and one or two C13 if MH>1500. Should we
148 try to predict this based on the peptide sequence? MH probably close
149 enough. What does MyriMatch do?
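A minimal sketch of the MH-threshold rule described in the note above, assuming the jitter is applied by subtracting whole C13 offsets from the observed parent mass. The function names are hypothetical; only the thresholds come from the note.

```python
C13_MASS_DELTA = 1.0033548  # mass difference between C13 and C12, in Da


def c13_counts_to_try(mh):
    """Return the numbers of C13 atoms to consider for parent mass MH."""
    if mh > 1500:
        return [0, 1, 2]
    if mh > 1000:
        return [0, 1]
    return [0]


def candidate_parent_masses(mh):
    """Return candidate monoisotopic masses for an observed parent mass."""
    return [mh - n * C13_MASS_DELTA for n in c13_counts_to_try(mh)]
```

A sequence-based prediction would replace `c13_counts_to_try`; the rest of the search would only see a longer list of candidate masses.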
152 * Implement MyriMatch charge-calling algorithm?
154 * Implement MyriMatch deisotoping?
157 * Investigate identification differences between greylag and SEQUEST/MM.
159 * Clean up [,] (N/C-terminal) mass regime calculations
161 * Pass through the C++ code looking for counts that could conceivably overflow
162 ** Fix or add assertions
164 * Add duplicate peptide masking optimization
165 ** Problem: shuffled versions generally not identical.
166 *** Limits potential speedup to 25-30%
167 ** This will obviate the need to detect identical best matches at search time?
168 ** Fix redundant peptide reporting
170 * Make a tool to compare greylag vs SEQUEST results by spectrum.
172 Want gross statistics--how many id's are the same, different, missing, etc.
173 For each spectrum, want to see what each program did, and how many times the
174 assigned locus was otherwise id'ed.
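The gross-statistics part could be sketched as follows, assuming both result sets have already been parsed into dicts mapping a spectrum id to its best peptide. The parsing step, the per-spectrum report and the locus counts are left out, and `compare_ids` is a hypothetical name.

```python
from collections import Counter


def compare_ids(greylag, sequest):
    """Tally agreement between two {spectrum id: peptide} dicts."""
    stats = Counter()
    for spectrum in set(greylag) | set(sequest):
        g = greylag.get(spectrum)
        s = sequest.get(spectrum)
        if g is None:
            stats["sequest_only"] += 1
        elif s is None:
            stats["greylag_only"] += 1
        elif g == s:
            stats["same"] += 1
        else:
            stats["different"] += 1
    return stats
```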
176 * PPM error tolerances (MyriMatch doesn't implement this?)
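For reference, a PPM tolerance is just a tolerance that scales with mass. A small sketch with illustrative helper names:

```python
def ppm_window(mass, ppm):
    """Return the (low, high) mass window for a tolerance given in PPM."""
    delta = mass * ppm / 1e6
    return mass - delta, mass + delta


def within_ppm(observed, theoretical, ppm):
    """True if observed lies within ppm parts-per-million of theoretical."""
    return abs(observed - theoretical) <= theoretical * ppm / 1e6
```

At 10 PPM, a 1000 Da parent gets a window of plus or minus 0.01 Da, and a 3000 Da parent gets plus or minus 0.03 Da.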
178 * Make --estimate work correctly over the cluster. (Currently takes 6 hours to
179 estimate 60--is this worthwhile? Could we simply estimate one bag and
180 multiply by the number of bags?)
182 * Better shuffling than current model.
184 * Useful to scale fragment tolerance by charge, too?
186 * Have --estimate generate a spectrum work count file (*.est?) that can be
187 used by --part-split to generate evenly sized parts. (Check that file is
188 newer than params file and ms2 file arguments, and that all ms2 file
189 arguments were estimated.)
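A sketch of how a work count file might drive the split, assuming the estimates are available as a dict of spectrum id to estimated work. It uses a greedy largest-first assignment, which is one reasonable heuristic and not necessarily what --part-split should do. All names are hypothetical.

```python
import heapq
import os


def estimate_is_fresh(est_path, input_paths):
    """True if the estimate file is newer than every input file."""
    est_mtime = os.path.getmtime(est_path)
    return all(os.path.getmtime(p) < est_mtime for p in input_paths)


def split_evenly(work_by_spectrum, n_parts):
    """Assign spectra to n_parts lists with roughly equal total work."""
    parts = [[] for _ in range(n_parts)]
    heap = [(0, i) for i in range(n_parts)]  # (current load, part index)
    heapq.heapify(heap)
    # Place the most expensive spectra first, always on the lightest part.
    for spectrum, work in sorted(work_by_spectrum.items(),
                                 key=lambda kv: -kv[1]):
        load, i = heapq.heappop(heap)
        parts[i].append(spectrum)
        heapq.heappush(heap, (load + work, i))
    return parts
```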
191 * Fix "cleavage C-terminal mass change" issue. Should this be interpreted as
192 MONO, ! (first fragment regime), or by regime? Look for similar problems
195 * Make static '[' mod exclude PCA mods.
197 * Mine OMSSA and MyriMatch for ideas. Look again at X!Tandem and SEQUEST
200 * Add refinement. (like X!Tandem?)
202 * Advanced refinement ideas. For example, only search a locus for a hit with N
203 mods if we got a hit for it with 0..N-1 mods (or maybe 0..N-2?). Or, only
204 search a locus non-tryptically (or semi-tryptically) if we got a tryptic hit
207 ** Investigate current SEQUEST search results to see if this looks feasible.
209 * Think about ways to get more id's per hour of processing time.
211 * Try to adapt to instrument accuracy. Maybe start with a narrow parent mass
212 range and adaptively widen it.
214 * Profiling to find slow spots, and for correctness?
216 * Heavy optimization on inner loop.
217 ** Try running from both ends simultaneously.
218 ** Watch cache usage.
220 * Rigorously check all values coming in from Python (at least by assert).
222 * Incrementalize the whole program. Want to be able to take an existing run
223 and spend more time on it to get more results, possibly concentrating on a
224 particular kind of modification.
226 * Try to figure out whether SEQUEST is really searching everything, or whether
227 it gives up in certain cases like X!Tandem does.
229 * The isotopes S34 and C13 are common (4%, 1%). Is there a good way to look for
230 them? We could look for singleton occurrences pretty cheaply using a delta
233 * Could try switching FP code to use integers instead (but ugh)
235 * Is there anything we can do with neutral losses?
237 * Need a way to test spectrum synthesis?