1 GREYLAG TODO LIST -*-outline-*-
4 ==============================================================================
8 1. Replace SEQUEST at SIMR with something at least as good.
9 2. Do better than SEQUEST for things that SIMR cares about.
10 3. Showcase the implementation strategy of Python with a C-ish inner loop.
11 4. Try to take the best ideas from other similar programs.
12 5. Greylag as a pedagogical artifact and foundation for further
15 ==============================================================================
20 * Good first impression
22 * Handles at least LCQ input
23 * Handles nonspecific cleavage
24 * Generates SQT output, usable in our pipeline, at least for non-N15 runs
25 * Decent performance/efficiency on our clusters
26 * Basic how-to-use-it documentation
30 * Handles tryptic/etc cleavage
32 * Basic public website/source release/git archive
33 * Documentation (asciidoc/man pages)
39 * Clean up and check in working part of test suite
41 * Check that greylag-* errors if no input files specified
43 * Look for (user-visible) changes that are easy to make now but harder later...
45 * Add missing-feature matrix (wrt other programs, etc)
47 * Create basic documentation in reST format
48 ** Need HTML (web and standalone), PDF?
49 *** index page (what is it? why should you care? links)
53 *** theory of operation document
54 **** basics of id search
55 **** explanation of greylag algorithm
56 **** greylag usage scenarios (solo, cluster)
57 **** how does MuDPIT work?
58 **** design choices, technology rationale (upsides/downsides)
59 ** Makefile, install-to-web targets
62 * Generate a set of sample .conf files
63 ** One long, commented template file?
66 ** What it does and doesn't yet do
71 = M1 =========================================================================
74 ** check in for complete release
78 * Implement semi-tryptic cleavage
79 ** Reputed to be almost as good as tryptic (valid peptide count)?
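A minimal sketch of semi-tryptic enumeration, to make the "valid peptide
count" question concrete. It assumes the usual trypsin rule (cleave after K or
R, but not before P) and treats a span as semi-tryptic when at least one end
falls on a cleavage site. The function names and the length limits are
hypothetical, not greylag's actual code.

```python
import re

def tryptic_sites(protein):
    """Return cleavage positions: after K or R (not before P), plus both termini."""
    sites = {0, len(protein)}
    for match in re.finditer(r'[KR](?!P)', protein):
        sites.add(match.end())
    return sites

def semi_tryptic_spans(protein, min_len=5, max_len=40):
    """Yield (begin, end) spans having at least one end on a tryptic site."""
    sites = tryptic_sites(protein)
    n = len(protein)
    for begin in range(n):
        for end in range(begin + min_len, min(begin + max_len, n) + 1):
            if begin in sites or end in sites:
                yield begin, end

def fully_tryptic_spans(protein, min_len=5, max_len=40):
    """Yield (begin, end) spans having both ends on tryptic sites."""
    sites = tryptic_sites(protein)
    for begin, end in semi_tryptic_spans(protein, min_len, max_len):
        if begin in sites and end in sites:
            yield begin, end
```

Comparing the lengths of the two span lists on a real database would give the
candidate-count ratio between semi-tryptic and tryptic searches.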
81 * Debian/Ubuntu package?
83 * Add docstring for every function
85 * Need some greylag-merge test cases
86 ** Test RAM requirements on large files
88 * Basic optimization (just a quick further look for easy speedups)
92 * Change the way mod limit works? ({1,4} feature?)
99 * Further test case updates/adds
100 ** enzymatic cleavage
102 * Use SHA1 digests to keep files in sync?
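One way this could look, as a sketch only: record a SHA1 digest per input file
when a run starts, then compare before reusing results. The helper names and
the idea of a path-to-digest dict are assumptions for illustration.

```python
import hashlib

def file_sha1(path, chunk_size=1 << 20):
    """Return the hex SHA1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

def files_in_sync(recorded, paths):
    """True if every path still matches its previously recorded digest.

    recorded is a dict mapping path -> hex digest saved earlier.
    """
    return all(recorded.get(p) == file_sha1(p) for p in paths)
```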
104 * Check on score stability
106 * Add A1S2D1C3A1S2D1F4 marking form for isotope regimes?
108 * Register copyright?
111 * More testing of SIMR cases
117 * Try to test against MyriMatch (results should be similar)
120 = M2 =========================================================================
123 * Evaluate performance differences vs X!Tandem?
125 * Try the MM smart +3 model--much improvement?
127 * Try the MM precursor mass adjustment--much improvement? even a good idea?
129 * Add isotope jitter feature, for Orbitrap.
131 X!Tandem considers one C13 if MH>1000, and one/two C13 if MH>1500. Should we
132 try to predict this based on the peptide sequence? MH probably close
133 enough. What does MyriMatch do?
135 new: not clear whether this is really productive, as opposed to just
136 searching a wider window (looking for 16/17 loss might be, though)
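A sketch of the mass-threshold rule described above, assuming the thresholds
in the note (1000 and 1500) and a C13-C12 mass difference of 1.0033548 Da. The
function name is hypothetical; this only generates the candidate parent masses
to search, it does not decide whether doing so is productive.

```python
C13_DELTA = 1.0033548  # mass difference between C13 and C12, in Da

def candidate_parent_masses(mh):
    """Return the observed MH plus versions corrected for 1 or 2 C13 atoms.

    Thresholds follow the rule in the note above: allow one C13 if
    MH > 1000, and up to two if MH > 1500.
    """
    if mh > 1500:
        max_c13 = 2
    elif mh > 1000:
        max_c13 = 1
    else:
        max_c13 = 0
    return [mh - k * C13_DELTA for k in range(max_c13 + 1)]
```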
138 * Implement MyriMatch charge-calling algorithm?
140 * Implement MyriMatch deisotoping?
143 * Investigate identification differences between greylag and SEQUEST/MM.
145 * Pass through the C++ code looking for counts that could conceivably overflow
146 ** Fix or add assertions
148 * Add duplicate peptide masking optimization
149 ** Problem: shuffled versions generally not identical.
150 *** Limits potential speedup to 25-30%
151 ** This will obviate the need to detect identical best matches at search time?
152 ** Fix redundant peptide reporting
154 * Make a tool to compare greylag vs SEQUEST results by spectrum.
156 Want gross statistics--how many id's are the same, different, missing, etc.
157 For each spectrum, want to see what each program did, and how many times the
158 assigned locus was otherwise id'ed.
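A rough sketch of the gross statistics, assuming each program's results have
already been parsed into a dict mapping spectrum id to a (peptide, locus)
pair. That in-memory form and both function names are assumptions; parsing
the actual result files is not shown.

```python
from collections import Counter

def compare_ids(greylag, sequest):
    """Tally agreement between two {spectrum: (peptide, locus)} mappings."""
    stats = Counter()
    for spectrum in set(greylag) | set(sequest):
        g, s = greylag.get(spectrum), sequest.get(spectrum)
        if g is None:
            stats['missing from greylag'] += 1
        elif s is None:
            stats['missing from SEQUEST'] += 1
        elif g[0] == s[0]:
            stats['same'] += 1
        else:
            stats['different'] += 1
    return stats

def other_ids_for_locus(ids, spectrum):
    """Count how many other spectra were assigned the same locus."""
    locus = ids[spectrum][1]
    return sum(1 for key, (_, loc) in ids.items()
               if key != spectrum and loc == locus)
```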
160 * PPM error tolerances (MyriMatch doesn't implement this?)
162 Not obvious that this is actually helpful.
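For reference, a PPM tolerance scales the allowed error with the mass instead
of using a fixed Da window. A minimal sketch, with hypothetical function names:

```python
def ppm_window(mass, ppm):
    """Return the (low, high) mass range within +/- ppm of mass."""
    delta = mass * ppm / 1e6
    return mass - delta, mass + delta

def within_ppm(observed, theoretical, ppm):
    """True if observed is within ppm parts-per-million of theoretical."""
    return abs(observed - theoretical) <= theoretical * ppm / 1e6
```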
164 * Make --estimate work correctly over cluster. (Currently takes 6 hours to
165 estimate 60--is this worthwhile? Could we simply estimate one bag and
166 multiply by the number of bags?)
168 * Better shuffling than current model.
170 * Useful to scale fragment tolerance by charge, too?
172 * Have --estimate generate a spectrum work count file (*.est?) that can be
173 used by --part-split to generate evenly sized parts. (Check that file is
174 newer than params file and ms2 file arguments, and that all ms2 file
175 arguments were estimated.)
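A sketch of the two pieces this item needs: a freshness check on the estimate
file, and a split into evenly sized parts. It assumes the work counts have
been loaded into a dict mapping spectrum to estimated work. The split uses a
greedy largest-first heuristic, which is one possible choice and not
necessarily what --part-split should do. All names here are hypothetical.

```python
import heapq
import os

def estimate_is_fresh(est_path, input_paths):
    """True if est_path exists and is newer than every input file."""
    if not os.path.exists(est_path):
        return False
    est_mtime = os.path.getmtime(est_path)
    return all(os.path.getmtime(p) < est_mtime for p in input_paths)

def split_evenly(work, parts):
    """Assign spectra to parts so that total estimated work is balanced.

    work maps spectrum -> estimated work count.  Greedy heuristic: take
    spectra from largest to smallest, always adding to the lightest part.
    Returns a list of (total_work, [spectra]) pairs, one per part.
    """
    heap = [(0, i, []) for i in range(parts)]
    for spectrum, cost in sorted(work.items(), key=lambda kv: -kv[1]):
        total, i, members = heapq.heappop(heap)
        members.append(spectrum)
        heapq.heappush(heap, (total + cost, i, members))
    return [(total, members)
            for total, i, members in sorted(heap, key=lambda t: t[1])]
```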
177 * Make static '[' mod exclude PCA mods.
179 * Mine OMSSA and MyriMatch for ideas. Look again at X!Tandem and SEQUEST
182 * Add refinement. (like X!Tandem?)
184 * Advanced refinement ideas. For example, only search a locus for a hit with N
185 mods if we got a hit for it with 0..N-1 mods (or maybe 0..N-2?). Or, only
186 search a locus non-tryptically (or semi-tryptically) if we got a tryptic hit
189 ** Investigate current SEQUEST search results to see if this looks feasible.
191 * Think about ways to get more id's per hour of processing time.
193 * Try to adapt to instrument accuracy. Maybe start with a narrow parent mass
194 range and adaptively widen it.
196 * Profiling to find slow spots, and for correctness?
198 * Heavy optimization on inner loop.
199 ** Try running from both ends simultaneously.
200 ** Watch cache usage.
202 * Rigorously check all values coming in from Python (at least by assert).
204 * Incrementalize the whole program. Want to be able to take an existing run
205 and spend more time on it to get more results, possibly concentrating on a
206 particular kind of modification.
208 * Try to figure out whether SEQUEST is really searching everything, or whether
209 it gives up in certain cases like X!Tandem does.
211 * Isotopes S34 and C13 are common (4%, 1%). Is there a good way to look for
212 them? We could look for singleton occurrences pretty cheaply using a delta
215 * Could try switching FP code to use integers instead (but ugh)
217 * Is there anything we can do with neutral losses?
219 * Need a way to test spectrum synthesis?