My first backup
[dmvccm.git] / DMVCCM.html
blobcb36d4055fafcf355ec93e04dc844d53b38dfb8f
1 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
2 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
3 <html xmlns="http://www.w3.org/1999/xhtml"
4 lang="en" xml:lang="en">
5 <head>
6 <title>DMV/CCM &ndash; todo-list / progress</title>
7 <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
8 <meta name="generator" content="Org-mode"/>
9 <meta name="generated" content="2008/06/04 15:01:15"/>
10 <meta name="author" content="Kevin Brubeck Unhammer"/>
11 <link rel="stylesheet" type="text/css" href="http://www.student.uib.no/~kun041/org.css">
12 </head><body>
13 <h1 class="title">DMV/CCM &ndash; todo-list / progress</h1>
14 #+OPTIONS: ^:{} H:4 toc:3
18 <div id="table-of-contents">
19 <h2>Table of Contents</h2>
20 <div id="text-table-of-contents">
21 <ul>
22 <li><a href="#sec-1">1 dmvccm report and project</a></li>
23 <li><a href="#sec-2">2 Adjacency and combining it with inner()</a></li>
24 <li><a href="#sec-3">3 [#A] P_STOP for IO/EM</a></li>
25 <li><a href="#sec-4">4 P_CHOOSE for IO/EM</a></li>
26 <li><a href="#sec-5">5 Initialization </a></li>
27 <li><a href="#sec-6">6 [#C] Deferred</a></li>
28 <li><a href="#sec-7">7 Expectation Maximation in IO/DMV-terms</a></li>
29 <li><a href="#sec-8">8 Python-stuff</a></li>
30 </ul>
31 </div>
32 </div>
34 <div id="outline-container-1" class="outline-2">
35 <h2 id="sec-1">1 dmvccm report and project</h2>
36 <div id="text-1">
38 <p><span class="timestamp-kwd">DEADLINE: </span> <span class="timestamp">2008-06-30 Mon</span><br/>
39 But absolute, extended, really-quite-dead-now deadline: August
40 31&hellip;
41 </p><ul>
42 <li>
43 <a href="src/dmv.py">dmv.py</a>
44 </li>
45 <li>
46 <a href="src/io.py">io.py</a>
47 </li>
48 <li>
49 <a href="src/harmonic.py">harmonic.py</a>
50 </li>
51 </ul>
52 </div>
54 </div>
56 <div id="outline-container-2" class="outline-2">
57 <h2 id="sec-2">2 <span class="todo">TODO</span> Adjacency and combining it with inner()</h2>
58 <div id="text-2">
60 <p>Each DMV_Rule now has both a probN and a probA, for
61 adjacencies. inner() needs the correct one in each case.
62 </p>
63 <p>
64 Adjacency gives a problem with duplicate words/tags, eg. in the
65 sentence "a a b". If this has the dependency structure b-&gt;a<sub>0</sub>-&gt;a<sub>1</sub>,
66 then b is non-adjacent to a<sub>0</sub> and should use probN (for the LRStop and
67 the attachment of a<sub>0</sub>), while the other rules should all use
68 probA. But within the e(0,2,b) we can't just say "oh, a has index 0
69 so it's not adjacent to 2", since there's also an a at index 1, and
70 there's also a dependency structure b-&gt;a<sub>1</sub>-&gt;a<sub>0</sub> for that. We want
71 both. And in possibly much more complex versions.
72 </p>
73 <p>
74 Ideas:
75 </p><ul>
76 <li>
77 I first thought of decorating the individual words/tags in a
78 sentence with their indices, and perhaps just duplicating the
79 relevant rules (one for each index of the duplicate tags). But this
80 gives an explosion in attachment rules (although a contained
81 explosion, within the rules used in a sentence; but most sentences
82 will have at least two NN's so it will be a problem).
83 </li>
84 <li>
85 Then, I had a <i>brilliant</i> idea. Just let e(), the helper function of
86 inner(), parametrize for an extra pair of boolean values for whether
87 or not we've attached anything to the left or right yet ("yet"
88 meaning "below"). So now, e() has a chart of the form [s, t, LHS,
89 Lattach, Rattach], and of course e(s,t,LHS) is the sum of the four
90 possible values for (Lattach,Rattach). This makes e() lots more
91 complex and DMV-specific though, so it's been rewritten in
92 inner_dmv() in dmv.py.
93 </li>
94 <li id="sec-2.1"><span class="todo">TODO</span> document this adjacency stuff better<br/>
95 </li>
96 <li id="sec-2.2"><span class="todo">TODO</span> test and debug my brilliant idea<br/>
97 </li>
98 <li id="sec-2.3"><span class="done">DONE</span> implement my brilliant idea.<br/>
99 <span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-06-01 Sun 17:19</span><br/>
100 <a href="src/dmv.py">e(sti) in dmv.py</a>
102 </li>
103 <li id="sec-2.4"><span class="done">DONE</span> [#A] test inner() on sentences with duplicate words<br/>
104 Works with eg. the sentence "h h h"
107 </li>
108 </ul>
109 </div>
111 </div>
113 <div id="outline-container-3" class="outline-2">
114 <h2 id="sec-3">3 <span class="todo">TODO</span> [#A] P_STOP for IO/EM</h2>
115 <div id="text-3">
117 <p><a href="src/dmv.py">dmv-P_STOP</a>
118 Remember: The P<sub>STOP</sub> formula is upside-down (left-to-right also).
119 (In the article..not the thesis)
120 </p>
122 Remember: Initialization makes some "short-cut" rules, these will also
123 have to be updated along with the other P<sub>STOP</sub> updates:
124 </p><ul>
125 <li>
126 b[(NOBAR, n<sub>h</sub>), 'h'] = 1.0 # always
127 </li>
128 <li>
129 b[(RBAR, n<sub>h</sub>), 'h'] = h_.probA # h_ is RBAR stop rule
130 </li>
131 <li>
132 b[(LRBAR, n<sub>h</sub>), 'h'] = h_.probA * _ h_.probA
134 </li>
135 <li id="sec-3.1">How is the P_STOP formula different given other values for dir and adj?<br/>
136 (Presumably, the P<sub>STOP</sub> formula where STOP is True is just the
137 rule-probability of _ h_ -&gt; STOP h_ or h_ -&gt; h STOP, but how does
138 adjacency fit in here?)
141 (And P<sub>STOP</sub>(-STOP|&hellip;) = 1 - P<sub>STOP</sub>(STOP|&hellip;) )
142 </p></li>
143 </ul>
144 </div>
146 </div>
148 <div id="outline-container-4" class="outline-2">
149 <h2 id="sec-4">4 <span class="todo">TODO</span> P_CHOOSE for IO/EM</h2>
150 <div id="text-4">
152 <p>Write the formulas! should be easy?
153 </p></div>
155 </div>
157 <div id="outline-container-5" class="outline-2">
158 <h2 id="sec-5">5 Initialization </h2>
159 <div id="text-5">
161 <p><a href="/Users/kiwibird/Documents/Skole/V08/Probability/dmvccm/src/dmv.py">dmv-inits</a>
162 </p>
164 We do have to go through the corpus, since the probabilities are based
165 on how far away in the sentence arguments are from their heads.
166 </p><ul>
167 <li id="sec-5.1"><span class="todo">TODO</span> Separate initialization to another file? &nbsp;&nbsp;&nbsp;<span class="tag">PRETTIER</span><br/>
168 (It's rather messy.)
169 </li>
170 <li id="sec-5.2"><span class="todo">TOGROK</span> CCM Initialization <br/>
171 P<sub>SPLIT</sub> used here&hellip; how, again?
172 </li>
173 <li id="sec-5.3"><span class="done">DONE</span> DMV Initialization probabilities<br/>
174 (from initialization frequency)
175 </li>
176 <li id="sec-5.4"><span class="done">DONE</span> DMV Initialization frequencies <br/>
177 <span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-05-27 Tue 20:04</span><br/>
178 <ul>
179 <li id="sec-5.4.1">P_STOP <br/>
180 P<sub>STOP</sub> is not well defined by K&amp;M. One possible interpretation given
181 the sentence [det nn vb nn] is
182 <pre>
183 f_{STOP}( STOP|det, L, adj) +1
184 f_{STOP}(-STOP|det, L, adj) +0
185 f_{STOP}( STOP|det, L, non_adj) +1
186 f_{STOP}(-STOP|det, L, non_adj) +0
187 f_{STOP}( STOP|det, R, adj) +0
188 f_{STOP}(-STOP|det, R, adj) +1
190 f_{STOP}( STOP|nn, L, adj) +0
191 f_{STOP}(-STOP|nn, L, adj) +1
192 f_{STOP}( STOP|nn, L, non_adj) +1 # since there's at least one to the left
193 f_{STOP}(-STOP|nn, L, non_adj) +0
194 </pre>
195 <ul>
196 <li id="sec-5.4.1.1"><span class="todo">TODO</span> tweak<br/>
197 <pre>
198 f[head, 'STOP', 'LN'] += (i_h &lt;= 1) # first two words
199 f[head, '-STOP', 'LN'] += (not i_h &lt;= 1)
200 f[head, 'STOP', 'LA'] += (i_h == 0) # very first word
201 f[head, '-STOP', 'LA'] += (not i_h == 0)
202 f[head, 'STOP', 'RN'] += (i_h &gt;= n - 2) # last two words
203 f[head, '-STOP', 'RN'] += (not i_h &gt;= n - 2)
204 f[head, 'STOP', 'RA'] += (i_h == n - 1) # very last word
205 f[head, '-STOP', 'RA'] += (not i_h == n - 1)
206 </pre>
208 <pre>
209 # this one requires some additional rewriting since it
210 # introduces divisions by zero
211 f[head, 'STOP', 'LN'] += (i_h == 1) # second word
212 f[head, '-STOP', 'LN'] += (not i_h &lt;= 1) # not first two
213 f[head, 'STOP', 'LA'] += (i_h == 0) # first word
214 f[head, '-STOP', 'LA'] += (not i_h == 0) # not first
215 f[head, 'STOP', 'RN'] += (i_h == n - 2) # second-to-last
216 f[head, '-STOP', 'RN'] += (not i_h &gt;= n - 2) # not last two
217 f[head, 'STOP', 'RA'] += (i_h == n - 1) # last word
218 f[head, '-STOP', 'RA'] += (not i_h == n - 1) # not last
219 </pre>
221 <pre>
222 f[head, 'STOP', 'LN'] += (i_h == 1) # second word
223 f[head, '-STOP', 'LN'] += (not i_h == 1) # not second
224 f[head, 'STOP', 'LA'] += (i_h == 0) # first word
225 f[head, '-STOP', 'LA'] += (not i_h == 0) # not first
226 f[head, 'STOP', 'RN'] += (i_h == n - 2) # second-to-last
227 f[head, '-STOP', 'RN'] += (not i_h == n - 2) # not second-to-last
228 f[head, 'STOP', 'RA'] += (i_h == n - 1) # last word
229 f[head, '-STOP', 'RA'] += (not i_h == n - 1) # not last
230 </pre>
232 "all words take the same number of arguments" interpreted as
233 <pre>
234 for all heads:
235 p_STOP(head, 'STOP', 'LN') = 0.3
236 p_STOP(head, 'STOP', 'LA') = 0.5
237 p_STOP(head, 'STOP', 'RN') = 0.4
238 p_STOP(head, 'STOP', 'RA') = 0.7
239 </pre>
240 (which we easily may tweak in init_zeros())
241 </li>
242 </ul>
243 </li>
244 <li id="sec-5.4.2">P_CHOOSE<br/>
245 Go through the corpus, counting distances between heads and
246 arguments. In [det nn vb nn], we give
247 <ul>
248 <li>
249 f<sub>CHOOSE</sub>(nn|det, R) +1/1 + C
250 </li>
251 <li>
252 f<sub>CHOOSE</sub>(vb|det, R) +1/2 + C
253 </li>
254 <li>
255 f<sub>CHOOSE</sub>(nn|det, R) +1/3 + C
256 <ul>
257 <li>
258 If this were the full corpus, P<sub>CHOOSE</sub>(nn|det, R) would have
259 (1+1/3+2C) / sum_a f<sub>CHOOSE</sub>(a|det, R)
261 </li>
262 </ul>
263 </li>
264 </ul>
266 <p>The ROOT gets "each argument with equal probability", so in a sentence
267 of three words, 1/3 for each (in [nn vb nn], 'nn' gets 2/3). Basically
268 just a frequency count of the corpus&hellip;
269 </p></li>
270 </ul>
271 </li>
272 </ul>
273 </div>
275 </div>
277 <div id="outline-container-6" class="outline-2">
278 <h2 id="sec-6">6 [#C] Deferred</h2>
279 <div id="text-6">
281 <ul>
282 <li id="sec-6.1"><span class="todo">TODO</span> inner_dmv() should disregard rules with heads not in sent &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
283 If the sentence is "nn vbd det nn", we should not even look at rules
284 where
285 <pre>
286 rule.head() not in "nn vbd det nn".split()
287 </pre>
288 This is ruled out by getting rules from g.rules(LHS, sent).
291 Also, we optimize this further by saying we don't even recurse into
292 attachment rules where
293 <pre>
294 rule.head() not in sent[ s :r+1]
295 rule.head() not in sent[r+1:t+1]
296 </pre>
297 meaning, if we're looking at the span "vbd det", we only use
298 attachment rules where both daughters are members of ['vbd','det']
299 (although we don't (yet) care about removing rules that rewrite to the
300 same tag if there are no duplicate tags in the span, etc., that would
301 be a lot of trouble for little potential gain).
302 </p></li>
303 <li id="sec-6.2"><span class="todo">TODO</span> when reestimating P_STOP etc, remove rules with p &lt; epsilon &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
304 </li>
305 <li id="sec-6.3"><span class="todo">TODO</span> inner_dmv, short ranges and impossible attachment &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
306 If s-t &lt;= 2, there can be only one attachment below, so don't recurse
307 with both Lattach=True and Rattach=True.
310 If s-t &lt;= 1, there can be no attachment below, so only recurse with
311 Lattach=False, Rattach=False.
312 </p>
314 Put this in the loop under rewrite rules (could also do it in the STOP
315 section, but that would only have an effect on very short sentences).
316 </p></li>
317 <li id="sec-6.4"><span class="todo">TODO</span> clean up the module files &nbsp;&nbsp;&nbsp;<span class="tag">PRETTIER</span><br/>
318 Is there better way to divide dmv and harmonic? There's a two-way
319 dependency between the modules. Guess there could be a third file that
320 imports both the initialization and the actual EM stuff, while a file
321 containing constants and classes could be imported by all others:
322 <pre>
323 dmv.py imports dmv_EM.py imports dmv_classes.py
324 dmv.py imports dmv_inits.py imports dmv_classes.py
325 </pre>
327 </li>
328 <li id="sec-6.5"><span class="todo">TOGROK</span> Some (tagged) sentences are bound to come twice &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
329 Eg, first sort and count, so that the corpus
330 [['nn','vbd','det','nn'],
331 ['vbd','nn','det','nn'],
332 ['nn','vbd','det','nn']]
333 becomes
334 [(['nn','vbd','det','nn'],2),
335 (['vbd','nn','det','nn'],1)]
336 and then in each loop through sentences, make sure we handle the
337 frequency correctly.
340 Is there much to gain here?
341 </p>
342 </li>
343 <li id="sec-6.6"><span class="todo">TOGROK</span> tags as numbers or tags as strings? &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
344 Need to clean up the representation.
347 Stick with tag-strings in initialization then switch to numbers for
348 IO-algorithm perhaps? Can probably afford more string-matching in
349 initialization..
350 </p></li>
351 </ul>
352 </div>
354 </div>
356 <div id="outline-container-7" class="outline-2">
357 <h2 id="sec-7">7 Expectation Maximation in IO/DMV-terms</h2>
358 <div id="text-7">
360 <p>inner(s,t,LHS) calculates the expected number of trees headed by LHS
361 from s to t (sentence positions). This uses the P_STOP and P_CHOOSE
362 values, which have been conveniently distributed into CNF rules as
363 probN and probA (non-adjacent and adjacent probabilites).
364 </p>
366 When re-estimating, we use the expected values from inner() to get new
367 values for P_STOP and P_CHOOSE. When we've re-estimated for the entire
368 corpus, we distribute P_STOP and P_CHOOSE into the CNF rules again, so
369 that in the next round we use new probN and probA to find
370 inner-probabilites.
371 </p>
373 The distribution of P_STOP and P_CHOOSE into CNF rules also happens in
374 init_normalize() (here along with the creation of P_STOP and
375 P_CHOOSE); P_STOP is used to create CNF rules where one branch of the
376 rule is STOP, P_CHOOSE is used to create rules of the form
377 <pre>
378 h -&gt; h _a_
379 h_ -&gt; h_ _a_
380 </pre>
381 </p>
383 Since "adjacency" is not captured in regular CNF rules, we need two
384 probabilites for each rule, and inner() has to know when to use which.
385 </p>
386 <ul>
387 <li id="sec-7.1"><span class="todo">TODO</span> Corpus access<br/>
388 </li>
389 <li id="sec-7.2"><span class="todo">TOGROK</span> sentences or rules as the "outer loop"? &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
390 In regard to the E/M-step, finding P<sub>STOP</sub>, P<sub>CHOOSE</sub>.
393 </li>
394 </ul>
395 </div>
397 </div>
399 <div id="outline-container-8" class="outline-2">
400 <h2 id="sec-8">8 Python-stuff</h2>
401 <div id="text-8">
403 <ul>
404 <li>
405 <a href="src/pseudo.py">pseudo.py</a>
406 </li>
407 <li>
408 <a href="http://nltk.org/doc/en/structured-programming.html">http://nltk.org/doc/en/structured-programming.html</a> recursive dynamic
409 </li>
410 <li>
411 <a href="http://nltk.org/doc/en/advanced-parsing.html">http://nltk.org/doc/en/advanced-parsing.html</a>
412 </li>
413 <li>
414 <a href="http://jaynes.colorado.edu/PythonIdioms.html">http://jaynes.colorado.edu/PythonIdioms.html</a>
417 </li>
418 </ul>
419 </div>
420 </div>
421 <div id="postamble"><p class="author"> Author: Kevin Brubeck Unhammer
422 <a href="mailto:K.BrubeckUnhammer at student uva nl ">&lt;K.BrubeckUnhammer at student uva nl &gt;</a>
423 </p>
424 <p class="date"> Date: 2008/06/04 15:01:15</p>
425 </div><p class="postamble">Skrive vha. emacs + <a href='http://orgmode.org/'>org-mode</a></p></body>
426 </html>