a bit better, but still not sure P_STOP does everything right
[dmvccm.git] / DMVCCM.html
bloba467d641ec11cb35e6443506dcb6f2fcfd344594
1 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
2 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
3 <html xmlns="http://www.w3.org/1999/xhtml"
4 lang="en" xml:lang="en">
5 <head>
6 <title>DMV/CCM &ndash; todo-list / progress</title>
7 <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
8 <meta name="generator" content="Org-mode"/>
9 <meta name="generated" content="2008/06/08 12:43:26"/>
10 <meta name="author" content="Kevin Brubeck Unhammer"/>
11 <link rel="stylesheet" type="text/css" href="http://www.student.uib.no/~kun041/org.css">
12 </head><body>
13 <h1 class="title">DMV/CCM &ndash; todo-list / progress</h1>
16 <div id="table-of-contents">
17 <h2>Table of Contents</h2>
18 <div id="text-table-of-contents">
19 <ul>
20 <li><a href="#sec-1">1 dmvccm report and project</a></li>
21 <li><a href="#sec-2">2 [#A] P_STOP and P_CHOOSE for IO/EM</a></li>
22 <li><a href="#sec-3">3 Initialization </a></li>
23 <li><a href="#sec-4">4 [#C] Deferred</a></li>
24 <li><a href="#sec-5">5 Adjacency and combining it with inner()</a></li>
25 <li><a href="#sec-6">6 Expectation Maximation in IO/DMV-terms</a></li>
26 <li><a href="#sec-7">7 Python-stuff</a></li>
27 <li><a href="#sec-8">8 Git</a></li>
28 </ul>
29 </div>
30 </div>
32 <div id="outline-container-1" class="outline-2">
33 <h2 id="sec-1">1 dmvccm report and project</h2>
34 <div id="text-1">
36 <p><span class="timestamp-kwd">DEADLINE: </span> <span class="timestamp">2008-06-30 Mon</span><br/>
37 But absolute, extended, really-quite-dead-now deadline: August
38 31&hellip;
39 </p><ul>
40 <li>
41 <a href="src/dmv.py">dmv.py</a>
42 </li>
43 <li>
44 <a href="src/io.py">io.py</a>
45 </li>
46 <li>
47 <a href="src/harmonic.py">harmonic.py</a>
50 </li>
51 </ul>
52 </div>
54 </div>
56 <div id="outline-container-2" class="outline-2">
57 <h2 id="sec-2">2 <span class="todo">TODO</span> [#A] P_STOP and P_CHOOSE for IO/EM</h2>
58 <div id="text-2">
60 <p><a href="src/dmv.py">dmv-P_STOP</a>
61 Remember: The P<sub>STOP</sub> formula is upside-down (left-to-right also).
62 (In the article..not the thesis)
63 </p>
64 <p>
65 Remember: Initialization makes some "short-cut" rules, these will also
66 have to be updated along with the other P<sub>STOP</sub> updates:
67 </p><ul>
68 <li>
69 b[(NOBAR, n<sub>h</sub>), 'h'] = 1.0 # always
70 </li>
71 <li>
72 b[(RBAR, n<sub>h</sub>), 'h'] = h_.probA # h_ is RBAR stop rule
73 </li>
74 <li>
75 b[(LRBAR, n<sub>h</sub>), 'h'] = h_.probA * _ h_.probA
77 </li>
78 <li id="sec-2.1">P_STOP formulas for various dir and adj:<br/>
79 Assuming this:
80 <ul>
81 <li>
82 P<sub>STOP</sub>(STOP|h,L,non_adj) = &sum;<sub>corpus</sub> &sum;<sub>s&lt;loc(h)</sub> &sum;<sub>t</sub>
83 inner(s,t,_h_, &hellip;) / &sum;<sub>corpus</sub> &sum;<sub>s&lt;loc(h)</sub> &sum;<sub>t</sub>
84 inner(s,t,h_, &hellip;)
85 </li>
86 <li>
87 P<sub>STOP</sub>(STOP|h,L,adj) = &sum;<sub>corpus</sub> &sum;<sub>s=loc(h)</sub> &sum;<sub>t</sub>
88 inner(s,t,_h_, &hellip;) / &sum;<sub>corpus</sub> &sum;<sub>s=loc(h)</sub> &sum;<sub>t</sub>
89 inner(s,t,h_, &hellip;)
90 </li>
91 <li>
92 P<sub>STOP</sub>(STOP|h,R,non_adj) = &sum;<sub>corpus</sub> &sum;<sub>s</sub> &sum;<sub>t&gt;loc(h)</sub>
93 inner(s,t,_h_, &hellip;) / &sum;<sub>corpus</sub> &sum;<sub>s</sub> &sum;<sub>t&gt;loc(h)</sub>
94 inner(s,t,h_, &hellip;)
95 </li>
96 <li>
97 P<sub>STOP</sub>(STOP|h,R,adj) = &sum;<sub>corpus</sub> &sum;<sub>s</sub> &sum;<sub>t=loc(h)</sub>
98 inner(s,t,_h_, &hellip;) / &sum;<sub>corpus</sub> &sum;<sub>s</sub> &sum;<sub>t=loc(h)</sub>
99 inner(s,t,h_, &hellip;)
101 </li>
102 </ul>
104 <p>(And P<sub>STOP</sub>(-STOP|&hellip;) = 1 - P<sub>STOP</sub>(STOP|&hellip;) )
105 </p></li>
106 <li id="sec-2.2">P_CHOOSE formula:<br/>
107 Assuming this:
108 <ul>
109 <li>
110 P<sub>CHOOSE</sub>(a|h,L) = &sum;<sub>corpus</sub> &sum;<sub>s&lt;loc(h)</sub> &sum;<sub>t=loc(h)</sub> &sum;<sub>r&lt;t</sub>
111 inner(s,r,_a_, &hellip;) / &sum;<sub>corpus</sub> &sum;<sub>s&lt;loc(h)</sub> &sum;<sub>t=loc(h)</sub>
112 inner(s,t,h_,&hellip;)
113 <ul>
114 <li>
115 or t &gt;= loc(h)?
116 </li>
117 </ul>
118 </li>
119 <li>
120 P<sub>CHOOSE</sub>(a|h,R) = &sum;<sub>corpus</sub> &sum;<sub>s=loc(h)</sub> &sum;<sub>r&gt;s</sub>
121 inner(s,r,_a_,&hellip;) / &sum;<sub>corpus</sub> &sum;<sub>s=loc(h)</sub> &sum;<sub>t</sub>
122 inner(s,t,h, &hellip;)
123 <ul>
124 <li>
125 or s =&gt; loc(h)?
128 </li>
129 </ul>
130 </li>
131 </ul>
132 </li>
133 </ul>
134 </div>
136 </div>
138 <div id="outline-container-3" class="outline-2">
139 <h2 id="sec-3">3 Initialization </h2>
140 <div id="text-3">
142 <p><a href="/Users/kiwibird/Documents/Skole/V08/Probability/dmvccm/src/dmv.py">dmv-inits</a>
143 </p>
145 We do have to go through the corpus, since the probabilities are based
146 on how far away in the sentence arguments are from their heads.
147 </p><ul>
148 <li id="sec-3.1"><span class="todo">TODO</span> Separate initialization to another file? &nbsp;&nbsp;&nbsp;<span class="tag">PRETTIER</span><br/>
149 (It's rather messy.)
150 </li>
151 <li id="sec-3.2"><span class="todo">TOGROK</span> CCM Initialization <br/>
152 P<sub>SPLIT</sub> used here&hellip; how, again?
153 </li>
154 <li id="sec-3.3"><span class="done">DONE</span> DMV Initialization probabilities<br/>
155 (from initialization frequency)
156 </li>
157 <li id="sec-3.4"><span class="done">DONE</span> DMV Initialization frequencies <br/>
158 <span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-05-27 Tue 20:04</span><br/>
159 <ul>
160 <li id="sec-3.4.1">P_STOP <br/>
161 P<sub>STOP</sub> is not well defined by K&amp;M. One possible interpretation given
162 the sentence [det nn vb nn] is
163 <pre>
164 f_{STOP}( STOP|det, L, adj) +1
165 f_{STOP}(-STOP|det, L, adj) +0
166 f_{STOP}( STOP|det, L, non_adj) +1
167 f_{STOP}(-STOP|det, L, non_adj) +0
168 f_{STOP}( STOP|det, R, adj) +0
169 f_{STOP}(-STOP|det, R, adj) +1
171 f_{STOP}( STOP|nn, L, adj) +0
172 f_{STOP}(-STOP|nn, L, adj) +1
173 f_{STOP}( STOP|nn, L, non_adj) +1 # since there's at least one to the left
174 f_{STOP}(-STOP|nn, L, non_adj) +0
175 </pre>
176 <ul>
177 <li id="sec-3.4.1.1"><span class="todo">TODO</span> tweak<br/>
178 <pre>
179 f[head, 'STOP', 'LN'] += (i_h &lt;= 1) # first two words
180 f[head, '-STOP', 'LN'] += (not i_h &lt;= 1)
181 f[head, 'STOP', 'LA'] += (i_h == 0) # very first word
182 f[head, '-STOP', 'LA'] += (not i_h == 0)
183 f[head, 'STOP', 'RN'] += (i_h &gt;= n - 2) # last two words
184 f[head, '-STOP', 'RN'] += (not i_h &gt;= n - 2)
185 f[head, 'STOP', 'RA'] += (i_h == n - 1) # very last word
186 f[head, '-STOP', 'RA'] += (not i_h == n - 1)
187 </pre>
189 <pre>
190 # this one requires some additional rewriting since it
191 # introduces divisions by zero
192 f[head, 'STOP', 'LN'] += (i_h == 1) # second word
193 f[head, '-STOP', 'LN'] += (not i_h &lt;= 1) # not first two
194 f[head, 'STOP', 'LA'] += (i_h == 0) # first word
195 f[head, '-STOP', 'LA'] += (not i_h == 0) # not first
196 f[head, 'STOP', 'RN'] += (i_h == n - 2) # second-to-last
197 f[head, '-STOP', 'RN'] += (not i_h &gt;= n - 2) # not last two
198 f[head, 'STOP', 'RA'] += (i_h == n - 1) # last word
199 f[head, '-STOP', 'RA'] += (not i_h == n - 1) # not last
200 </pre>
202 <pre>
203 f[head, 'STOP', 'LN'] += (i_h == 1) # second word
204 f[head, '-STOP', 'LN'] += (not i_h == 1) # not second
205 f[head, 'STOP', 'LA'] += (i_h == 0) # first word
206 f[head, '-STOP', 'LA'] += (not i_h == 0) # not first
207 f[head, 'STOP', 'RN'] += (i_h == n - 2) # second-to-last
208 f[head, '-STOP', 'RN'] += (not i_h == n - 2) # not second-to-last
209 f[head, 'STOP', 'RA'] += (i_h == n - 1) # last word
210 f[head, '-STOP', 'RA'] += (not i_h == n - 1) # not last
211 </pre>
213 "all words take the same number of arguments" interpreted as
214 <pre>
215 for all heads:
216 p_STOP(head, 'STOP', 'LN') = 0.3
217 p_STOP(head, 'STOP', 'LA') = 0.5
218 p_STOP(head, 'STOP', 'RN') = 0.4
219 p_STOP(head, 'STOP', 'RA') = 0.7
220 </pre>
221 (which we easily may tweak in init_zeros())
222 </li>
223 </ul>
224 </li>
225 <li id="sec-3.4.2">P_CHOOSE<br/>
226 Go through the corpus, counting distances between heads and
227 arguments. In [det nn vb nn], we give
228 <ul>
229 <li>
230 f<sub>CHOOSE</sub>(nn|det, R) +1/1 + C
231 </li>
232 <li>
233 f<sub>CHOOSE</sub>(vb|det, R) +1/2 + C
234 </li>
235 <li>
236 f<sub>CHOOSE</sub>(nn|det, R) +1/3 + C
237 <ul>
238 <li>
239 If this were the full corpus, P<sub>CHOOSE</sub>(nn|det, R) would have
240 (1+1/3+2C) / sum_a f<sub>CHOOSE</sub>(a|det, R)
242 </li>
243 </ul>
244 </li>
245 </ul>
247 <p>The ROOT gets "each argument with equal probability", so in a sentence
248 of three words, 1/3 for each (in [nn vb nn], 'nn' gets 2/3). Basically
249 just a frequency count of the corpus&hellip;
250 </p></li>
251 </ul>
252 </li>
253 </ul>
254 </div>
256 </div>
258 <div id="outline-container-4" class="outline-2">
259 <h2 id="sec-4">4 [#C] Deferred</h2>
260 <div id="text-4">
262 <ul>
263 <li id="sec-4.1"><span class="todo">TODO</span> if loc_h == t, no need to try right-attachment rules &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
264 ** TODO if loc_h == s, no need to try left-attachment rules :OPTIMIZE:
265 </li>
266 <li id="sec-4.3"><span class="done">DONE</span> inner_dmv() should disregard rules with heads not in sent &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
267 <span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-06-08 Sun 10:18</span><br/>
268 If the sentence is "nn vbd det nn", we should not even look at rules
269 where
270 <pre>
271 rule.head() not in "nn vbd det nn".split()
272 </pre>
273 This is ruled out by getting rules from g.rules(LHS, sent).
276 Also, we optimize this further by saying we don't even recurse into
277 attachment rules where
278 <pre>
279 rule.head() not in sent[ s :r+1]
280 rule.head() not in sent[r+1:t+1]
281 </pre>
282 meaning, if we're looking at the span "vbd det", we only use
283 attachment rules where both daughters are members of ['vbd','det']
284 (although we don't (yet) care about removing rules that rewrite to the
285 same tag if there are no duplicate tags in the span, etc., that would
286 be a lot of trouble for little potential gain).
287 </p></li>
288 <li id="sec-4.4"><span class="todo">TODO</span> when reestimating P_STOP etc, remove rules with p &lt; epsilon &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
289 </li>
290 <li id="sec-4.5"><span class="todo">TODO</span> inner_dmv, short ranges and impossible attachment &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
291 If s-t &lt;= 2, there can be only one attachment below, so don't recurse
292 with both Lattach=True and Rattach=True.
295 If s-t &lt;= 1, there can be no attachment below, so only recurse with
296 Lattach=False, Rattach=False.
297 </p>
299 Put this in the loop under rewrite rules (could also do it in the STOP
300 section, but that would only have an effect on very short sentences).
301 </p></li>
302 <li id="sec-4.6"><span class="todo">TODO</span> clean up the module files &nbsp;&nbsp;&nbsp;<span class="tag">PRETTIER</span><br/>
303 Is there better way to divide dmv and harmonic? There's a two-way
304 dependency between the modules. Guess there could be a third file that
305 imports both the initialization and the actual EM stuff, while a file
306 containing constants and classes could be imported by all others:
307 <pre>
308 dmv.py imports dmv_EM.py imports dmv_classes.py
309 dmv.py imports dmv_inits.py imports dmv_classes.py
310 </pre>
312 </li>
313 <li id="sec-4.7"><span class="todo">TOGROK</span> Some (tagged) sentences are bound to come twice &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
314 Eg, first sort and count, so that the corpus
315 [['nn','vbd','det','nn'],
316 ['vbd','nn','det','nn'],
317 ['nn','vbd','det','nn']]
318 becomes
319 [(['nn','vbd','det','nn'],2),
320 (['vbd','nn','det','nn'],1)]
321 and then in each loop through sentences, make sure we handle the
322 frequency correctly.
325 Is there much to gain here?
326 </p>
327 </li>
328 <li id="sec-4.8"><span class="todo">TOGROK</span> tags as numbers or tags as strings? &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
329 Need to clean up the representation.
332 Stick with tag-strings in initialization then switch to numbers for
333 IO-algorithm perhaps? Can probably afford more string-matching in
334 initialization..
335 </p></li>
336 </ul>
337 </div>
339 </div>
341 <div id="outline-container-5" class="outline-2">
342 <h2 id="sec-5">5 Adjacency and combining it with inner()</h2>
343 <div id="text-5">
345 <p>Each DMV_Rule now has both a probN and a probA, for
346 adjacencies. inner() needs the correct one in each case.
347 </p>
349 Adjacency gives a problem with duplicate words/tags, eg. in the
350 sentence "a a b". If this has the dependency structure b-&gt;a<sub>0</sub>-&gt;a<sub>1</sub>,
351 then b is non-adjacent to a<sub>0</sub> and should use probN (for the LRStop and
352 the attachment of a<sub>0</sub>), while the other rules should all use
353 probA. But within the e(0,2,b) we can't just say "oh, a has index 0
354 so it's not adjacent to 2", since there's also an a at index 1, and
355 there's also a dependency structure b-&gt;a<sub>1</sub>-&gt;a<sub>0</sub> for that. We want
356 both. And in possibly much more complex versions.
357 </p>
359 So now we call inner() for each location of a head, and on each
360 terminal, loc_h must equal s (and t). In the recursive attachment
361 calls, we use the locations (sentence indices) of words to the left or
362 right of the head in calls to inner(). loc_h lets us check whether we
363 need probN or probA.
364 </p>
366 In each inner() call, loc_h is the location of the root of this
367 dependency structure.
368 </p><ul>
369 <li id="sec-5.1"><span class="done">DONE</span> [#A] test and debug my brilliant idea<br/>
370 <span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-06-08 Sun 10:28</span><br/>
371 </li>
372 <li id="sec-5.2"><span class="done">DONE</span> implement my brilliant idea.<br/>
373 <span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-06-01 Sun 17:19</span><br/>
374 <a href="src/dmv.py">e(sti) in dmv.py</a>
376 </li>
377 <li id="sec-5.3"><span class="done">DONE</span> [#A] test inner() on sentences with duplicate words<br/>
378 Works with eg. the sentence "h h h"
379 </li>
380 </ul>
381 </div>
383 </div>
385 <div id="outline-container-6" class="outline-2">
386 <h2 id="sec-6">6 Expectation Maximation in IO/DMV-terms</h2>
387 <div id="text-6">
389 <p>inner(s,t,LHS) calculates the expected number of trees headed by LHS
390 from s to t (sentence positions). This uses the P_STOP and P_CHOOSE
391 values, which have been conveniently distributed into CNF rules as
392 probN and probA (non-adjacent and adjacent probabilites).
393 </p>
395 When re-estimating, we use the expected values from inner() to get new
396 values for P_STOP and P_CHOOSE. When we've re-estimated for the entire
397 corpus, we distribute P_STOP and P_CHOOSE into the CNF rules again, so
398 that in the next round we use new probN and probA to find
399 inner-probabilites.
400 </p>
402 The distribution of P_STOP and P_CHOOSE into CNF rules also happens in
403 init_normalize() (here along with the creation of P_STOP and
404 P_CHOOSE); P_STOP is used to create CNF rules where one branch of the
405 rule is STOP, P_CHOOSE is used to create rules of the form
406 <pre>
407 h -&gt; h _a_
408 h_ -&gt; h_ _a_
409 </pre>
410 </p>
412 Since "adjacency" is not captured in regular CNF rules, we need two
413 probabilites for each rule, and inner() has to know when to use which.
414 </p>
415 <ul>
416 <li id="sec-6.1"><span class="todo">TODO</span> Corpus access<br/>
417 </li>
418 <li id="sec-6.2"><span class="todo">TOGROK</span> sentences or rules as the "outer loop"? &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
419 In regard to the E/M-step, finding P<sub>STOP</sub>, P<sub>CHOOSE</sub>.
422 </li>
423 </ul>
424 </div>
426 </div>
428 <div id="outline-container-7" class="outline-2">
429 <h2 id="sec-7">7 Python-stuff</h2>
430 <div id="text-7">
432 <ul>
433 <li>
434 <a href="src/pseudo.py">pseudo.py</a>
435 </li>
436 <li>
437 <a href="http://nltk.org/doc/en/structured-programming.html">http://nltk.org/doc/en/structured-programming.html</a> recursive dynamic
438 </li>
439 <li>
440 <a href="http://nltk.org/doc/en/advanced-parsing.html">http://nltk.org/doc/en/advanced-parsing.html</a>
441 </li>
442 <li>
443 <a href="http://jaynes.colorado.edu/PythonIdioms.html">http://jaynes.colorado.edu/PythonIdioms.html</a>
447 </li>
448 </ul>
449 </div>
451 </div>
453 <div id="outline-container-8" class="outline-2">
454 <h2 id="sec-8">8 Git</h2>
455 <div id="text-8">
457 <p>Setting up a new project:
458 <pre>
459 git init
460 git add .
461 git commit -m "first release"
462 </pre>
463 </p>
465 Later on: (<code>-a</code> does <code>git rm</code> and <code>git add</code> automatically)
466 <pre>
467 git init
468 git commit -a -m "some subsequent release"
469 </pre>
470 </p>
472 Then push stuff up to the remote server:
473 <pre>
474 git push git+ssh://username@repo.or.cz/srv/git/dmvccm.git master
475 </pre>
476 </p>
478 (<code>eval `ssh-agent`</code> and <code>ssh-add</code> to avoid having to type in keyphrase all
479 the time)
480 </p>
482 Make a copy of the (remote) master branch:
483 <pre>
484 git clone git://repo.or.cz/dmvccm.git
485 </pre>
486 </p>
488 Make and name a new branch in this folder
489 <pre>
490 git checkout -b mybranch
491 </pre>
492 </p>
494 To save changes in <code>mybranch</code>:
495 <pre>
496 git commit -a
497 </pre>
498 </p>
500 Go back to the master branch (uncommitted changes from <code>mybranch</code> are
501 carried over):
502 <pre>
503 git checkout master
504 </pre>
505 </p>
507 Try out:
508 <pre>
509 git add --interactive
510 </pre>
511 </p>
513 Good tutorial:
514 <a href="http://www-cs-students.stanford.edu/~blynn//gitmagic/">http://www-cs-students.stanford.edu/~blynn//gitmagic/</a>
515 </p></div>
516 </div>
517 <div id="postamble"><p class="author"> Author: Kevin Brubeck Unhammer
518 <a href="mailto:K.BrubeckUnhammer at student uva nl ">&lt;K.BrubeckUnhammer at student uva nl &gt;</a>
519 </p>
520 <p class="date"> Date: 2008/06/08 12:43:26</p>
521 </div><p class="postamble">Skrive vha. emacs + <a href='http://orgmode.org/'>org-mode</a></p></body>
522 </html>