some fixes to outer
[dmvccm.git] / DMVCCM.html
blobd7d08d1ab1a46073a975daf3be5929faf2b3b21b
1 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
2 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
3 <html xmlns="http://www.w3.org/1999/xhtml"
4 lang="en" xml:lang="en">
5 <head>
6 <title>DMV/CCM &ndash; todo-list / progress</title>
7 <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
8 <meta name="generator" content="Org-mode"/>
9 <meta name="generated" content="2008/06/11 16:40:21"/>
10 <meta name="author" content="Kevin Brubeck Unhammer"/>
11 <link rel="stylesheet" type="text/css" href="http://www.student.uib.no/~kun041/org.css">
12 </head><body>
13 <h1 class="title">DMV/CCM &ndash; todo-list / progress</h1>
16 <div id="table-of-contents">
17 <h2>Table of Contents</h2>
18 <div id="text-table-of-contents">
19 <ul>
20 <li><a href="#sec-1">1 dmvccm report and project</a></li>
21 <li><a href="#sec-2">2 [#A] outer probabilities</a></li>
22 <li><a href="#sec-3">3 [#B] P_STOP and P_CHOOSE for IO/EM (reestimation)</a></li>
23 <li><a href="#sec-4">4 Find Most Probable Parse of given test sentence</a></li>
24 <li><a href="#sec-5">5 Combine CCM with DMV</a></li>
25 <li><a href="#sec-6">6 Get a dependency parsed version of WSJ to test with</a></li>
26 <li><a href="#sec-7">7 Initialization </a></li>
27 <li><a href="#sec-8">8 [#C] Deferred</a></li>
28 <li><a href="#sec-9">9 Adjacency and combining it with inner()</a></li>
29 <li><a href="#sec-10">10 Expectation Maximation in IO/DMV-terms</a></li>
30 <li><a href="#sec-11">11 Python-stuff</a></li>
31 <li><a href="#sec-12">12 Git</a></li>
32 </ul>
33 </div>
34 </div>
36 <div id="outline-container-1" class="outline-2">
37 <h2 id="sec-1">1 dmvccm report and project</h2>
38 <div id="text-1">
40 <p><span class="timestamp-kwd">DEADLINE: </span> <span class="timestamp">2008-06-30 Mon</span><br/>
41 </p><ul>
42 <li>
43 <a href="src/dmv.py">dmv.py</a>
44 </li>
45 <li>
46 <a href="src/io.py">io.py</a>
47 </li>
48 <li>
49 <a href="src/harmonic.py">harmonic.py</a>
51 </li>
52 </ul>
54 <p>(But absolute, extended, really-quite-dead-now deadline: August 31&hellip;)
55 </p></div>
57 </div>
59 <div id="outline-container-2" class="outline-2">
60 <h2 id="sec-2">2 <span class="todo">TODO</span> [#A] outer probabilities</h2>
61 <div id="text-2">
63 <p>There are 6 different configurations here, based on what the above
64 (mother) node is, and what the Node for which we're computing is.
65 Here <b>r</b> is not between <b>s</b> and <b>t</b> as in inner(), but an <i>outer</i> index. <b>loc_N</b>
66 is the location of the Node head in the sentence, <b>loc_m</b> for the head
67 of the mother rule.
68 </p>
69 <ul>
70 <li>
71 mother is a RIGHT-stop:
72 <ul>
73 <li>
74 outer(<b>s, t</b>, mom.LHS, <b>loc_N</b>), no inner-call
75 </li>
76 <li>
77 adjacent iff <b>t</b> == <b>loc_m</b>
78 </li>
79 </ul>
80 </li>
81 <li>
82 mother is a LEFT-stop:
83 <ul>
84 <li>
85 outer(<b>s, t</b>, mom.LHS, <b>loc_N</b>), no inner-call
86 </li>
87 <li>
88 adjacent iff <b>s</b> == <b>loc_m</b>
90 </li>
91 </ul>
92 </li>
93 <li>
94 Node is on the LEFT branch (mom.L == Node)
95 <ul>
96 <li>
97 and mother is a LEFT attachment:
98 <ul>
99 <li>
100 <b>loc_N</b> will be in the LEFT branch, can be anything here.
101 </li>
102 <li>
103 In the RIGHT, attached, branch we find inner(<b>t+1, r</b>, mom.R, <b>loc_m</b>) for
104 all possible <b>loc_m</b> in the right part of the sentence.
105 </li>
106 <li>
107 outer(<b>s, r</b>, mom.LHS, <b>loc_m</b>).
108 </li>
109 <li>
110 adjacent iff <b>t+1</b> == <b>loc_m</b>
111 </li>
112 </ul>
113 </li>
114 <li>
115 and mother is a RIGHT attachment:
116 <ul>
117 <li>
118 <b>loc_m</b> = <b>loc_N</b>.
119 </li>
120 <li>
121 In the RIGHT, attached, branch we find inner(<b>t+1, r</b>, mom.R, <b>loc_R</b>) for
122 all possible <b>loc_R</b> in the right part of the sentence.
123 </li>
124 <li>
125 outer(<b>s, r</b>, mom.LHS, <b>loc_N</b>).
126 </li>
127 <li>
128 adjacent iff <b>t</b> == <b>loc_m</b>
130 </li>
131 </ul>
132 </li>
133 </ul>
134 </li>
135 <li>
136 Node is on the RIGHT branch (mom.R == Node)
137 <ul>
138 <li>
139 and mother is a LEFT attachment:
140 <ul>
141 <li>
142 <b>loc_m</b> = <b>loc_N</b>.
143 </li>
144 <li>
145 In the LEFT, attached, branch we find inner(<b>r, s-1</b>, mom.L, <b>loc_L</b>) for
146 all possible <b>loc_L</b> in the left part of the sentence.
147 </li>
148 <li>
149 outer(<b>r, t</b>, mom.LHS, <b>loc_m</b>).
150 </li>
151 <li>
152 adjacent iff <b>s</b> == <b>loc_m</b>
153 </li>
154 </ul>
155 </li>
156 <li>
157 and mother is a RIGHT attachment:
158 <ul>
159 <li>
160 <b>loc_N</b> will be in the RIGHT branch, can be anything here.
161 </li>
162 <li>
163 In the RIGHT, attached, branch we find inner(<b>r, s-1</b>, mom.L, <b>loc_m</b>) for
164 all possible <b>loc_m</b> in the right part of the sentence.
165 </li>
166 <li>
167 outer(<b>r, t</b>, mom.LHS, <b>loc_N</b>).
168 </li>
169 <li>
170 adjacent iff <b>s-1</b> == <b>loc_m</b>
172 </li>
173 </ul>
174 </li>
175 </ul>
176 </li>
177 </ul>
179 <p><img src="outer_attachments.jpg"/>
180 </p>
183 </div>
185 </div>
187 <div id="outline-container-3" class="outline-2">
188 <h2 id="sec-3">3 <span class="todo">TODO</span> [#B] P_STOP and P_CHOOSE for IO/EM (reestimation)</h2>
189 <div id="text-3">
191 <p><a href="src/dmv.py">dmv-P_STOP</a>
192 Remember: The P<sub>STOP</sub> formula is upside-down (left-to-right also).
193 (In the article..not the thesis)
194 </p>
195 <ul>
196 <li id="sec-3.1"><span class="todo">TOGROK</span> [#A] How do we only count from completed trees?<br/>
197 The chart from inner() contains several entries where the only ways to
198 reach them (from above) are through nodes that have zero probability;
199 we don't want these to give counts for reestimation.
202 prune() weeds out entries with only zero-probability mothers (or
203 grandmothers), but is incredibly slow (reestimating with pruning takes
204 almost 50 times longer).
205 </p>
207 Another solution is to add
208 </p></li>
209 <li id="sec-3.2">P_STOP formulas for various dir and adj:<br/>
210 Assuming this:
211 <ul>
212 <li>
213 P<sub>STOP</sub>(STOP|h,L,non_adj) = &sum;<sub>corpus</sub> &sum;<sub>s&lt;loc(h)</sub> &sum;<sub>t</sub>
214 inner(s,t,_h_, &hellip;) / &sum;<sub>corpus</sub> &sum;<sub>s&lt;loc(h)</sub> &sum;<sub>t</sub>
215 inner(s,t,h_, &hellip;)
216 </li>
217 <li>
218 P<sub>STOP</sub>(STOP|h,L,adj) = &sum;<sub>corpus</sub> &sum;<sub>s=loc(h)</sub> &sum;<sub>t</sub>
219 inner(s,t,_h_, &hellip;) / &sum;<sub>corpus</sub> &sum;<sub>s=loc(h)</sub> &sum;<sub>t</sub>
220 inner(s,t,h_, &hellip;)
221 </li>
222 <li>
223 P<sub>STOP</sub>(STOP|h,R,non_adj) = &sum;<sub>corpus</sub> &sum;<sub>s</sub> &sum;<sub>t&gt;loc(h)</sub>
224 inner(s,t,_h_, &hellip;) / &sum;<sub>corpus</sub> &sum;<sub>s</sub> &sum;<sub>t&gt;loc(h)</sub>
225 inner(s,t,h_, &hellip;)
226 </li>
227 <li>
228 P<sub>STOP</sub>(STOP|h,R,adj) = &sum;<sub>corpus</sub> &sum;<sub>s</sub> &sum;<sub>t=loc(h)</sub>
229 inner(s,t,_h_, &hellip;) / &sum;<sub>corpus</sub> &sum;<sub>s</sub> &sum;<sub>t=loc(h)</sub>
230 inner(s,t,h_, &hellip;)
232 </li>
233 </ul>
235 <p>(And P<sub>STOP</sub>(-STOP|&hellip;) = 1 - P<sub>STOP</sub>(STOP|&hellip;) )
236 </p></li>
237 <li id="sec-3.3">P_CHOOSE formula:<br/>
238 Assuming this:
239 <ul>
240 <li>
241 P<sub>CHOOSE</sub>(a|h,L) = &sum;<sub>corpus</sub> &sum;<sub>s&lt;loc(h)</sub> &sum;<sub>t&gt;=loc(h)</sub> &sum;<sub>r&lt;t</sub>
242 inner(s,r,_a_, &hellip;) / &sum;<sub>corpus</sub> &sum;<sub>s&lt;loc(h)</sub> &sum;<sub>t&gt;=loc(h)</sub>
243 inner(s,t,h_,&hellip;)
244 <ul>
245 <li>
246 t &gt;= loc(h) since there are many possibilites for
247 right-attachments below, and each of them alone gives a lower
248 probability (through multiplication) to the upper tree (so sum
249 them all)
250 </li>
251 </ul>
252 </li>
253 <li>
254 P<sub>CHOOSE</sub>(a|h,R) = &sum;<sub>corpus</sub> &sum;<sub>s=loc(h)</sub> &sum;<sub>r&gt;s</sub>
255 inner(s,r,_a_,&hellip;) / &sum;<sub>corpus</sub> &sum;<sub>s=loc(h)</sub> &sum;<sub>t</sub>
256 inner(s,t,h, &hellip;)
258 </li>
259 </ul>
260 </li>
261 </ul>
262 </div>
264 </div>
266 <div id="outline-container-4" class="outline-2">
267 <h2 id="sec-4">4 <span class="todo">TOGROK</span> Find Most Probable Parse of given test sentence</h2>
268 <div id="text-4">
270 <p>We could probably do it pretty easily from the chart:
271 </p><ul>
272 <li>
273 for all loc_h, select the (0,len(sent),ROOT,loc_h) that has the highest p
274 </li>
275 <li>
276 for stops, just keep going
277 </li>
278 <li>
279 for rewrites, for loc_dep select the one that has the highest p
280 </li>
281 </ul>
282 </div>
284 </div>
286 <div id="outline-container-5" class="outline-2">
287 <h2 id="sec-5">5 <span class="todo">TOGROK</span> Combine CCM with DMV</h2>
288 <div id="text-5">
290 </div>
292 </div>
294 <div id="outline-container-6" class="outline-2">
295 <h2 id="sec-6">6 <span class="todo">TODO</span> Get a dependency parsed version of WSJ to test with</h2>
296 <div id="text-6">
298 </div>
300 </div>
302 <div id="outline-container-7" class="outline-2">
303 <h2 id="sec-7">7 Initialization </h2>
304 <div id="text-7">
306 <p><a href="/Users/kiwibird/Documents/Skole/V08/Probability/dmvccm/src/dmv.py">dmv-inits</a>
307 </p>
309 We do have to go through the corpus, since the probabilities are based
310 on how far away in the sentence arguments are from their heads.
311 </p><ul>
312 <li id="sec-7.1"><span class="todo">TOGROK</span> CCM Initialization <br/>
313 P<sub>SPLIT</sub> used here&hellip; how, again?
314 </li>
315 <li id="sec-7.2"><span class="done">DONE</span> Separate initialization to another file? &nbsp;&nbsp;&nbsp;<span class="tag">PRETTIER</span><br/>
316 <span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-06-08 Sun 12:51</span><br/>
317 <a href="src/harmonic.py">harmonic.py</a>
318 </li>
319 <li id="sec-7.3"><span class="done">DONE</span> DMV Initialization probabilities<br/>
320 (from initialization frequency)
321 </li>
322 <li id="sec-7.4"><span class="done">DONE</span> DMV Initialization frequencies <br/>
323 <span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-05-27 Tue 20:04</span><br/>
324 <ul>
325 <li id="sec-7.4.1">P_STOP <br/>
326 P<sub>STOP</sub> is not well defined by K&amp;M. One possible interpretation given
327 the sentence [det nn vb nn] is
328 <pre>
329 f_{STOP}( STOP|det, L, adj) +1
330 f_{STOP}(-STOP|det, L, adj) +0
331 f_{STOP}( STOP|det, L, non_adj) +1
332 f_{STOP}(-STOP|det, L, non_adj) +0
333 f_{STOP}( STOP|det, R, adj) +0
334 f_{STOP}(-STOP|det, R, adj) +1
336 f_{STOP}( STOP|nn, L, adj) +0
337 f_{STOP}(-STOP|nn, L, adj) +1
338 f_{STOP}( STOP|nn, L, non_adj) +1 # since there's at least one to the left
339 f_{STOP}(-STOP|nn, L, non_adj) +0
340 </pre>
341 <ul>
342 <li id="sec-7.4.1.1"><span class="todo">TODO</span> tweak<br/>
343 <pre>
344 f[head, 'STOP', 'LN'] += (i_h &lt;= 1) # first two words
345 f[head, '-STOP', 'LN'] += (not i_h &lt;= 1)
346 f[head, 'STOP', 'LA'] += (i_h == 0) # very first word
347 f[head, '-STOP', 'LA'] += (not i_h == 0)
348 f[head, 'STOP', 'RN'] += (i_h &gt;= n - 2) # last two words
349 f[head, '-STOP', 'RN'] += (not i_h &gt;= n - 2)
350 f[head, 'STOP', 'RA'] += (i_h == n - 1) # very last word
351 f[head, '-STOP', 'RA'] += (not i_h == n - 1)
352 </pre>
354 <pre>
355 # this one requires some additional rewriting since it
356 # introduces divisions by zero
357 f[head, 'STOP', 'LN'] += (i_h == 1) # second word
358 f[head, '-STOP', 'LN'] += (not i_h &lt;= 1) # not first two
359 f[head, 'STOP', 'LA'] += (i_h == 0) # first word
360 f[head, '-STOP', 'LA'] += (not i_h == 0) # not first
361 f[head, 'STOP', 'RN'] += (i_h == n - 2) # second-to-last
362 f[head, '-STOP', 'RN'] += (not i_h &gt;= n - 2) # not last two
363 f[head, 'STOP', 'RA'] += (i_h == n - 1) # last word
364 f[head, '-STOP', 'RA'] += (not i_h == n - 1) # not last
365 </pre>
367 <pre>
368 f[head, 'STOP', 'LN'] += (i_h == 1) # second word
369 f[head, '-STOP', 'LN'] += (not i_h == 1) # not second
370 f[head, 'STOP', 'LA'] += (i_h == 0) # first word
371 f[head, '-STOP', 'LA'] += (not i_h == 0) # not first
372 f[head, 'STOP', 'RN'] += (i_h == n - 2) # second-to-last
373 f[head, '-STOP', 'RN'] += (not i_h == n - 2) # not second-to-last
374 f[head, 'STOP', 'RA'] += (i_h == n - 1) # last word
375 f[head, '-STOP', 'RA'] += (not i_h == n - 1) # not last
376 </pre>
378 "all words take the same number of arguments" interpreted as
379 <pre>
380 for all heads:
381 p_STOP(head, 'STOP', 'LN') = 0.3
382 p_STOP(head, 'STOP', 'LA') = 0.5
383 p_STOP(head, 'STOP', 'RN') = 0.4
384 p_STOP(head, 'STOP', 'RA') = 0.7
385 </pre>
386 (which we easily may tweak in init_zeros())
387 </li>
388 </ul>
389 </li>
390 <li id="sec-7.4.2">P_CHOOSE<br/>
391 Go through the corpus, counting distances between heads and
392 arguments. In [det nn vb nn], we give
393 <ul>
394 <li>
395 f<sub>CHOOSE</sub>(nn|det, R) +1/1 + C
396 </li>
397 <li>
398 f<sub>CHOOSE</sub>(vb|det, R) +1/2 + C
399 </li>
400 <li>
401 f<sub>CHOOSE</sub>(nn|det, R) +1/3 + C
402 <ul>
403 <li>
404 If this were the full corpus, P<sub>CHOOSE</sub>(nn|det, R) would have
405 (1+1/3+2C) / sum_a f<sub>CHOOSE</sub>(a|det, R)
407 </li>
408 </ul>
409 </li>
410 </ul>
412 <p>The ROOT gets "each argument with equal probability", so in a sentence
413 of three words, 1/3 for each (in [nn vb nn], 'nn' gets 2/3). Basically
414 just a frequency count of the corpus&hellip;
415 </p></li>
416 </ul>
417 </li>
418 </ul>
419 </div>
421 </div>
423 <div id="outline-container-8" class="outline-2">
424 <h2 id="sec-8">8 [#C] Deferred</h2>
425 <div id="text-8">
427 <p><a href="http://wiki.python.org/moin/PythonSpeed/PerformanceTips">http://wiki.python.org/moin/PythonSpeed/PerformanceTips</a> Eg., use
428 map/reduce/filter/[i for i in [i's]]/(i for i in [i's]) instead of
429 for-loops; use local variables for globals (global variables or or
430 functions), etc.
431 </p>
432 <ul>
433 <li id="sec-8.1"><span class="todo">TODO</span> when reestimating P_STOP etc, remove rules with p &lt; epsilon &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
434 </li>
435 <li id="sec-8.2"><span class="todo">TODO</span> inner_dmv, short ranges and impossible attachment &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
436 If s-t &lt;= 2, there can be only one attachment below, so don't recurse
437 with both Lattach=True and Rattach=True.
440 If s-t &lt;= 1, there can be no attachment below, so only recurse with
441 Lattach=False, Rattach=False.
442 </p>
444 Put this in the loop under rewrite rules (could also do it in the STOP
445 section, but that would only have an effect on very short sentences).
446 </p></li>
447 <li id="sec-8.3"><span class="todo">TODO</span> clean up the module files &nbsp;&nbsp;&nbsp;<span class="tag">PRETTIER</span><br/>
448 Is there better way to divide dmv and harmonic? There's a two-way
449 dependency between the modules. Guess there could be a third file that
450 imports both the initialization and the actual EM stuff, while a file
451 containing constants and classes could be imported by all others:
452 <pre>
453 dmv.py imports dmv_EM.py imports dmv_classes.py
454 dmv.py imports dmv_inits.py imports dmv_classes.py
455 </pre>
457 </li>
458 <li id="sec-8.4"><span class="todo">TOGROK</span> Some (tagged) sentences are bound to come twice &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
459 Eg, first sort and count, so that the corpus
460 [['nn','vbd','det','nn'],
461 ['vbd','nn','det','nn'],
462 ['nn','vbd','det','nn']]
463 becomes
464 [(['nn','vbd','det','nn'],2),
465 (['vbd','nn','det','nn'],1)]
466 and then in each loop through sentences, make sure we handle the
467 frequency correctly.
470 Is there much to gain here?
471 </p>
472 </li>
473 <li id="sec-8.5"><span class="todo">TOGROK</span> tags as numbers or tags as strings? &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
474 Need to clean up the representation.
477 Stick with tag-strings in initialization then switch to numbers for
478 IO-algorithm perhaps? Can probably afford more string-matching in
479 initialization..
480 </p></li>
481 <li id="sec-8.6"><span class="done">DONE</span> if loc_h == t, no need to try right-attachment rules &amp;v.v. &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
482 <span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-06-10 Tue 14:34</span><br/>
483 (and if loc_h == s, no need to try left-attachment rules.)
486 Modest speed increase (5%).
487 </p></li>
488 <li id="sec-8.7"><span class="done">DONE</span> io.debug parameters should not call functions &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
489 <span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-06-10 Tue 12:26</span><br/>
490 Exchanged all io.debug(str,'level') calls with statements of the form:
491 <pre>
492 if 'level' in io.DEBUG:
493 print str
494 </pre>
497 and got an almost threefold speed increase on inner().
498 </p></li>
499 <li id="sec-8.8"><span class="done">DONE</span> inner_dmv() should disregard rules with heads not in sent &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
500 <span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-06-08 Sun 10:18</span><br/>
501 If the sentence is "nn vbd det nn", we should not even look at rules
502 where
503 <pre>
504 rule.head() not in "nn vbd det nn".split()
505 </pre>
506 This is ruled out by getting rules from g.rules(LHS, sent).
509 Also, we optimize this further by saying we don't even recurse into
510 attachment rules where
511 <pre>
512 rule.head() not in sent[ s :r+1]
513 rule.head() not in sent[r+1:t+1]
514 </pre>
515 meaning, if we're looking at the span "vbd det", we only use
516 attachment rules where both daughters are members of ['vbd','det']
517 (although we don't (yet) care about removing rules that rewrite to the
518 same tag if there are no duplicate tags in the span, etc., that would
519 be a lot of trouble for little potential gain).
520 </p></li>
521 </ul>
522 </div>
524 </div>
526 <div id="outline-container-9" class="outline-2">
527 <h2 id="sec-9">9 Adjacency and combining it with inner()</h2>
528 <div id="text-9">
530 <p>Each DMV_Rule now has both a probN and a probA, for
531 adjacencies. inner() needs the correct one in each case.
532 </p>
534 Adjacency gives a problem with duplicate words/tags, eg. in the
535 sentence "a a b". If this has the dependency structure b-&gt;a<sub>0</sub>-&gt;a<sub>1</sub>,
536 then b is non-adjacent to a<sub>0</sub> and should use probN (for the LRStop and
537 the attachment of a<sub>0</sub>), while the other rules should all use
538 probA. But within the e(0,2,b) we can't just say "oh, a has index 0
539 so it's not adjacent to 2", since there's also an a at index 1, and
540 there's also a dependency structure b-&gt;a<sub>1</sub>-&gt;a<sub>0</sub> for that. We want
541 both. And in possibly much more complex versions.
542 </p>
544 So now we call inner() for each location of a head, and on each
545 terminal, loc_h must equal s (and t). In the recursive attachment
546 calls, we use the locations (sentence indices) of words to the left or
547 right of the head in calls to inner(). loc_h lets us check whether we
548 need probN or probA.
549 </p>
551 In each inner() call, loc_h is the location of the root of this
552 dependency structure.
553 </p><ul>
554 <li id="sec-9.1">Possible alternate type of adjacency<br/>
555 K&amp;M's adjacency is just whether or not an argument has been generated
556 in the current direction yet. One could also make a stronger type of
557 adjacency, where h and a are not adjacent if b is in between, eg. with
558 the sentence "a b h" and the structure ((h-&gt;a), (a-&gt;b)), h is
559 K&amp;M-adjacent to a, but not next to a, since b is in between. It's easy
560 to check this type of adjacency in inner(), but it needs new rules for
561 P_STOP reestimation.
562 </li>
563 </ul>
564 </div>
566 </div>
568 <div id="outline-container-10" class="outline-2">
569 <h2 id="sec-10">10 Expectation Maximation in IO/DMV-terms</h2>
570 <div id="text-10">
572 <p>inner(s,t,LHS) calculates the expected number of trees headed by LHS
573 from s to t (sentence positions). This uses the P_STOP and P_CHOOSE
574 values, which have been conveniently distributed into CNF rules as
575 probN and probA (non-adjacent and adjacent probabilites).
576 </p>
578 When re-estimating, we use the expected values from inner() to get new
579 values for P_STOP and P_CHOOSE. When we've re-estimated for the entire
580 corpus, we distribute P_STOP and P_CHOOSE into the CNF rules again, so
581 that in the next round we use new probN and probA to find
582 inner-probabilites.
583 </p>
585 The distribution of P_STOP and P_CHOOSE into CNF rules also happens in
586 init_normalize() (here along with the creation of P_STOP and
587 P_CHOOSE); P_STOP is used to create CNF rules where one branch of the
588 rule is STOP, P_CHOOSE is used to create rules of the form
589 <pre>
590 h -&gt; h _a_
591 h_ -&gt; h_ _a_
592 </pre>
593 </p>
595 Since "adjacency" is not captured in regular CNF rules, we need two
596 probabilites for each rule, and inner() has to know when to use which.
597 </p>
598 <ul>
599 <li id="sec-10.1"><span class="todo">TODO</span> Corpus access<br/>
600 </li>
601 <li id="sec-10.2"><span class="todo">TOGROK</span> sentences or rules as the "outer loop"? &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
602 In regard to the E/M-step, finding P<sub>STOP</sub>, P<sub>CHOOSE</sub>.
605 </li>
606 </ul>
607 </div>
609 </div>
611 <div id="outline-container-11" class="outline-2">
612 <h2 id="sec-11">11 Python-stuff</h2>
613 <div id="text-11">
615 <p>Make those debug statements steal a bit less attention in emacs:
616 <pre>
617 (font-lock-add-keywords
618 'python-mode ; not really regexp, a bit slow
619 '(("^\\( *\\)\\(\\if +'.+' +in +io.DEBUG. *\\(
620 \\1 .+$\\)+\\)" 2 font-lock-preprocessor-face t)))
621 (font-lock-add-keywords
622 'python-mode
623 '(("\\&lt;\\(\\(io\\.\\)?debug(.+)\\)" 1 font-lock-preprocessor-face t)))
624 </pre>
625 </p>
626 <ul>
627 <li>
628 <a href="src/pseudo.py">pseudo.py</a>
629 </li>
630 <li>
631 <a href="http://nltk.org/doc/en/structured-programming.html">http://nltk.org/doc/en/structured-programming.html</a> recursive dynamic
632 </li>
633 <li>
634 <a href="http://nltk.org/doc/en/advanced-parsing.html">http://nltk.org/doc/en/advanced-parsing.html</a>
635 </li>
636 <li>
637 <a href="http://jaynes.colorado.edu/PythonIdioms.html">http://jaynes.colorado.edu/PythonIdioms.html</a>
641 </li>
642 </ul>
643 </div>
645 </div>
647 <div id="outline-container-12" class="outline-2">
648 <h2 id="sec-12">12 Git</h2>
649 <div id="text-12">
651 <p>Repository web page: <a href="http://repo.or.cz/w/dmvccm.git">http://repo.or.cz/w/dmvccm.git</a>
652 </p>
654 Setting up a new project:
655 <pre>
656 git init
657 git add .
658 git commit -m "first release"
659 </pre>
660 </p>
662 Later on: (<code>-a</code> does <code>git rm</code> and <code>git add</code> automatically)
663 <pre>
664 git init
665 git commit -a -m "some subsequent release"
666 </pre>
667 </p>
669 Then push stuff up to the remote server:
670 <pre>
671 git push git+ssh://username@repo.or.cz/srv/git/dmvccm.git master
672 </pre>
673 </p>
675 (<code>eval `ssh-agent`</code> and <code>ssh-add</code> to avoid having to type in keyphrase all
676 the time)
677 </p>
679 Make a copy of the (remote) master branch:
680 <pre>
681 git clone git://repo.or.cz/dmvccm.git
682 </pre>
683 </p>
685 Make and name a new branch in this folder
686 <pre>
687 git checkout -b mybranch
688 </pre>
689 </p>
691 To save changes in <code>mybranch</code>:
692 <pre>
693 git commit -a
694 </pre>
695 </p>
697 Go back to the master branch (uncommitted changes from <code>mybranch</code> are
698 carried over):
699 <pre>
700 git checkout master
701 </pre>
702 </p>
704 Try out:
705 <pre>
706 git add --interactive
707 </pre>
708 </p>
710 Good tutorial:
711 <a href="http://www-cs-students.stanford.edu/~blynn//gitmagic/">http://www-cs-students.stanford.edu/~blynn//gitmagic/</a>
712 </p></div>
713 </div>
714 <div id="postamble"><p class="author"> Author: Kevin Brubeck Unhammer
715 <a href="mailto:K.BrubeckUnhammer at student uva nl ">&lt;K.BrubeckUnhammer at student uva nl &gt;</a>
716 </p>
717 <p class="date"> Date: 2008/06/11 16:40:21</p>
718 </div><p class="postamble">Skrive vha. emacs + <a href='http://orgmode.org/'>org-mode</a></p></body>
719 </html>