before trying new pchoose reestimation
[dmvccm.git] / DMVCCM.html
bloba09c831ec2f2539438bca680c56d8c006e67da19
1 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
2 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
3 <html xmlns="http://www.w3.org/1999/xhtml"
4 lang="en" xml:lang="en">
5 <head>
6 <title>DMV/CCM &ndash; todo-list / progress</title>
7 <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
8 <meta name="generator" content="Org-mode"/>
9 <meta name="generated" content="2008/06/18 10:34:09"/>
10 <meta name="author" content="Kevin Brubeck Unhammer"/>
11 <link rel="stylesheet" type="text/css" href="http://www.student.uib.no/~kun041/org.css">
12 </head><body>
13 <h1 class="title">DMV/CCM &ndash; todo-list / progress</h1>
17 <div id="table-of-contents">
18 <h2>Table of Contents</h2>
19 <div id="text-table-of-contents">
20 <ul>
21 <li><a href="#sec-1">1 dmvccm report and project</a></li>
22 <li><a href="#sec-2">2 P_STOP and P_CHOOSE for IO/EM (reestimation)</a>
23 <ul>
24 <li><a href="#sec-2.1">2.1 L&amp;Y formula (20) or c()-formula?</a></li>
25 <li><a href="#sec-2.2">2.2 [#A] P_CHOOSE formula:</a>
26 <ul>
27 <li><a href="#sec-2.2.1">2.2.1 Problem with this formula:</a></li>
28 </ul>
29 </li>
30 <li><a href="#sec-2.3">2.3 P_STOP formulas for various dir and adj:</a></li>
31 </ul>
32 </li>
33 <li><a href="#sec-3">3 outer probabilities</a></li>
34 <li><a href="#sec-4">4 Find Most Probable Parse of given test sentence</a></li>
35 <li><a href="#sec-5">5 Combine CCM with DMV</a>
36 <ul>
37 <li><a href="#sec-5.1">5.1 Make inner() and outer() also allow left-first attachment</a></li>
38 </ul>
39 </li>
40 <li><a href="#sec-6">6 Get a dependency parsed version of WSJ to test with</a></li>
41 <li><a href="#sec-7">7 Initialization </a>
42 <ul>
43 <li><a href="#sec-7.1">7.1 CCM Initialization </a></li>
44 <li><a href="#sec-7.2">7.2 Separate initialization to another file?</a></li>
45 <li><a href="#sec-7.3">7.3 DMV Initialization probabilities</a></li>
46 <li><a href="#sec-7.4">7.4 DMV Initialization frequencies </a>
47 <ul>
48 <li><a href="#sec-7.4.1">7.4.1 P_STOP </a></li>
49 <li><a href="#sec-7.4.2">7.4.2 P_CHOOSE</a></li>
50 </ul></li>
51 </ul>
52 </li>
53 <li><a href="#sec-8">8 [#C] Deferred</a>
54 <ul>
55 <li><a href="#sec-8.1">8.1 when reestimating P_STOP etc, remove rules with p &lt; epsilon</a></li>
56 <li><a href="#sec-8.2">8.2 inner_dmv, short ranges and impossible attachment</a></li>
57 <li><a href="#sec-8.3">8.3 clean up the module files</a></li>
58 <li><a href="#sec-8.4">8.4 Some (tagged) sentences are bound to come twice</a></li>
59 <li><a href="#sec-8.5">8.5 tags as numbers or tags as strings?</a></li>
60 </ul>
61 </li>
62 <li><a href="#sec-9">9 Adjacency and combining it with the inside-outside algorithm</a>
63 <ul>
64 <li><a href="#sec-9.1">9.1 Possible alternate type of adjacency</a></li>
65 </ul>
66 </li>
67 <li><a href="#sec-10">10 Expectation Maximation in IO/DMV-terms</a></li>
68 <li><a href="#sec-11">11 Python-stuff</a></li>
69 <li><a href="#sec-12">12 Git</a></li>
70 </ul>
71 </div>
72 </div>
74 <div id="outline-container-1" class="outline-2">
75 <h2 id="sec-1">1 dmvccm report and project</h2>
76 <div id="text-1">
78 <p><span class="timestamp-kwd">DEADLINE: </span> <span class="timestamp">2008-06-30 Mon</span><br/>
79 </p><ul>
80 <li>
81 <a href="src/dmv.py">dmv.py</a>
82 </li>
83 <li>
84 <a href="src/io.py">io.py</a>
85 </li>
86 <li>
87 <a href="src/harmonic.py">harmonic.py</a>
89 </li>
90 </ul>
92 <p>(But absolute, extended, really-quite-dead-now deadline: August 31&hellip;)
93 </p>
94 <p>
95 <a href="http://www.student.uib.no/~kun041/dmvccm/DMVCCM_archive.html">Archived entries</a> from this file.
96 </p></div>
98 </div>
100 <div id="outline-container-2" class="outline-2">
101 <h2 id="sec-2">2 P_STOP and P_CHOOSE for IO/EM (reestimation)</h2>
102 <div id="text-2">
104 <p><a href="src/dmv.py">dmv-P_STOP</a>
105 Remember: The P<sub>STOP</sub> formula is upside-down (left-to-right also).
106 (In the article..not the <a href="http://www.eecs.berkeley.edu/~klein/papers/klein_thesis.pdf">thesis</a>)
107 </p>
109 </div>
111 <div id="outline-container-2.1" class="outline-3">
112 <h3 id="sec-2.1">2.1 <span class="todo">TOGROK</span> L&amp;Y formula (20) or c()-formula?</h3>
113 <div id="text-2.1">
115 <p>At first glance, I've been doing <br/>
116 c(s,r,_a_) / [ c(r+1,t, h_) * c(s,t, h_) ] <br/>
117 &lt;=&gt;
118 </p><table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
119 <col align="left"></col>
120 <tbody>
121 <tr><td>1/P_sent * e(s,r,_a_) * f(s,r,_a_)</td></tr>
122 <tr><td>[ 1/P_sent * e(r+1,t,h_) * f(r+t,t,h_) * 1/P_sent * e(s,t, h_) * f(s,t,h_) ]</td></tr>
123 </tbody>
124 </table>
128 while L&amp;Y uses
129 </p><table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
130 <col align="left"></col>
131 <tbody>
132 <tr><td>w_sent = 1/P_sent * prob(h_-&gt;_a_ h_) * e(s,r,_a_) * e(r+1,t, h_) * f(s,t,h_)</td></tr>
133 <tr><td>v_sent = 1/P_sent * e(s,t,h_) * f(s,t,h_) = c(s,t,h_)</td></tr>
134 </tbody>
135 </table>
139 Just for completion: <br/>
140 PSTOP =eg. c(s,t,_h_) / c(s,t,h_) for certain s,t <br/>
141 &lt;=&gt;
142 </p><table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
143 <col align="left"></col>
144 <tbody>
145 <tr><td>1/P_sent * e(s,t,_h_) * f(s,t,_h_)</td></tr>
146 <tr><td>1/P_sent * e(s,t, h_) * f(s,t, h_)</td></tr>
147 </tbody>
148 </table>
152 vs what seems to be the direct L&amp;Y translation:
153 </p>
154 <table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
155 <col align="left"></col>
156 <tbody>
157 <tr><td>1/P_sent * prob(_ h_ -&gt; STOP h_) * e(s,t,h_) * f(s,t,_h_)</td></tr>
158 <tr><td>1/P_sent * e(s,t, h_) * f(s,t, h_)</td></tr>
159 </tbody>
160 </table>
164 &hellip;but if we used that, e(s,t,h_) would cancel out and we'd just be
165 dividing the outer scores * rule-probability.. hmm.. Also, there's the
166 problem of what 'rule-probability' means here, since there's on the
167 one hand the CNF-rule probability, and on the other hand PSTOP and
168 PCHOOSE &ndash; in stop rules, PSTOP is the rule-probability, while for
169 attachment rules, it's PCHOOSE*(1-PSTOP).
170 </p></div>
172 </div>
174 <div id="outline-container-2.2" class="outline-3">
175 <h3 id="sec-2.2">2.2 <span class="todo">TODO</span> [#A] P_CHOOSE formula:</h3>
176 <div id="text-2.2">
178 <ul>
179 <li>
180 Note taken on <span class="timestamp">2008-06-16 Mon 12:33</span> <br/>
181 Implemented as assumed below, but this does not give 1-summing
182 probabilities, why? Are we checking the right configurations? Eg,
183 is it enough to just check c(s,t,h_), c(s,r,_a_) and
184 c(r+1,t,h_) for left attachments??
186 </li>
187 </ul>
189 <p>So far assuming this:
190 </p>
191 <table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
192 <col align="left"></col><col align="left"></col>
193 <tbody>
194 <tr><td>P<sub>CHOOSE</sub>(a : h,R) =</td><td>&sum;<sub>corpus</sub> &sum;<sub>s=loc(h)</sub> &sum;<sub>t &gt; loc(h)</sub> &sum;<sub>loc(h) &lt; r &lt;= t</sub> c(r,t,_a_)</td></tr>
195 <tr><td></td><td>&sum;<sub>corpus</sub> &sum;<sub>s=loc(h)</sub> &sum;<sub>t &gt; loc(h)</sub> &sum;<sub>loc(h) &lt; r &lt;= t</sub> c(s,t,h) * c(s, r-1, h_)</td></tr>
196 <tr><td></td><td></td></tr>
197 <tr><td>P<sub>CHOOSE</sub>(a : h,L) =</td><td>&sum;<sub>corpus</sub> &sum;<sub>s&lt;loc(h)</sub> &sum;<sub>t&gt;=loc(h)</sub> &sum;<sub>r&lt;loc(h)</sub> c(s,r,_a_)</td></tr>
198 <tr><td></td><td>&sum;<sub>corpus</sub> &sum;<sub>s&lt;loc(h)</sub> &sum;<sub>t&gt;=loc(h)</sub> &sum;<sub>r&lt;loc(h)</sub> c(s,t,h_) * c(r+1, t, h_)</td></tr>
199 </tbody>
200 </table>
202 t &gt;= loc(h) since there are many possibilites for right-attachments
203 below, and each of them alone gives a lower probability (through
204 multiplication) to the upper tree (so add them all)
207 The reason we have to check <i>both</i> children of the attachments is that we
208 have to make sure they are contiguous (otherwise we would have no way
209 of ruling out eg. h_-&gt;_b_,_b_-&gt;b_-&gt;_a_, where h_ covers <b>s</b> and <b>t</b>,_b_ is
210 from <b>s</b> to <b>x&lt;r</b> and _ a_ is from <b>s</b> to <b>r</b>).
211 </p>
212 </div>
214 <div id="outline-container-2.2.1" class="outline-4">
215 <h4 id="sec-2.2.1">2.2.1 <span class="todo">TOGROK</span> Problem with this formula:</h4>
216 <div id="text-2.2.1">
218 <p>On calculating P<sub>CHOOSE</sub>(det | vbd, L) from the 1-sentence corpus
219 "det nn vbd", there are two ways in which 'det' could be left
220 attached; one is where it is attached non-adjacently (after 'vbd' has
221 attached 'nn'):
222 <pre>
223 &gt;&gt;&gt; c_L = c(0,0,(SEAL,g.tagnum('det')),0,g,'det nn vbd'.split(),{},{})
224 &gt;&gt;&gt; c_L
225 0.62669683257918563 # so far so good
227 &gt;&gt;&gt; c_R = c(1,2,(RGO_L,g.tagnum('vbd')),2,g,'det nn vbd'.split(),{},{})
228 &gt;&gt;&gt; c_R
229 0.11312217194570134 # still seeems an OK probability
231 &gt;&gt;&gt; c_M = c(0,2,(RGO_L,g.tagnum('vbd')),2,g,'det nn vbd'.split(),{},{})
232 &gt;&gt;&gt; c_M
233 0.31674208144796384 # and this seems good
235 &gt;&gt;&gt; c_L / (c_R * c_M)
236 17.490571428571432 # but this is Way off...
237 </pre>
238 </p>
239 </div>
240 </div>
242 </div>
244 <div id="outline-container-2.3" class="outline-3">
245 <h3 id="sec-2.3">2.3 <span class="done">DONE</span> P_STOP formulas for various dir and adj:</h3>
246 <div id="text-2.3">
248 <p><span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-06-15 Sun 23:40</span><br/>
249 Assuming this:
250 </p>
251 <table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
252 <col align="left"></col><col align="left"></col>
253 <tbody>
254 <tr><td>P<sub>STOP</sub>(STOP : h,L,non_adj) =</td><td>&sum;<sub>corpus</sub> &sum;<sub>s&lt;loc(h)</sub> &sum;<sub>t&gt;=loc(h)</sub> c(s,t,_h_)</td></tr>
255 <tr><td></td><td>&sum;<sub>corpus</sub> &sum;<sub>s&lt;loc(h)</sub> &sum;<sub>t&gt;=loc(h)</sub> c(s,t,h_)</td></tr>
256 <tr><td></td><td></td></tr>
257 <tr><td>P<sub>STOP</sub>(STOP : h,L,adj) =</td><td>&sum;<sub>corpus</sub> &sum;<sub>s=loc(h)</sub> &sum;<sub>t&gt;=loc(h)</sub> c(s,t,_h_)</td></tr>
258 <tr><td></td><td>&sum;<sub>corpus</sub> &sum;<sub>s=loc(h)</sub> &sum;<sub>t&gt;=loc(h)</sub> c(s,t,h_)</td></tr>
259 <tr><td></td><td></td></tr>
260 <tr><td>P<sub>STOP</sub>(STOP : h,R,non_adj) =</td><td>&sum;<sub>corpus</sub> &sum;<sub>s=loc(h)</sub> &sum;<sub>t&gt;loc(h)</sub> c(s,t,h_)</td></tr>
261 <tr><td></td><td>&sum;<sub>corpus</sub> &sum;<sub>s=loc(h)</sub> &sum;<sub>t&gt;loc(h)</sub> c(s,t,h)</td></tr>
262 <tr><td></td><td></td></tr>
263 <tr><td>P<sub>STOP</sub>(STOP : h,R,adj) =</td><td>&sum;<sub>corpus</sub> &sum;<sub>s=loc(h)</sub> &sum;<sub>t=loc(h)</sub> c(s,t,h_)</td></tr>
264 <tr><td></td><td>&sum;<sub>corpus</sub> &sum;<sub>s=loc(h)</sub> &sum;<sub>t=loc(h)</sub> c(s,t,h)</td></tr>
265 </tbody>
266 </table>
270 (And P<sub>STOP</sub>(-STOP|&hellip;) = 1 - P<sub>STOP</sub>(STOP|&hellip;) )
271 </p></div>
272 </div>
274 </div>
276 <div id="outline-container-3" class="outline-2">
277 <h2 id="sec-3">3 <span class="done">DONE</span> outer probabilities</h2>
278 <div id="text-3">
280 <p><span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-06-12 Thu 11:11</span><br/>
281 <a name="outer">&nbsp;</a>
282 There are 6 different configurations here, based on what the above
283 (mother) node is, and what the Node for which we're computing is.
284 Here <b>r</b> is not between <b>s</b> and <b>t</b> as in inner(), but an <i>outer</i> index. <b>loc_N</b>
285 is the location of the Node head in the sentence, <b>loc_m</b> for the head
286 of the mother of Node.
287 </p>
288 <ul>
289 <li>
290 mother is a RIGHT-stop:
291 <ul>
292 <li>
293 outer(<b>s, t</b>, mom.LHS, <b>loc_N</b>), no inner-call
294 </li>
295 <li>
296 adjacent iff <b>t</b> == <b>loc_m</b>
297 </li>
298 </ul>
299 </li>
300 <li>
301 mother is a LEFT-stop:
302 <ul>
303 <li>
304 outer(<b>s, t</b>, mom.LHS, <b>loc_N</b>), no inner-call
305 </li>
306 <li>
307 adjacent iff <b>s</b> == <b>loc_m</b>
309 </li>
310 </ul>
311 </li>
312 <li>
313 Node is on the LEFT branch (mom.L == Node)
314 <ul>
315 <li>
316 and mother is a LEFT attachment:
317 <ul>
318 <li>
319 <b>loc_N</b> will be in the LEFT branch, can be anything here.
320 </li>
321 <li>
322 In the RIGHT, non-attached, branch we find inner(<b>t+1, r</b>, mom.R,
323 <b>loc_m</b>) for all possible <b>loc_m</b> in the right part of the sentence.
324 </li>
325 <li>
326 outer(<b>s, r</b>, mom.LHS, <b>loc_m</b>).
327 </li>
328 <li>
329 adjacent iff <b>t+1</b> == <b>loc_m</b>
330 </li>
331 </ul>
332 </li>
333 <li>
334 and mother is a RIGHT attachment:
335 <ul>
336 <li>
337 <b>loc_m</b> = <b>loc_N</b>.
338 </li>
339 <li>
340 In the RIGHT, attached, branch we find inner(<b>t+1, r</b>, mom.R, <b>loc_R</b>) for
341 all possible <b>loc_R</b> in the right part of the sentence.
342 </li>
343 <li>
344 outer(<b>s, r</b>, mom.LHS, <b>loc_N</b>).
345 </li>
346 <li>
347 adjacent iff <b>t</b> == <b>loc_m</b>
349 </li>
350 </ul>
351 </li>
352 </ul>
353 </li>
354 <li>
355 Node is on the RIGHT branch (mom.R == Node)
356 <ul>
357 <li>
358 and mother is a LEFT attachment:
359 <ul>
360 <li>
361 <b>loc_m</b> = <b>loc_N</b>.
362 </li>
363 <li>
364 In the LEFT, attached, branch we find inner(<b>r, s-1</b>, mom.L, <b>loc_L</b>) for
365 all possible <b>loc_L</b> in the left part of the sentence.
366 </li>
367 <li>
368 outer(<b>r, t</b>, mom.LHS, <b>loc_m</b>).
369 </li>
370 <li>
371 adjacent iff <b>s</b> == <b>loc_m</b>
372 </li>
373 </ul>
374 </li>
375 <li>
376 and mother is a RIGHT attachment:
377 <ul>
378 <li>
379 <b>loc_N</b> will be in the RIGHT branch, can be anything here.
380 </li>
381 <li>
382 In the LEFT, non-attached, branch we find inner(<b>r, s-1</b>, mom.L, <b>loc_m</b>) for
383 all possible <b>loc_m</b> in the left part of the sentence.
384 </li>
385 <li>
386 outer(<b>r, t</b>, mom.LHS, <b>loc_N</b>).
387 </li>
388 <li>
389 adjacent iff <b>s-1</b> == <b>loc_m</b>
391 </li>
392 </ul>
393 </li>
394 </ul>
395 </li>
396 </ul>
398 <p><img src="outer_attachments.jpg"/>
399 </p>
401 <pre>
402 in notes: in code (constants): in Klein thesis:
403 -------------------------------------------------------------------
404 _h_ SEAL bar over h
405 h_ RGO_L right-under-left-arrow over h
406 h GO_R right-arrow over h
408 LGO_R left-under-right-arrow over h
409 GO_L left-arrow over h
410 </pre>
411 </p>
413 Also, unlike in <a href="http://bibsonomy.org/bibtex/2b9f6798bb092697da7042ca3f5dee795">Lari & Young</a>, non-ROOT ('S') symbols may cover the
414 whole sentence, but ROOT may <i>only</i> appear if it covers the whole
415 sentence.
416 </p>
418 </div>
420 </div>
422 <div id="outline-container-4" class="outline-2">
423 <h2 id="sec-4">4 <span class="todo">TOGROK</span> Find Most Probable Parse of given test sentence</h2>
424 <div id="text-4">
426 <p>We could probably do it pretty easily from the chart:
427 </p><ul>
428 <li>
429 for all loc_h, select the (0,len(sent),ROOT,loc_h) that has the highest p
430 </li>
431 <li>
432 for stops, just keep going
433 </li>
434 <li>
435 for rewrites, for loc_dep select the one that has the highest p
436 </li>
437 </ul>
438 </div>
440 </div>
442 <div id="outline-container-5" class="outline-2">
443 <h2 id="sec-5">5 <span class="todo">TOGROK</span> Combine CCM with DMV</h2>
444 <div id="text-5">
446 <p>Page 108 (pdf: 124) of <a href="http://www.eecs.berkeley.edu/~klein/papers/klein_thesis.pdf">Klein's thesis</a> gives info about this.
447 </p>
448 </div>
450 <div id="outline-container-5.1" class="outline-3">
451 <h3 id="sec-5.1">5.1 <span class="todo">TODO</span> Make inner() and outer() also allow left-first attachment</h3>
452 <div id="text-5.1">
454 <p>Using P<sub>ORDER</sub>(<i>left-first</i> | w) etc.
455 </p></div>
456 </div>
458 </div>
460 <div id="outline-container-6" class="outline-2">
461 <h2 id="sec-6">6 <span class="todo">TODO</span> Get a dependency parsed version of WSJ to test with</h2>
462 <div id="text-6">
464 </div>
466 </div>
468 <div id="outline-container-7" class="outline-2">
469 <h2 id="sec-7">7 Initialization </h2>
470 <div id="text-7">
472 <p><a href="/Users/kiwibird/Documents/Skole/V08/Probability/dmvccm/src/dmv.py">dmv-inits</a>
473 </p>
475 We go through the corpus, since the probabilities are based on how far
476 away in the sentence arguments are from their heads.
477 </p>
478 </div>
480 <div id="outline-container-7.1" class="outline-3">
481 <h3 id="sec-7.1">7.1 <span class="todo">TOGROK</span> CCM Initialization </h3>
482 <div id="text-7.1">
484 <p>P<sub>SPLIT</sub> used here&hellip; how, again?
485 </p></div>
487 </div>
489 <div id="outline-container-7.2" class="outline-3">
490 <h3 id="sec-7.2">7.2 <span class="done">DONE</span> Separate initialization to another file? &nbsp;&nbsp;&nbsp;<span class="tag">PRETTIER</span></h3>
491 <div id="text-7.2">
493 <p><span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-06-08 Sun 12:51</span><br/>
494 <a href="src/harmonic.py">harmonic.py</a>
495 </p></div>
497 </div>
499 <div id="outline-container-7.3" class="outline-3">
500 <h3 id="sec-7.3">7.3 <span class="done">DONE</span> DMV Initialization probabilities</h3>
501 <div id="text-7.3">
503 <p>(from initialization frequency)
504 </p></div>
506 </div>
508 <div id="outline-container-7.4" class="outline-3">
509 <h3 id="sec-7.4">7.4 <span class="done">DONE</span> DMV Initialization frequencies </h3>
510 <div id="text-7.4">
512 <p><span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-05-27 Tue 20:04</span><br/>
513 </p>
514 </div>
516 <div id="outline-container-7.4.1" class="outline-4">
517 <h4 id="sec-7.4.1">7.4.1 P_STOP </h4>
518 <div id="text-7.4.1">
520 <p>P<sub>STOP</sub> is not well defined by K&amp;M. One possible interpretation given
521 the sentence [det nn vb nn] is
522 <pre>
523 f_{STOP}( STOP|det, L, adj) +1
524 f_{STOP}(-STOP|det, L, adj) +0
525 f_{STOP}( STOP|det, L, non_adj) +1
526 f_{STOP}(-STOP|det, L, non_adj) +0
527 f_{STOP}( STOP|det, R, adj) +0
528 f_{STOP}(-STOP|det, R, adj) +1
530 f_{STOP}( STOP|nn, L, adj) +0
531 f_{STOP}(-STOP|nn, L, adj) +1
532 f_{STOP}( STOP|nn, L, non_adj) +1 # since there's at least one to the left
533 f_{STOP}(-STOP|nn, L, non_adj) +0
534 </pre>
535 </p><ul>
536 <li id="sec-7.4.1.1"><span class="todo">TODO</span> tweak<br/>
537 <pre>
538 f[head, 'STOP', 'LN'] += (i_h &lt;= 1) # first two words
539 f[head, '-STOP', 'LN'] += (not i_h &lt;= 1)
540 f[head, 'STOP', 'LA'] += (i_h == 0) # very first word
541 f[head, '-STOP', 'LA'] += (not i_h == 0)
542 f[head, 'STOP', 'RN'] += (i_h &gt;= n - 2) # last two words
543 f[head, '-STOP', 'RN'] += (not i_h &gt;= n - 2)
544 f[head, 'STOP', 'RA'] += (i_h == n - 1) # very last word
545 f[head, '-STOP', 'RA'] += (not i_h == n - 1)
546 </pre>
548 <pre>
549 # this one requires some additional rewriting since it
550 # introduces divisions by zero
551 f[head, 'STOP', 'LN'] += (i_h == 1) # second word
552 f[head, '-STOP', 'LN'] += (not i_h &lt;= 1) # not first two
553 f[head, 'STOP', 'LA'] += (i_h == 0) # first word
554 f[head, '-STOP', 'LA'] += (not i_h == 0) # not first
555 f[head, 'STOP', 'RN'] += (i_h == n - 2) # second-to-last
556 f[head, '-STOP', 'RN'] += (not i_h &gt;= n - 2) # not last two
557 f[head, 'STOP', 'RA'] += (i_h == n - 1) # last word
558 f[head, '-STOP', 'RA'] += (not i_h == n - 1) # not last
559 </pre>
561 <pre>
562 f[head, 'STOP', 'LN'] += (i_h == 1) # second word
563 f[head, '-STOP', 'LN'] += (not i_h == 1) # not second
564 f[head, 'STOP', 'LA'] += (i_h == 0) # first word
565 f[head, '-STOP', 'LA'] += (not i_h == 0) # not first
566 f[head, 'STOP', 'RN'] += (i_h == n - 2) # second-to-last
567 f[head, '-STOP', 'RN'] += (not i_h == n - 2) # not second-to-last
568 f[head, 'STOP', 'RA'] += (i_h == n - 1) # last word
569 f[head, '-STOP', 'RA'] += (not i_h == n - 1) # not last
570 </pre>
572 "all words take the same number of arguments" interpreted as
573 <pre>
574 for all heads:
575 p_STOP(head, 'STOP', 'LN') = 0.3
576 p_STOP(head, 'STOP', 'LA') = 0.5
577 p_STOP(head, 'STOP', 'RN') = 0.4
578 p_STOP(head, 'STOP', 'RA') = 0.7
579 </pre>
580 (which we easily may tweak in init_zeros())
581 </li>
582 </ul>
583 </div>
585 </div>
587 <div id="outline-container-7.4.2" class="outline-4">
588 <h4 id="sec-7.4.2">7.4.2 P_CHOOSE</h4>
589 <div id="text-7.4.2">
591 <p>Go through the corpus, counting distances between heads and
592 arguments. In [det nn vb nn], we give
593 </p><ul>
594 <li>
595 f<sub>CHOOSE</sub>(nn|det, R) +1/1 + C
596 </li>
597 <li>
598 f<sub>CHOOSE</sub>(vb|det, R) +1/2 + C
599 </li>
600 <li>
601 f<sub>CHOOSE</sub>(nn|det, R) +1/3 + C
602 <ul>
603 <li>
604 If this were the full corpus, P<sub>CHOOSE</sub>(nn|det, R) would have
605 (1+1/3+2C) / sum_a f<sub>CHOOSE</sub>(a|det, R)
607 </li>
608 </ul>
609 </li>
610 </ul>
612 <p>The ROOT gets "each argument with equal probability", so in a sentence
613 of three words, 1/3 for each (in [nn vb nn], 'nn' gets 2/3). Basically
614 just a frequency count of the corpus&hellip;
615 </p>
617 In a sense there are no terminal probabilities, since an <i>h</i> can only
618 rewrite to an 'h' anyway (it's just a check for whether, at this
619 location in the sentence, we have the right POS-tag).
620 </p></div>
621 </div>
622 </div>
624 </div>
626 <div id="outline-container-8" class="outline-2">
627 <h2 id="sec-8">8 [#C] Deferred</h2>
628 <div id="text-8">
630 <p><a href="http://wiki.python.org/moin/PythonSpeed/PerformanceTips">http://wiki.python.org/moin/PythonSpeed/PerformanceTips</a> Eg., use
631 map/reduce/filter/[i for i in [i's]]/(i for i in [i's]) instead of
632 for-loops; use local variables for globals (global variables or or
633 functions), etc.
634 </p>
636 </div>
638 <div id="outline-container-8.1" class="outline-3">
639 <h3 id="sec-8.1">8.1 <span class="todo">TODO</span> when reestimating P_STOP etc, remove rules with p &lt; epsilon &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span></h3>
640 <div id="text-8.1">
642 </div>
644 </div>
646 <div id="outline-container-8.2" class="outline-3">
647 <h3 id="sec-8.2">8.2 <span class="todo">TODO</span> inner_dmv, short ranges and impossible attachment &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span></h3>
648 <div id="text-8.2">
650 <p>If s-t &lt;= 2, there can be only one attachment below, so don't recurse
651 with both Lattach=True and Rattach=True.
652 </p>
654 If s-t &lt;= 1, there can be no attachment below, so only recurse with
655 Lattach=False, Rattach=False.
656 </p>
658 Put this in the loop under rewrite rules (could also do it in the STOP
659 section, but that would only have an effect on very short sentences).
660 </p></div>
662 </div>
664 <div id="outline-container-8.3" class="outline-3">
665 <h3 id="sec-8.3">8.3 <span class="todo">TODO</span> clean up the module files &nbsp;&nbsp;&nbsp;<span class="tag">PRETTIER</span></h3>
666 <div id="text-8.3">
668 <p>Is there better way to divide dmv and harmonic? There's a two-way
669 dependency between the modules. Guess there could be a third file that
670 imports both the initialization and the actual EM stuff, while a file
671 containing constants and classes could be imported by all others:
672 <pre>
673 dmv.py imports dmv_EM.py imports dmv_classes.py
674 dmv.py imports dmv_inits.py imports dmv_classes.py
675 </pre>
676 </p>
677 </div>
679 </div>
681 <div id="outline-container-8.4" class="outline-3">
682 <h3 id="sec-8.4">8.4 <span class="todo">TOGROK</span> Some (tagged) sentences are bound to come twice &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span></h3>
683 <div id="text-8.4">
685 <p>Eg, first sort and count, so that the corpus
686 [['nn','vbd','det','nn'],
687 ['vbd','nn','det','nn'],
688 ['nn','vbd','det','nn']]
689 becomes
690 [(['nn','vbd','det','nn'],2),
691 (['vbd','nn','det','nn'],1)]
692 and then in each loop through sentences, make sure we handle the
693 frequency correctly.
694 </p>
696 Is there much to gain here?
697 </p>
698 </div>
700 </div>
702 <div id="outline-container-8.5" class="outline-3">
703 <h3 id="sec-8.5">8.5 <span class="todo">TOGROK</span> tags as numbers or tags as strings? &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span></h3>
704 <div id="text-8.5">
706 <p>Need to clean up the representation.
707 </p>
709 Stick with tag-strings in initialization then switch to numbers for
710 IO-algorithm perhaps? Can probably afford more string-matching in
711 initialization..
712 </p></div>
713 </div>
715 </div>
717 <div id="outline-container-9" class="outline-2">
718 <h2 id="sec-9">9 Adjacency and combining it with the inside-outside algorithm</h2>
719 <div id="text-9">
721 <p>Each DMV_Rule has both a probN and a probA, for adjacencies. inner()
722 and outer() needs the correct one in each case.
723 </p>
725 In each inner() call, loc_h is the location of the head of this
726 dependency structure. In each outer() call, it's the head of the <i>Node</i>,
727 the structure we're looking outside of.
728 </p>
730 We call inner() for each location of a head, and on each terminal,
731 loc_h must equal s (and t). In the recursive attachment calls, we use
732 the locations (sentence indices) of words to the left or right of the
733 head in calls to inner(). <i>loc_h lets us check whether we need probN or probA</i>.
734 </p>
735 </div>
737 <div id="outline-container-9.1" class="outline-3">
738 <h3 id="sec-9.1">9.1 Possible alternate type of adjacency</h3>
739 <div id="text-9.1">
741 <p>K&amp;M's adjacency is just whether or not an argument has been generated
742 in the current direction yet. One could also make a stronger type of
743 adjacency, where h and a are not adjacent if b is in between, eg. with
744 the sentence "a b h" and the structure ((h-&gt;a), (a-&gt;b)), h is
745 K&amp;M-adjacent to a, but not next to a, since b is in between. It's easy
746 to check this type of adjacency in inner(), but it needs new rules for
747 P_STOP reestimation.
748 </p></div>
749 </div>
751 </div>
753 <div id="outline-container-10" class="outline-2">
754 <h2 id="sec-10">10 Expectation Maximation in IO/DMV-terms</h2>
755 <div id="text-10">
757 <p>outer(s,t,Node) and inner(s,t,Node) calculates the expected number of
758 trees (CNF-)headed by Node from <b>s</b> to <b>t</b> (sentence positions). This uses
759 the P_STOP and P_CHOOSE values, which have been conveniently
760 distributed into CNF rules as probN and probA (non-adjacent and
761 adjacent probabilites).
762 </p>
764 When re-estimating, we use the expected values from outer() and
765 inner() to get new values for P_STOP and P_CHOOSE. When we've
766 re-estimated for the entire corpus, we distribute P_STOP and P_CHOOSE
767 into the CNF rules again, so that in the next round we use new probN
768 and probA to find outer- and inner-probabilites.
769 </p>
771 The distribution of P_STOP and P_CHOOSE into CNF rules also happens in
772 init_normalize() (here along with the creation of P_STOP and
773 P_CHOOSE); P_STOP is used to create CNF rules where one branch of the
774 rule is STOP, P_CHOOSE is used to create rules of the form
775 <pre>
776 h -&gt; h _a_
777 h_ -&gt; h_ _a_
778 </pre>
779 </p>
781 Since "adjacency" is not captured in regular CNF rules, we need two
782 probabilites for each rule, and outer() and inner() have to know when
783 to use which.
784 </p>
786 </div>
788 </div>
790 <div id="outline-container-11" class="outline-2">
791 <h2 id="sec-11">11 Python-stuff</h2>
792 <div id="text-11">
794 <p>Make those debug statements steal a bit less attention in emacs:
795 <pre>
796 (font-lock-add-keywords
797 'python-mode ; not really regexp, a bit slow
798 '(("^\\( *\\)\\(\\if +'.+' +in +io.DEBUG. *\\(
799 \\1 .+$\\)+\\)" 2 font-lock-preprocessor-face t)))
800 (font-lock-add-keywords
801 'python-mode
802 '(("\\&lt;\\(\\(io\\.\\)?debug(.+)\\)" 1 font-lock-preprocessor-face t)))
803 </pre>
804 </p>
805 <ul>
806 <li>
807 <a href="src/pseudo.py">pseudo.py</a>
808 </li>
809 <li>
810 <a href="http://nltk.org/doc/en/structured-programming.html">http://nltk.org/doc/en/structured-programming.html</a> recursive dynamic
811 </li>
812 <li>
813 <a href="http://nltk.org/doc/en/advanced-parsing.html">http://nltk.org/doc/en/advanced-parsing.html</a>
814 </li>
815 <li>
816 <a href="http://jaynes.colorado.edu/PythonIdioms.html">http://jaynes.colorado.edu/PythonIdioms.html</a>
820 </li>
821 </ul>
822 </div>
824 </div>
826 <div id="outline-container-12" class="outline-2">
827 <h2 id="sec-12">12 Git</h2>
828 <div id="text-12">
830 <p>Repository web page: <a href="http://repo.or.cz/w/dmvccm.git">http://repo.or.cz/w/dmvccm.git</a>
831 </p>
833 Setting up a new project:
834 <pre>
835 git init
836 git add .
837 git commit -m "first release"
838 </pre>
839 </p>
841 Later on: (<code>-a</code> does <code>git rm</code> and <code>git add</code> automatically)
842 <pre>
843 git init
844 git commit -a -m "some subsequent release"
845 </pre>
846 </p>
848 Then push stuff up to the remote server:
849 <pre>
850 git push git+ssh://username@repo.or.cz/srv/git/dmvccm.git master
851 </pre>
852 </p>
854 (<code>eval `ssh-agent`</code> and <code>ssh-add</code> to avoid having to type in keyphrase all
855 the time)
856 </p>
858 Make a copy of the (remote) master branch:
859 <pre>
860 git clone git://repo.or.cz/dmvccm.git
861 </pre>
862 </p>
864 Make and name a new branch in this folder
865 <pre>
866 git checkout -b mybranch
867 </pre>
868 </p>
870 To save changes in <code>mybranch</code>:
871 <pre>
872 git commit -a
873 </pre>
874 </p>
876 Go back to the master branch (uncommitted changes from <code>mybranch</code> are
877 carried over):
878 <pre>
879 git checkout master
880 </pre>
881 </p>
883 Try out:
884 <pre>
885 git add --interactive
886 </pre>
887 </p>
889 Good tutorial:
890 <a href="http://www-cs-students.stanford.edu/~blynn//gitmagic/">http://www-cs-students.stanford.edu/~blynn//gitmagic/</a>
891 </p></div>
892 </div>
893 <div id="postamble"><p class="author"> Author: Kevin Brubeck Unhammer
894 <a href="mailto:K.BrubeckUnhammer at student uva nl ">&lt;K.BrubeckUnhammer at student uva nl &gt;</a>
895 </p>
896 <p class="date"> Date: 2008/06/18 10:34:09</p>
897 </div><p class="postamble">Skrive vha. emacs + <a href='http://orgmode.org/'>org-mode</a></p></body>
898 </html>