DMVCCM.html

   1 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
   2                "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
   3 <html xmlns="http://www.w3.org/1999/xhtml"
   4 lang="en" xml:lang="en">
   5 <head>
   6 <title>DMV/CCM &ndash; todo-list / progress</title>
   7 <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
   8 <meta name="generator" content="Org-mode"/>
   9 <meta name="generated" content="2008/06/18 10:34:09"/>
  10 <meta name="author" content="Kevin Brubeck Unhammer"/>
  11 <link rel="stylesheet" type="text/css" href="http://www.student.uib.no/~kun041/org.css">
  12 </head><body>
  13 <h1 class="title">DMV/CCM &ndash; todo-list / progress</h1>
  14
  15
  16
  17 <div id="table-of-contents">
  18 <h2>Table of Contents</h2>
  19 <div id="text-table-of-contents">
  20 <ul>
  21 <li><a href="#sec-1">1 dmvccm report and project</a></li>
  22 <li><a href="#sec-2">2 P_STOP and P_CHOOSE for IO/EM (reestimation)</a>
  23 <ul>
  24 <li><a href="#sec-2.1">2.1 L&amp;Y formula (20) or c()-formula?</a></li>
  25 <li><a href="#sec-2.2">2.2 [#A] P_CHOOSE formula:</a>
  26 <ul>
  27 <li><a href="#sec-2.2.1">2.2.1 Problem with this formula:</a></li>
  28 </ul>
  29 </li>
  30 <li><a href="#sec-2.3">2.3 P_STOP formulas for various dir and adj:</a></li>
  31 </ul>
  32 </li>
  33 <li><a href="#sec-3">3 outer probabilities</a></li>
  34 <li><a href="#sec-4">4 Find Most Probable Parse of given test sentence</a></li>
  35 <li><a href="#sec-5">5 Combine CCM with DMV</a>
  36 <ul>
  37 <li><a href="#sec-5.1">5.1 Make inner() and outer() also allow left-first attachment</a></li>
  38 </ul>
  39 </li>
  40 <li><a href="#sec-6">6 Get a dependency parsed version of WSJ to test with</a></li>
  41 <li><a href="#sec-7">7 Initialization   </a>
  42 <ul>
  43 <li><a href="#sec-7.1">7.1 CCM Initialization    </a></li>
  44 <li><a href="#sec-7.2">7.2 Separate initialization to another file?</a></li>
  45 <li><a href="#sec-7.3">7.3 DMV Initialization probabilities</a></li>
  46 <li><a href="#sec-7.4">7.4 DMV Initialization frequencies    </a>
  47 <ul>
  48 <li><a href="#sec-7.4.1">7.4.1 P_STOP    </a></li>
  49 <li><a href="#sec-7.4.2">7.4.2 P_CHOOSE</a></li>
  50 </ul></li>
  51 </ul>
  52 </li>
  53 <li><a href="#sec-8">8 [#C] Deferred</a>
  54 <ul>
  55 <li><a href="#sec-8.1">8.1 when reestimating P_STOP etc, remove rules with p &lt; epsilon</a></li>
  56 <li><a href="#sec-8.2">8.2 inner_dmv, short ranges and impossible attachment</a></li>
  57 <li><a href="#sec-8.3">8.3 clean up the module files</a></li>
  58 <li><a href="#sec-8.4">8.4 Some (tagged) sentences are bound to come twice</a></li>
  59 <li><a href="#sec-8.5">8.5 tags as numbers or tags as strings?</a></li>
  60 </ul>
  61 </li>
  62 <li><a href="#sec-9">9 Adjacency and combining it with the inside-outside algorithm</a>
  63 <ul>
  64 <li><a href="#sec-9.1">9.1 Possible alternate type of adjacency</a></li>
  65 </ul>
  66 </li>
  67 <li><a href="#sec-10">10 Expectation Maximation in IO/DMV-terms</a></li>
  68 <li><a href="#sec-11">11 Python-stuff</a></li>
  69 <li><a href="#sec-12">12 Git</a></li>
  70 </ul>
  71 </div>
  72 </div>
  73
  74 <div id="outline-container-1" class="outline-2">
  75 <h2 id="sec-1">1 dmvccm report and project</h2>
  76 <div id="text-1">
  77
  78 <p><span class="timestamp-kwd">DEADLINE: </span> <span class="timestamp">2008-06-30 Mon</span><br/>
  79 </p><ul>
  80 <li>
  81 <a href="src/dmv.py">dmv.py</a>
  82 </li>
  83 <li>
  84 <a href="src/io.py">io.py</a>
  85 </li>
  86 <li>
  87 <a href="src/harmonic.py">harmonic.py</a>
  88
  89 </li>
  90 </ul>
  91
  92 <p>(But absolute, extended, really-quite-dead-now deadline: August 31&hellip;)
  93 </p>
  94 <p>
  95 <a href="http://www.student.uib.no/~kun041/dmvccm/DMVCCM_archive.html">Archived entries</a> from this file.
  96 </p></div>
  97
  98 </div>
  99
 100 <div id="outline-container-2" class="outline-2">
 101 <h2 id="sec-2">2 P_STOP and P_CHOOSE for IO/EM (reestimation)</h2>
 102 <div id="text-2">
 103
 104 <p><a href="src/dmv.py">dmv-P_STOP</a>
 105 Remember: The P<sub>STOP</sub> formula is upside-down (left-to-right also).
 106 (In the article..not the <a href="http://www.eecs.berkeley.edu/~klein/papers/klein_thesis.pdf">thesis</a>)
 107 </p>
 108
 109 </div>
 110
 111 <div id="outline-container-2.1" class="outline-3">
 112 <h3 id="sec-2.1">2.1 <span class="todo">TOGROK</span> L&amp;Y formula (20) or c()-formula?</h3>
 113 <div id="text-2.1">
 114
 115 <p>At first glance, I've been doing <br/>
 116 c(s,r,_a_) / [ c(r+1,t, h_) * c(s,t, h_) ] <br/>
 117 &lt;=&gt;
 118 </p><table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
 119 <col align="left"></col>
 120 <tbody>
 121 <tr><td>1/P_sent * e(s,r,_a_) * f(s,r,_a_)</td></tr>
 122 <tr><td>[ 1/P_sent * e(r+1,t,h_) * f(r+t,t,h_) * 1/P_sent * e(s,t, h_) * f(s,t,h_) ]</td></tr>
 123 </tbody>
 124 </table>
 125
 126
 127 <p>
 128 while L&amp;Y uses
 129 </p><table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
 130 <col align="left"></col>
 131 <tbody>
 132 <tr><td>w_sent = 1/P_sent * prob(h_-&gt;_a_ h_) * e(s,r,_a_) * e(r+1,t, h_) * f(s,t,h_)</td></tr>
 133 <tr><td>v_sent = 1/P_sent * e(s,t,h_) * f(s,t,h_) = c(s,t,h_)</td></tr>
 134 </tbody>
 135 </table>
 136
 137
 138 <p>
 139 Just for completion: <br/>
 140 PSTOP =eg. c(s,t,_h_) / c(s,t,h_) for certain s,t <br/>
 141 &lt;=&gt;
 142 </p><table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
 143 <col align="left"></col>
 144 <tbody>
 145 <tr><td>1/P_sent * e(s,t,_h_) * f(s,t,_h_)</td></tr>
 146 <tr><td>1/P_sent * e(s,t, h_) * f(s,t, h_)</td></tr>
 147 </tbody>
 148 </table>
 149
 150
 151 <p>
 152 vs what seems to be the direct L&amp;Y translation:
 153 </p>
 154 <table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
 155 <col align="left"></col>
 156 <tbody>
 157 <tr><td>1/P_sent * prob(_ h_ -&gt; STOP h_) * e(s,t,h_) * f(s,t,_h_)</td></tr>
 158 <tr><td>1/P_sent * e(s,t, h_) * f(s,t, h_)</td></tr>
 159 </tbody>
 160 </table>
 161
 162
 163 <p>
 164 &hellip;but if we used that, e(s,t,h_) would cancel out and we'd just be
 165 dividing the outer scores * rule-probability.. hmm.. Also, there's the
 166 problem of what 'rule-probability' means here, since there's on the
 167 one hand the CNF-rule probability, and on the other hand PSTOP and
 168 PCHOOSE &ndash; in stop rules, PSTOP is the rule-probability, while for
 169 attachment rules, it's PCHOOSE*(1-PSTOP).
 170 </p></div>
 171
 172 </div>
 173
 174 <div id="outline-container-2.2" class="outline-3">
 175 <h3 id="sec-2.2">2.2 <span class="todo">TODO</span> [#A] P_CHOOSE formula:</h3>
 176 <div id="text-2.2">
 177
 178 <ul>
 179 <li>
 180 Note taken on <span class="timestamp">2008-06-16 Mon 12:33</span> <br/>
 181 Implemented as assumed below, but this does not give 1-summing
 182 probabilities, why? Are we checking the right configurations? Eg,
 183 is it enough to just check c(s,t,h_), c(s,r,_a_) and
 184 c(r+1,t,h_) for left attachments??
 185
 186 </li>
 187 </ul>
 188
 189 <p>So far assuming this:
 190 </p>
 191 <table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
 192 <col align="left"></col><col align="left"></col>
 193 <tbody>
 194 <tr><td>P<sub>CHOOSE</sub>(a : h,R) =</td><td>&sum;<sub>corpus</sub> &sum;<sub>s=loc(h)</sub> &sum;<sub>t &gt; loc(h)</sub> &sum;<sub>loc(h) &lt; r &lt;= t</sub> c(r,t,_a_)</td></tr>
 195 <tr><td></td><td>&sum;<sub>corpus</sub> &sum;<sub>s=loc(h)</sub> &sum;<sub>t &gt; loc(h)</sub> &sum;<sub>loc(h) &lt; r &lt;= t</sub> c(s,t,h) * c(s, r-1, h_)</td></tr>
 196 <tr><td></td><td></td></tr>
 197 <tr><td>P<sub>CHOOSE</sub>(a : h,L) =</td><td>&sum;<sub>corpus</sub> &sum;<sub>s&lt;loc(h)</sub> &sum;<sub>t&gt;=loc(h)</sub> &sum;<sub>r&lt;loc(h)</sub> c(s,r,_a_)</td></tr>
 198 <tr><td></td><td>&sum;<sub>corpus</sub> &sum;<sub>s&lt;loc(h)</sub> &sum;<sub>t&gt;=loc(h)</sub> &sum;<sub>r&lt;loc(h)</sub> c(s,t,h_) * c(r+1, t, h_)</td></tr>
 199 </tbody>
 200 </table>
 201
 202 t &gt;= loc(h) since there are many possibilites for right-attachments
 203 below, and each of them alone gives a lower probability (through
 204 multiplication) to the upper tree (so add them all)
 205
 206 <p>
 207 The reason we have to check <i>both</i> children of the attachments is that we
 208 have to make sure they are contiguous (otherwise we would have no way
 209 of ruling out eg. h_-&gt;_b_,_b_-&gt;b_-&gt;_a_, where h_ covers <b>s</b> and <b>t</b>,_b_ is
 210 from <b>s</b> to <b>x&lt;r</b> and _ a_ is from <b>s</b> to <b>r</b>).
 211 </p>
 212 </div>
 213
 214 <div id="outline-container-2.2.1" class="outline-4">
 215 <h4 id="sec-2.2.1">2.2.1 <span class="todo">TOGROK</span> Problem with this formula:</h4>
 216 <div id="text-2.2.1">
 217
 218 <p>On calculating P<sub>CHOOSE</sub>(det | vbd, L) from the 1-sentence corpus
 219 "det nn vbd", there are two ways in which 'det' could be left
 220 attached; one is where it is attached non-adjacently (after 'vbd' has
 221 attached 'nn'):
 222 <pre>
 223 &gt;&gt;&gt; c_L = c(0,0,(SEAL,g.tagnum('det')),0,g,'det nn vbd'.split(),{},{})
 224 &gt;&gt;&gt; c_L
 225 0.62669683257918563 # so far so good
 226
 227 &gt;&gt;&gt; c_R = c(1,2,(RGO_L,g.tagnum('vbd')),2,g,'det nn vbd'.split(),{},{})
 228 &gt;&gt;&gt; c_R
 229 0.11312217194570134 # still seeems an OK probability
 230
 231 &gt;&gt;&gt; c_M = c(0,2,(RGO_L,g.tagnum('vbd')),2,g,'det nn vbd'.split(),{},{})
 232 &gt;&gt;&gt; c_M
 233 0.31674208144796384 # and this seems good
 234
 235 &gt;&gt;&gt; c_L / (c_R * c_M)
 236 17.490571428571432  # but this is Way off...
 237 </pre>
 238 </p>
 239 </div>
 240 </div>
 241
 242 </div>
 243
 244 <div id="outline-container-2.3" class="outline-3">
 245 <h3 id="sec-2.3">2.3 <span class="done">DONE</span> P_STOP formulas for various dir and adj:</h3>
 246 <div id="text-2.3">
 247
 248 <p><span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-06-15 Sun 23:40</span><br/>
 249 Assuming this:
 250 </p>
 251 <table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
 252 <col align="left"></col><col align="left"></col>
 253 <tbody>
 254 <tr><td>P<sub>STOP</sub>(STOP : h,L,non_adj) =</td><td>&sum;<sub>corpus</sub> &sum;<sub>s&lt;loc(h)</sub> &sum;<sub>t&gt;=loc(h)</sub> c(s,t,_h_)</td></tr>
 255 <tr><td></td><td>&sum;<sub>corpus</sub> &sum;<sub>s&lt;loc(h)</sub> &sum;<sub>t&gt;=loc(h)</sub> c(s,t,h_)</td></tr>
 256 <tr><td></td><td></td></tr>
 257 <tr><td>P<sub>STOP</sub>(STOP : h,L,adj) =</td><td>&sum;<sub>corpus</sub> &sum;<sub>s=loc(h)</sub> &sum;<sub>t&gt;=loc(h)</sub> c(s,t,_h_)</td></tr>
 258 <tr><td></td><td>&sum;<sub>corpus</sub> &sum;<sub>s=loc(h)</sub> &sum;<sub>t&gt;=loc(h)</sub> c(s,t,h_)</td></tr>
 259 <tr><td></td><td></td></tr>
 260 <tr><td>P<sub>STOP</sub>(STOP : h,R,non_adj) =</td><td>&sum;<sub>corpus</sub> &sum;<sub>s=loc(h)</sub> &sum;<sub>t&gt;loc(h)</sub> c(s,t,h_)</td></tr>
 261 <tr><td></td><td>&sum;<sub>corpus</sub> &sum;<sub>s=loc(h)</sub> &sum;<sub>t&gt;loc(h)</sub> c(s,t,h)</td></tr>
 262 <tr><td></td><td></td></tr>
 263 <tr><td>P<sub>STOP</sub>(STOP : h,R,adj) =</td><td>&sum;<sub>corpus</sub> &sum;<sub>s=loc(h)</sub> &sum;<sub>t=loc(h)</sub> c(s,t,h_)</td></tr>
 264 <tr><td></td><td>&sum;<sub>corpus</sub> &sum;<sub>s=loc(h)</sub> &sum;<sub>t=loc(h)</sub> c(s,t,h)</td></tr>
 265 </tbody>
 266 </table>
 267
 268
 269 <p>
 270 (And P<sub>STOP</sub>(-STOP|&hellip;) = 1 - P<sub>STOP</sub>(STOP|&hellip;) )
 271 </p></div>
 272 </div>
 273
 274 </div>
 275
 276 <div id="outline-container-3" class="outline-2">
 277 <h2 id="sec-3">3 <span class="done">DONE</span> outer probabilities</h2>
 278 <div id="text-3">
 279
 280 <p><span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-06-12 Thu 11:11</span><br/>
 281 <a name="outer">&nbsp;</a>
 282 There are 6 different configurations here, based on what the above
 283 (mother) node is, and what the Node for which we're computing is.
 284 Here <b>r</b> is not between <b>s</b> and <b>t</b> as in inner(), but an <i>outer</i> index. <b>loc_N</b>
 285 is the location of the Node head in the sentence, <b>loc_m</b> for the head
 286 of the mother of Node.
 287 </p>
 288 <ul>
 289 <li>
 290 mother is a RIGHT-stop:
 291 <ul>
 292 <li>
 293 outer(<b>s, t</b>, mom.LHS, <b>loc_N</b>), no inner-call
 294 </li>
 295 <li>
 296 adjacent iff <b>t</b> == <b>loc_m</b>
 297 </li>
 298 </ul>
 299 </li>
 300 <li>
 301 mother is a  LEFT-stop:
 302 <ul>
 303 <li>
 304 outer(<b>s, t</b>, mom.LHS, <b>loc_N</b>), no inner-call
 305 </li>
 306 <li>
 307 adjacent iff <b>s</b> == <b>loc_m</b>
 308
 309 </li>
 310 </ul>
 311 </li>
 312 <li>
 313 Node is on the LEFT branch (mom.L == Node)
 314 <ul>
 315 <li>
 316 and mother is a LEFT attachment:
 317 <ul>
 318 <li>
 319 <b>loc_N</b> will be in the LEFT branch, can be anything here.
 320 </li>
 321 <li>
 322 In the RIGHT, non-attached, branch we find inner(<b>t+1, r</b>, mom.R,
 323 <b>loc_m</b>) for all possible <b>loc_m</b> in the right part of the sentence.
 324 </li>
 325 <li>
 326 outer(<b>s, r</b>, mom.LHS, <b>loc_m</b>).
 327 </li>
 328 <li>
 329 adjacent iff <b>t+1</b> == <b>loc_m</b>
 330 </li>
 331 </ul>
 332 </li>
 333 <li>
 334 and mother is a RIGHT attachment:
 335 <ul>
 336 <li>
 337 <b>loc_m</b> = <b>loc_N</b>.
 338 </li>
 339 <li>
 340 In the RIGHT, attached, branch we find inner(<b>t+1, r</b>, mom.R, <b>loc_R</b>) for
 341 all possible <b>loc_R</b> in the right part of the sentence.
 342 </li>
 343 <li>
 344 outer(<b>s, r</b>, mom.LHS, <b>loc_N</b>).
 345 </li>
 346 <li>
 347 adjacent iff <b>t</b> == <b>loc_m</b>
 348
 349 </li>
 350 </ul>
 351 </li>
 352 </ul>
 353 </li>
 354 <li>
 355 Node is on the RIGHT branch (mom.R == Node)
 356 <ul>
 357 <li>
 358 and mother is a LEFT attachment:
 359 <ul>
 360 <li>
 361 <b>loc_m</b> = <b>loc_N</b>.
 362 </li>
 363 <li>
 364 In the LEFT, attached, branch we find inner(<b>r, s-1</b>, mom.L, <b>loc_L</b>) for
 365 all possible <b>loc_L</b> in the left part of the sentence.
 366 </li>
 367 <li>
 368 outer(<b>r, t</b>, mom.LHS, <b>loc_m</b>).
 369 </li>
 370 <li>
 371 adjacent iff <b>s</b> == <b>loc_m</b>
 372 </li>
 373 </ul>
 374 </li>
 375 <li>
 376 and mother is a RIGHT attachment:
 377 <ul>
 378 <li>
 379 <b>loc_N</b> will be in the RIGHT branch, can be anything here.
 380 </li>
 381 <li>
 382 In the LEFT, non-attached, branch we find inner(<b>r, s-1</b>, mom.L, <b>loc_m</b>) for
 383 all possible <b>loc_m</b> in the left part of the sentence.
 384 </li>
 385 <li>
 386 outer(<b>r, t</b>, mom.LHS, <b>loc_N</b>).
 387 </li>
 388 <li>
 389 adjacent iff <b>s-1</b> == <b>loc_m</b>
 390
 391 </li>
 392 </ul>
 393 </li>
 394 </ul>
 395 </li>
 396 </ul>
 397
 398 <p><img src="outer_attachments.jpg"/>
 399 </p>
 400 <p>
 401 <pre>
 402  in notes:   in code (constants):    in Klein thesis:
 403 -------------------------------------------------------------------
 404  _h_         SEAL                    bar over h
 405   h_         RGO_L                   right-under-left-arrow over h
 406   h          GO_R                    right-arrow over h
 407
 408              LGO_R                   left-under-right-arrow over h
 409              GO_L                    left-arrow over h
 410 </pre>
 411 </p>
 412 <p>
 413 Also, unlike in <a href="http://bibsonomy.org/bibtex/2b9f6798bb092697da7042ca3f5dee795">Lari & Young</a>, non-ROOT ('S') symbols may cover the
 414 whole sentence, but ROOT may <i>only</i> appear if it covers the whole
 415 sentence.
 416 </p>
 417
 418 </div>
 419
 420 </div>
 421
 422 <div id="outline-container-4" class="outline-2">
 423 <h2 id="sec-4">4 <span class="todo">TOGROK</span> Find Most Probable Parse of given test sentence</h2>
 424 <div id="text-4">
 425
 426 <p>We could probably do it pretty easily from the chart:
 427 </p><ul>
 428 <li>
 429 for all loc_h, select the (0,len(sent),ROOT,loc_h) that has the highest p
 430 </li>
 431 <li>
 432 for stops, just keep going
 433 </li>
 434 <li>
 435 for rewrites, for loc_dep select the one that has the highest p
 436 </li>
 437 </ul>
 438 </div>
 439
 440 </div>
 441
 442 <div id="outline-container-5" class="outline-2">
 443 <h2 id="sec-5">5 <span class="todo">TOGROK</span> Combine CCM with DMV</h2>
 444 <div id="text-5">
 445
 446 <p>Page 108 (pdf: 124) of <a href="http://www.eecs.berkeley.edu/~klein/papers/klein_thesis.pdf">Klein's thesis</a> gives info about this.
 447 </p>
 448 </div>
 449
 450 <div id="outline-container-5.1" class="outline-3">
 451 <h3 id="sec-5.1">5.1 <span class="todo">TODO</span> Make inner() and outer() also allow left-first attachment</h3>
 452 <div id="text-5.1">
 453
 454 <p>Using P<sub>ORDER</sub>(<i>left-first</i> | w) etc.
 455 </p></div>
 456 </div>
 457
 458 </div>
 459
 460 <div id="outline-container-6" class="outline-2">
 461 <h2 id="sec-6">6 <span class="todo">TODO</span> Get a dependency parsed version of WSJ to test with</h2>
 462 <div id="text-6">
 463
 464 </div>
 465
 466 </div>
 467
 468 <div id="outline-container-7" class="outline-2">
 469 <h2 id="sec-7">7 Initialization   </h2>
 470 <div id="text-7">
 471
 472 <p><a href="/Users/kiwibird/Documents/Skole/V08/Probability/dmvccm/src/dmv.py">dmv-inits</a>
 473 </p>
 474 <p>
 475 We go through the corpus, since the probabilities are based on how far
 476 away in the sentence arguments are from their heads.
 477 </p>
 478 </div>
 479
 480 <div id="outline-container-7.1" class="outline-3">
 481 <h3 id="sec-7.1">7.1 <span class="todo">TOGROK</span> CCM Initialization    </h3>
 482 <div id="text-7.1">
 483
 484 <p>P<sub>SPLIT</sub> used here&hellip; how, again?
 485 </p></div>
 486
 487 </div>
 488
 489 <div id="outline-container-7.2" class="outline-3">
 490 <h3 id="sec-7.2">7.2 <span class="done">DONE</span> Separate initialization to another file?                      &nbsp;&nbsp;&nbsp;<span class="tag">PRETTIER</span></h3>
 491 <div id="text-7.2">
 492
 493 <p><span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-06-08 Sun 12:51</span><br/>
 494 <a href="src/harmonic.py">harmonic.py</a>
 495 </p></div>
 496
 497 </div>
 498
 499 <div id="outline-container-7.3" class="outline-3">
 500 <h3 id="sec-7.3">7.3 <span class="done">DONE</span> DMV Initialization probabilities</h3>
 501 <div id="text-7.3">
 502
 503 <p>(from initialization frequency)
 504 </p></div>
 505
 506 </div>
 507
 508 <div id="outline-container-7.4" class="outline-3">
 509 <h3 id="sec-7.4">7.4 <span class="done">DONE</span> DMV Initialization frequencies    </h3>
 510 <div id="text-7.4">
 511
 512 <p><span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-05-27 Tue 20:04</span><br/>
 513 </p>
 514 </div>
 515
 516 <div id="outline-container-7.4.1" class="outline-4">
 517 <h4 id="sec-7.4.1">7.4.1 P_STOP    </h4>
 518 <div id="text-7.4.1">
 519
 520 <p>P<sub>STOP</sub> is not well defined by K&amp;M. One possible interpretation given
 521 the sentence [det nn vb nn] is
 522 <pre>
 523  f_{STOP}( STOP|det, L, adj) +1
 524  f_{STOP}(-STOP|det, L, adj) +0
 525  f_{STOP}( STOP|det, L, non_adj) +1
 526  f_{STOP}(-STOP|det, L, non_adj) +0
 527  f_{STOP}( STOP|det, R, adj) +0
 528  f_{STOP}(-STOP|det, R, adj) +1
 529
 530  f_{STOP}( STOP|nn, L, adj) +0
 531  f_{STOP}(-STOP|nn, L, adj) +1
 532  f_{STOP}( STOP|nn, L, non_adj) +1  # since there's at least one to the left
 533  f_{STOP}(-STOP|nn, L, non_adj) +0
 534 </pre>
 535 </p><ul>
 536 <li id="sec-7.4.1.1"><span class="todo">TODO</span> tweak<br/>
 537 <pre>
 538             f[head,  'STOP', 'LN'] += (i_h &lt;= 1)     # first two words
 539             f[head, '-STOP', 'LN'] += (not i_h &lt;= 1)
 540             f[head,  'STOP', 'LA'] += (i_h == 0)     # very first word
 541             f[head, '-STOP', 'LA'] += (not i_h == 0)
 542             f[head,  'STOP', 'RN'] += (i_h &gt;= n - 2) # last two words
 543             f[head, '-STOP', 'RN'] += (not i_h &gt;= n - 2)
 544             f[head,  'STOP', 'RA'] += (i_h == n - 1) # very last word
 545             f[head, '-STOP', 'RA'] += (not i_h == n - 1)
 546 </pre>
 547 vs
 548 <pre>
 549             # this one requires some additional rewriting since it
 550             # introduces divisions by zero
 551             f[head,  'STOP', 'LN'] += (i_h == 1)     # second word
 552             f[head, '-STOP', 'LN'] += (not i_h &lt;= 1) # not first two
 553             f[head,  'STOP', 'LA'] += (i_h == 0)     # first word
 554             f[head, '-STOP', 'LA'] += (not i_h == 0) # not first
 555             f[head,  'STOP', 'RN'] += (i_h == n - 2)     # second-to-last
 556             f[head, '-STOP', 'RN'] += (not i_h &gt;= n - 2) # not last two
 557             f[head,  'STOP', 'RA'] += (i_h == n - 1)     # last word
 558             f[head, '-STOP', 'RA'] += (not i_h == n - 1) # not last
 559 </pre>
 560 vs
 561 <pre>
 562             f[head,  'STOP', 'LN'] += (i_h == 1)     # second word
 563             f[head, '-STOP', 'LN'] += (not i_h == 1) # not second
 564             f[head,  'STOP', 'LA'] += (i_h == 0)     # first word
 565             f[head, '-STOP', 'LA'] += (not i_h == 0) # not first
 566             f[head,  'STOP', 'RN'] += (i_h == n - 2)     # second-to-last
 567             f[head, '-STOP', 'RN'] += (not i_h == n - 2) # not second-to-last
 568             f[head,  'STOP', 'RA'] += (i_h == n - 1)     # last word
 569             f[head, '-STOP', 'RA'] += (not i_h == n - 1) # not last
 570 </pre>
 571 vs
 572 "all words take the same number of arguments" interpreted as
 573 <pre>
 574 for all heads:
 575     p_STOP(head, 'STOP', 'LN') = 0.3
 576     p_STOP(head, 'STOP', 'LA') = 0.5
 577     p_STOP(head, 'STOP', 'RN') = 0.4
 578     p_STOP(head, 'STOP', 'RA') = 0.7
 579 </pre>
 580 (which we easily may tweak in init_zeros())
 581 </li>
 582 </ul>
 583 </div>
 584
 585 </div>
 586
 587 <div id="outline-container-7.4.2" class="outline-4">
 588 <h4 id="sec-7.4.2">7.4.2 P_CHOOSE</h4>
 589 <div id="text-7.4.2">
 590
 591 <p>Go through the corpus, counting distances between heads and
 592 arguments. In [det nn vb nn], we give
 593 </p><ul>
 594 <li>
 595 f<sub>CHOOSE</sub>(nn|det, R) +1/1 + C
 596 </li>
 597 <li>
 598 f<sub>CHOOSE</sub>(vb|det, R) +1/2 + C
 599 </li>
 600 <li>
 601 f<sub>CHOOSE</sub>(nn|det, R) +1/3 + C
 602 <ul>
 603 <li>
 604 If this were the full corpus, P<sub>CHOOSE</sub>(nn|det, R) would have
 605 (1+1/3+2C) / sum_a f<sub>CHOOSE</sub>(a|det, R)
 606
 607 </li>
 608 </ul>
 609 </li>
 610 </ul>
 611
 612 <p>The ROOT gets "each argument with equal probability", so in a sentence
 613 of three words, 1/3 for each (in [nn vb nn], 'nn' gets 2/3). Basically
 614 just a frequency count of the corpus&hellip;
 615 </p>
 616 <p>
 617 In a sense there are no terminal probabilities, since an <i>h</i> can only
 618 rewrite to an 'h' anyway (it's just a check for whether, at this
 619 location in the sentence, we have the right POS-tag).
 620 </p></div>
 621 </div>
 622 </div>
 623
 624 </div>
 625
 626 <div id="outline-container-8" class="outline-2">
 627 <h2 id="sec-8">8 [#C] Deferred</h2>
 628 <div id="text-8">
 629
 630 <p><a href="http://wiki.python.org/moin/PythonSpeed/PerformanceTips">http://wiki.python.org/moin/PythonSpeed/PerformanceTips</a> Eg., use
 631 map/reduce/filter/[i for i in [i's]]/(i for i in [i's]) instead of
 632 for-loops; use local variables for globals (global variables or or
 633 functions), etc.
 634 </p>
 635
 636 </div>
 637
 638 <div id="outline-container-8.1" class="outline-3">
 639 <h3 id="sec-8.1">8.1 <span class="todo">TODO</span> when reestimating P_STOP etc, remove rules with p &lt; epsilon   &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span></h3>
 640 <div id="text-8.1">
 641
 642 </div>
 643
 644 </div>
 645
 646 <div id="outline-container-8.2" class="outline-3">
 647 <h3 id="sec-8.2">8.2 <span class="todo">TODO</span> inner_dmv, short ranges and impossible attachment             &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span></h3>
 648 <div id="text-8.2">
 649
 650 <p>If s-t &lt;= 2, there can be only one attachment below, so don't recurse
 651 with both Lattach=True and Rattach=True.
 652 </p>
 653 <p>
 654 If s-t &lt;= 1, there can be no attachment below, so only recurse with
 655 Lattach=False, Rattach=False.
 656 </p>
 657 <p>
 658 Put this in the loop under rewrite rules (could also do it in the STOP
 659 section, but that would only have an effect on very short sentences).
 660 </p></div>
 661
 662 </div>
 663
 664 <div id="outline-container-8.3" class="outline-3">
 665 <h3 id="sec-8.3">8.3 <span class="todo">TODO</span> clean up the module files                                     &nbsp;&nbsp;&nbsp;<span class="tag">PRETTIER</span></h3>
 666 <div id="text-8.3">
 667
 668 <p>Is there better way to divide dmv and harmonic? There's a two-way
 669 dependency between the modules. Guess there could be a third file that
 670 imports both the initialization and the actual EM stuff, while a file
 671 containing constants and classes could be imported by all others:
 672 <pre>
 673  dmv.py imports dmv_EM.py imports dmv_classes.py
 674  dmv.py imports dmv_inits.py imports dmv_classes.py
 675 </pre>
 676 </p>
 677 </div>
 678
 679 </div>
 680
 681 <div id="outline-container-8.4" class="outline-3">
 682 <h3 id="sec-8.4">8.4 <span class="todo">TOGROK</span> Some (tagged) sentences are bound to come twice             &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span></h3>
 683 <div id="text-8.4">
 684
 685 <p>Eg, first sort and count, so that the corpus
 686 [['nn','vbd','det','nn'],
 687 ['vbd','nn','det','nn'],
 688 ['nn','vbd','det','nn']]
 689 becomes
 690 [(['nn','vbd','det','nn'],2),
 691 (['vbd','nn','det','nn'],1)]
 692 and then in each loop through sentences, make sure we handle the
 693 frequency correctly.
 694 </p>
 695 <p>
 696 Is there much to gain here?
 697 </p>
 698 </div>
 699
 700 </div>
 701
 702 <div id="outline-container-8.5" class="outline-3">
 703 <h3 id="sec-8.5">8.5 <span class="todo">TOGROK</span> tags as numbers or tags as strings?                         &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span></h3>
 704 <div id="text-8.5">
 705
 706 <p>Need to clean up the representation.
 707 </p>
 708 <p>
 709 Stick with tag-strings in initialization then switch to numbers for
 710 IO-algorithm perhaps? Can probably afford more string-matching in
 711 initialization..
 712 </p></div>
 713 </div>
 714
 715 </div>
 716
 717 <div id="outline-container-9" class="outline-2">
 718 <h2 id="sec-9">9 Adjacency and combining it with the inside-outside algorithm</h2>
 719 <div id="text-9">
 720
 721 <p>Each DMV_Rule has both a probN and a probA, for adjacencies. inner()
 722 and outer() needs the correct one in each case.
 723 </p>
 724 <p>
 725 In each inner() call, loc_h is the location of the head of this
 726 dependency structure. In each outer() call, it's the head of the <i>Node</i>,
 727 the structure we're looking outside of.
 728 </p>
 729 <p>
 730 We call inner() for each location of a head, and on each terminal,
 731 loc_h must equal s (and t). In the recursive attachment calls, we use
 732 the locations (sentence indices) of words to the left or right of the
 733 head in calls to inner(). <i>loc_h lets us check whether we need probN or probA</i>.
 734 </p>
 735 </div>
 736
 737 <div id="outline-container-9.1" class="outline-3">
 738 <h3 id="sec-9.1">9.1 Possible alternate type of adjacency</h3>
 739 <div id="text-9.1">
 740
 741 <p>K&amp;M's adjacency is just whether or not an argument has been generated
 742 in the current direction yet. One could also make a stronger type of
 743 adjacency, where h and a are not adjacent if b is in between, eg. with
 744 the sentence "a b h" and the structure ((h-&gt;a), (a-&gt;b)), h is
 745 K&amp;M-adjacent to a, but not next to a, since b is in between. It's easy
 746 to check this type of adjacency in inner(), but it needs new rules for
 747 P_STOP reestimation.
 748 </p></div>
 749 </div>
 750
 751 </div>
 752
 753 <div id="outline-container-10" class="outline-2">
 754 <h2 id="sec-10">10 Expectation Maximation in IO/DMV-terms</h2>
 755 <div id="text-10">
 756
 757 <p>outer(s,t,Node) and inner(s,t,Node) calculates the expected number of
 758 trees (CNF-)headed by Node from <b>s</b> to <b>t</b> (sentence positions). This uses
 759 the P_STOP and P_CHOOSE values, which have been conveniently
 760 distributed into CNF rules as probN and probA (non-adjacent and
 761 adjacent probabilites).
 762 </p>
 763 <p>
 764 When re-estimating, we use the expected values from outer() and
 765 inner() to get new values for P_STOP and P_CHOOSE. When we've
 766 re-estimated for the entire corpus, we distribute P_STOP and P_CHOOSE
 767 into the CNF rules again, so that in the next round we use new probN
 768 and probA to find outer- and inner-probabilites.
 769 </p>
 770 <p>
 771 The distribution of P_STOP and P_CHOOSE into CNF rules also happens in
 772 init_normalize() (here along with the creation of P_STOP and
 773 P_CHOOSE); P_STOP is used to create CNF rules where one branch of the
 774 rule is STOP, P_CHOOSE is used to create rules of the form
 775 <pre>
 776  h  -&gt; h  _a_
 777  h_ -&gt; h_ _a_
 778 </pre>
 779 </p>
 780 <p>
 781 Since "adjacency" is not captured in regular CNF rules, we need two
 782 probabilites for each rule, and outer() and inner() have to know when
 783 to use which.
 784 </p>
 785
 786 </div>
 787
 788 </div>
 789
 790 <div id="outline-container-11" class="outline-2">
 791 <h2 id="sec-11">11 Python-stuff</h2>
 792 <div id="text-11">
 793
 794 <p>Make those debug statements steal a bit less attention in emacs:
 795 <pre>
 796 (font-lock-add-keywords
 797  'python-mode                   ; not really regexp, a bit slow
 798  '(("^\\( *\\)\\(\\if +'.+' +in +io.DEBUG. *\\(
 799 \\1    .+$\\)+\\)" 2 font-lock-preprocessor-face t)))
 800 (font-lock-add-keywords
 801  'python-mode
 802  '(("\\&lt;\\(\\(io\\.\\)?debug(.+)\\)" 1 font-lock-preprocessor-face t)))
 803 </pre>
 804 </p>
 805 <ul>
 806 <li>
 807 <a href="src/pseudo.py">pseudo.py</a>
 808 </li>
 809 <li>
 810 <a href="http://nltk.org/doc/en/structured-programming.html">http://nltk.org/doc/en/structured-programming.html</a> recursive dynamic
 811 </li>
 812 <li>
 813 <a href="http://nltk.org/doc/en/advanced-parsing.html">http://nltk.org/doc/en/advanced-parsing.html</a>
 814 </li>
 815 <li>
 816 <a href="http://jaynes.colorado.edu/PythonIdioms.html">http://jaynes.colorado.edu/PythonIdioms.html</a>
 817
 818
 819
 820 </li>
 821 </ul>
 822 </div>
 823
 824 </div>
 825
 826 <div id="outline-container-12" class="outline-2">
 827 <h2 id="sec-12">12 Git</h2>
 828 <div id="text-12">
 829
 830 <p>Repository web page: <a href="http://repo.or.cz/w/dmvccm.git">http://repo.or.cz/w/dmvccm.git</a>
 831 </p>
 832 <p>
 833 Setting up a new project:
 834 <pre>
 835  git init
 836  git add .
 837  git commit -m "first release"
 838 </pre>
 839 </p>
 840 <p>
 841 Later on: (<code>-a</code> does <code>git rm</code> and <code>git add</code> automatically)
 842 <pre>
 843  git init
 844  git commit -a -m "some subsequent release"
 845 </pre>
 846 </p>
 847 <p>
 848 Then push stuff up to the remote server:
 849 <pre>
 850  git push git+ssh://username@repo.or.cz/srv/git/dmvccm.git master
 851 </pre>
 852 </p>
 853 <p>
 854 (<code>eval `ssh-agent`</code> and <code>ssh-add</code> to avoid having to type in keyphrase all
 855 the time)
 856 </p>
 857 <p>
 858 Make a copy of the (remote) master branch:
 859 <pre>
 860  git clone git://repo.or.cz/dmvccm.git
 861 </pre>
 862 </p>
 863 <p>
 864 Make and name a new branch in this folder
 865 <pre>
 866  git checkout -b mybranch
 867 </pre>
 868 </p>
 869 <p>
 870 To save changes in <code>mybranch</code>:
 871 <pre>
 872  git commit -a
 873 </pre>
 874 </p>
 875 <p>
 876 Go back to the master branch (uncommitted changes from <code>mybranch</code> are
 877 carried over):
 878 <pre>
 879  git checkout master
 880 </pre>
 881 </p>
 882 <p>
 883 Try out:
 884 <pre>
 885  git add --interactive
 886 </pre>
 887 </p>
 888 <p>
 889 Good tutorial:
 890 <a href="http://www-cs-students.stanford.edu/~blynn//gitmagic/">http://www-cs-students.stanford.edu/~blynn//gitmagic/</a>
 891 </p></div>
 892 </div>
 893 <div id="postamble"><p class="author"> Author: Kevin Brubeck Unhammer
 894 <a href="mailto:K.BrubeckUnhammer at student uva nl ">&lt;K.BrubeckUnhammer at student uva nl &gt;</a>
 895 </p>
 896 <p class="date"> Date: 2008/06/18 10:34:09</p>
 897 </div><p class="postamble">Skrive vha. emacs + <a href='http://orgmode.org/'>org-mode</a></p></body>
 898 </html>