DMVCCM.html

   1 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
   2                "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
   3 <html xmlns="http://www.w3.org/1999/xhtml"
   4 lang="en" xml:lang="en">
   5 <head>
   6 <title>DMV/CCM &ndash; todo-list / progress</title>
   7 <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
   8 <meta name="generator" content="Org-mode"/>
   9 <meta name="generated" content="2008/06/11 16:40:21"/>
  10 <meta name="author" content="Kevin Brubeck Unhammer"/>
  11 <link rel="stylesheet" type="text/css" href="http://www.student.uib.no/~kun041/org.css">
  12 </head><body>
  13 <h1 class="title">DMV/CCM &ndash; todo-list / progress</h1>
  14
  15
  16 <div id="table-of-contents">
  17 <h2>Table of Contents</h2>
  18 <div id="text-table-of-contents">
  19 <ul>
  20 <li><a href="#sec-1">1 dmvccm report and project</a></li>
  21 <li><a href="#sec-2">2 [#A] outer probabilities</a></li>
  22 <li><a href="#sec-3">3 [#B] P_STOP and P_CHOOSE for IO/EM (reestimation)</a></li>
  23 <li><a href="#sec-4">4 Find Most Probable Parse of given test sentence</a></li>
  24 <li><a href="#sec-5">5 Combine CCM with DMV</a></li>
  25 <li><a href="#sec-6">6 Get a dependency parsed version of WSJ to test with</a></li>
  26 <li><a href="#sec-7">7 Initialization   </a></li>
  27 <li><a href="#sec-8">8 [#C] Deferred</a></li>
  28 <li><a href="#sec-9">9 Adjacency and combining it with inner()</a></li>
  29 <li><a href="#sec-10">10 Expectation Maximation in IO/DMV-terms</a></li>
  30 <li><a href="#sec-11">11 Python-stuff</a></li>
  31 <li><a href="#sec-12">12 Git</a></li>
  32 </ul>
  33 </div>
  34 </div>
  35
  36 <div id="outline-container-1" class="outline-2">
  37 <h2 id="sec-1">1 dmvccm report and project</h2>
  38 <div id="text-1">
  39
  40 <p><span class="timestamp-kwd">DEADLINE: </span> <span class="timestamp">2008-06-30 Mon</span><br/>
  41 </p><ul>
  42 <li>
  43 <a href="src/dmv.py">dmv.py</a>
  44 </li>
  45 <li>
  46 <a href="src/io.py">io.py</a>
  47 </li>
  48 <li>
  49 <a href="src/harmonic.py">harmonic.py</a>
  50
  51 </li>
  52 </ul>
  53
  54 <p>(But absolute, extended, really-quite-dead-now deadline: August 31&hellip;)
  55 </p></div>
  56
  57 </div>
  58
  59 <div id="outline-container-2" class="outline-2">
  60 <h2 id="sec-2">2 <span class="todo">TODO</span> [#A] outer probabilities</h2>
  61 <div id="text-2">
  62
  63 <p>There are 6 different configurations here, based on what the above
  64 (mother) node is, and what the Node for which we're computing is.
  65 Here <b>r</b> is not between <b>s</b> and <b>t</b> as in inner(), but an <i>outer</i> index. <b>loc_N</b>
  66 is the location of the Node head in the sentence, <b>loc_m</b> for the head
  67 of the mother rule.
  68 </p>
  69 <ul>
  70 <li>
  71 mother is a RIGHT-stop:
  72 <ul>
  73 <li>
  74 outer(<b>s, t</b>, mom.LHS, <b>loc_N</b>), no inner-call
  75 </li>
  76 <li>
  77 adjacent iff <b>t</b> == <b>loc_m</b>
  78 </li>
  79 </ul>
  80 </li>
  81 <li>
  82 mother is a  LEFT-stop:
  83 <ul>
  84 <li>
  85 outer(<b>s, t</b>, mom.LHS, <b>loc_N</b>), no inner-call
  86 </li>
  87 <li>
  88 adjacent iff <b>s</b> == <b>loc_m</b>
  89
  90 </li>
  91 </ul>
  92 </li>
  93 <li>
  94 Node is on the LEFT branch (mom.L == Node)
  95 <ul>
  96 <li>
  97 and mother is a LEFT attachment:
  98 <ul>
  99 <li>
 100 <b>loc_N</b> will be in the LEFT branch, can be anything here.
 101 </li>
 102 <li>
 103 In the RIGHT, attached, branch we find inner(<b>t+1, r</b>, mom.R, <b>loc_m</b>) for
 104 all possible <b>loc_m</b> in the right part of the sentence.
 105 </li>
 106 <li>
 107 outer(<b>s, r</b>, mom.LHS, <b>loc_m</b>).
 108 </li>
 109 <li>
 110 adjacent iff <b>t+1</b> == <b>loc_m</b>
 111 </li>
 112 </ul>
 113 </li>
 114 <li>
 115 and mother is a RIGHT attachment:
 116 <ul>
 117 <li>
 118 <b>loc_m</b> = <b>loc_N</b>.
 119 </li>
 120 <li>
 121 In the RIGHT, attached, branch we find inner(<b>t+1, r</b>, mom.R, <b>loc_R</b>) for
 122 all possible <b>loc_R</b> in the right part of the sentence.
 123 </li>
 124 <li>
 125 outer(<b>s, r</b>, mom.LHS, <b>loc_N</b>).
 126 </li>
 127 <li>
 128 adjacent iff <b>t</b> == <b>loc_m</b>
 129
 130 </li>
 131 </ul>
 132 </li>
 133 </ul>
 134 </li>
 135 <li>
 136 Node is on the RIGHT branch (mom.R == Node)
 137 <ul>
 138 <li>
 139 and mother is a LEFT attachment:
 140 <ul>
 141 <li>
 142 <b>loc_m</b> = <b>loc_N</b>.
 143 </li>
 144 <li>
 145 In the LEFT, attached, branch we find inner(<b>r, s-1</b>, mom.L, <b>loc_L</b>) for
 146 all possible <b>loc_L</b> in the left part of the sentence.
 147 </li>
 148 <li>
 149 outer(<b>r, t</b>, mom.LHS, <b>loc_m</b>).
 150 </li>
 151 <li>
 152 adjacent iff <b>s</b> == <b>loc_m</b>
 153 </li>
 154 </ul>
 155 </li>
 156 <li>
 157 and mother is a RIGHT attachment:
 158 <ul>
 159 <li>
 160 <b>loc_N</b> will be in the RIGHT branch, can be anything here.
 161 </li>
 162 <li>
 163 In the RIGHT, attached, branch we find inner(<b>r, s-1</b>, mom.L, <b>loc_m</b>) for
 164 all possible <b>loc_m</b> in the right part of the sentence.
 165 </li>
 166 <li>
 167 outer(<b>r, t</b>, mom.LHS, <b>loc_N</b>).
 168 </li>
 169 <li>
 170 adjacent iff <b>s-1</b> == <b>loc_m</b>
 171
 172 </li>
 173 </ul>
 174 </li>
 175 </ul>
 176 </li>
 177 </ul>
 178
 179 <p><img src="outer_attachments.jpg"/>
 180 </p>
 181
 182
 183 </div>
 184
 185 </div>
 186
 187 <div id="outline-container-3" class="outline-2">
 188 <h2 id="sec-3">3 <span class="todo">TODO</span> [#B] P_STOP and P_CHOOSE for IO/EM (reestimation)</h2>
 189 <div id="text-3">
 190
 191 <p><a href="src/dmv.py">dmv-P_STOP</a>
 192 Remember: The P<sub>STOP</sub> formula is upside-down (left-to-right also).
 193 (In the article..not the thesis)
 194 </p>
 195 <ul>
 196 <li id="sec-3.1"><span class="todo">TOGROK</span> [#A] How do we only count from completed trees?<br/>
 197 The chart from inner() contains several entries where the only ways to
 198 reach them (from above) are through nodes that have zero probability;
 199 we don't want these to give counts for reestimation.
 200
 201 <p>
 202 prune() weeds out entries with only zero-probability mothers (or
 203 grandmothers), but is incredibly slow (reestimating with pruning takes
 204 almost 50 times longer).
 205 </p>
 206 <p>
 207 Another solution is to add
 208 </p></li>
 209 <li id="sec-3.2">P_STOP formulas for various dir and adj:<br/>
 210 Assuming this:
 211 <ul>
 212 <li>
 213 P<sub>STOP</sub>(STOP|h,L,non_adj) = &sum;<sub>corpus</sub> &sum;<sub>s&lt;loc(h)</sub> &sum;<sub>t</sub>
 214 inner(s,t,_h_, &hellip;) / &sum;<sub>corpus</sub> &sum;<sub>s&lt;loc(h)</sub> &sum;<sub>t</sub>
 215 inner(s,t,h_, &hellip;)
 216 </li>
 217 <li>
 218 P<sub>STOP</sub>(STOP|h,L,adj) = &sum;<sub>corpus</sub> &sum;<sub>s=loc(h)</sub> &sum;<sub>t</sub>
 219 inner(s,t,_h_, &hellip;) / &sum;<sub>corpus</sub> &sum;<sub>s=loc(h)</sub> &sum;<sub>t</sub>
 220 inner(s,t,h_, &hellip;)
 221 </li>
 222 <li>
 223 P<sub>STOP</sub>(STOP|h,R,non_adj) = &sum;<sub>corpus</sub> &sum;<sub>s</sub> &sum;<sub>t&gt;loc(h)</sub>
 224 inner(s,t,_h_, &hellip;) / &sum;<sub>corpus</sub> &sum;<sub>s</sub> &sum;<sub>t&gt;loc(h)</sub>
 225 inner(s,t,h_, &hellip;)
 226 </li>
 227 <li>
 228 P<sub>STOP</sub>(STOP|h,R,adj) = &sum;<sub>corpus</sub> &sum;<sub>s</sub> &sum;<sub>t=loc(h)</sub>
 229 inner(s,t,_h_, &hellip;) / &sum;<sub>corpus</sub> &sum;<sub>s</sub> &sum;<sub>t=loc(h)</sub>
 230 inner(s,t,h_, &hellip;)
 231
 232 </li>
 233 </ul>
 234
 235 <p>(And P<sub>STOP</sub>(-STOP|&hellip;) = 1 - P<sub>STOP</sub>(STOP|&hellip;) )
 236 </p></li>
 237 <li id="sec-3.3">P_CHOOSE formula:<br/>
 238 Assuming this:
 239 <ul>
 240 <li>
 241 P<sub>CHOOSE</sub>(a|h,L) = &sum;<sub>corpus</sub> &sum;<sub>s&lt;loc(h)</sub> &sum;<sub>t&gt;=loc(h)</sub> &sum;<sub>r&lt;t</sub>
 242 inner(s,r,_a_, &hellip;) / &sum;<sub>corpus</sub> &sum;<sub>s&lt;loc(h)</sub> &sum;<sub>t&gt;=loc(h)</sub>
 243 inner(s,t,h_,&hellip;)
 244 <ul>
 245 <li>
 246 t &gt;= loc(h) since there are many possibilites for
 247 right-attachments below, and each of them alone gives a lower
 248 probability (through multiplication) to the upper tree (so sum
 249 them all)
 250 </li>
 251 </ul>
 252 </li>
 253 <li>
 254 P<sub>CHOOSE</sub>(a|h,R) = &sum;<sub>corpus</sub> &sum;<sub>s=loc(h)</sub> &sum;<sub>r&gt;s</sub>
 255 inner(s,r,_a_,&hellip;) / &sum;<sub>corpus</sub> &sum;<sub>s=loc(h)</sub> &sum;<sub>t</sub>
 256 inner(s,t,h, &hellip;)
 257
 258 </li>
 259 </ul>
 260 </li>
 261 </ul>
 262 </div>
 263
 264 </div>
 265
 266 <div id="outline-container-4" class="outline-2">
 267 <h2 id="sec-4">4 <span class="todo">TOGROK</span> Find Most Probable Parse of given test sentence</h2>
 268 <div id="text-4">
 269
 270 <p>We could probably do it pretty easily from the chart:
 271 </p><ul>
 272 <li>
 273 for all loc_h, select the (0,len(sent),ROOT,loc_h) that has the highest p
 274 </li>
 275 <li>
 276 for stops, just keep going
 277 </li>
 278 <li>
 279 for rewrites, for loc_dep select the one that has the highest p
 280 </li>
 281 </ul>
 282 </div>
 283
 284 </div>
 285
 286 <div id="outline-container-5" class="outline-2">
 287 <h2 id="sec-5">5 <span class="todo">TOGROK</span> Combine CCM with DMV</h2>
 288 <div id="text-5">
 289
 290 </div>
 291
 292 </div>
 293
 294 <div id="outline-container-6" class="outline-2">
 295 <h2 id="sec-6">6 <span class="todo">TODO</span> Get a dependency parsed version of WSJ to test with</h2>
 296 <div id="text-6">
 297
 298 </div>
 299
 300 </div>
 301
 302 <div id="outline-container-7" class="outline-2">
 303 <h2 id="sec-7">7 Initialization   </h2>
 304 <div id="text-7">
 305
 306 <p><a href="/Users/kiwibird/Documents/Skole/V08/Probability/dmvccm/src/dmv.py">dmv-inits</a>
 307 </p>
 308 <p>
 309 We do have to go through the corpus, since the probabilities are based
 310 on how far away in the sentence arguments are from their heads.
 311 </p><ul>
 312 <li id="sec-7.1"><span class="todo">TOGROK</span> CCM Initialization    <br/>
 313 P<sub>SPLIT</sub> used here&hellip; how, again?
 314 </li>
 315 <li id="sec-7.2"><span class="done">DONE</span> Separate initialization to another file?                      &nbsp;&nbsp;&nbsp;<span class="tag">PRETTIER</span><br/>
 316 <span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-06-08 Sun 12:51</span><br/>
 317 <a href="src/harmonic.py">harmonic.py</a>
 318 </li>
 319 <li id="sec-7.3"><span class="done">DONE</span> DMV Initialization probabilities<br/>
 320 (from initialization frequency)
 321 </li>
 322 <li id="sec-7.4"><span class="done">DONE</span> DMV Initialization frequencies    <br/>
 323 <span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-05-27 Tue 20:04</span><br/>
 324 <ul>
 325 <li id="sec-7.4.1">P_STOP    <br/>
 326 P<sub>STOP</sub> is not well defined by K&amp;M. One possible interpretation given
 327 the sentence [det nn vb nn] is
 328 <pre>
 329  f_{STOP}( STOP|det, L, adj) +1
 330  f_{STOP}(-STOP|det, L, adj) +0
 331  f_{STOP}( STOP|det, L, non_adj) +1
 332  f_{STOP}(-STOP|det, L, non_adj) +0
 333  f_{STOP}( STOP|det, R, adj) +0
 334  f_{STOP}(-STOP|det, R, adj) +1
 335
 336  f_{STOP}( STOP|nn, L, adj) +0
 337  f_{STOP}(-STOP|nn, L, adj) +1
 338  f_{STOP}( STOP|nn, L, non_adj) +1  # since there's at least one to the left
 339  f_{STOP}(-STOP|nn, L, non_adj) +0
 340 </pre>
 341 <ul>
 342 <li id="sec-7.4.1.1"><span class="todo">TODO</span> tweak<br/>
 343 <pre>
 344             f[head,  'STOP', 'LN'] += (i_h &lt;= 1)     # first two words
 345             f[head, '-STOP', 'LN'] += (not i_h &lt;= 1)
 346             f[head,  'STOP', 'LA'] += (i_h == 0)     # very first word
 347             f[head, '-STOP', 'LA'] += (not i_h == 0)
 348             f[head,  'STOP', 'RN'] += (i_h &gt;= n - 2) # last two words
 349             f[head, '-STOP', 'RN'] += (not i_h &gt;= n - 2)
 350             f[head,  'STOP', 'RA'] += (i_h == n - 1) # very last word
 351             f[head, '-STOP', 'RA'] += (not i_h == n - 1)
 352 </pre>
 353 vs
 354 <pre>
 355             # this one requires some additional rewriting since it
 356             # introduces divisions by zero
 357             f[head,  'STOP', 'LN'] += (i_h == 1)     # second word
 358             f[head, '-STOP', 'LN'] += (not i_h &lt;= 1) # not first two
 359             f[head,  'STOP', 'LA'] += (i_h == 0)     # first word
 360             f[head, '-STOP', 'LA'] += (not i_h == 0) # not first
 361             f[head,  'STOP', 'RN'] += (i_h == n - 2)     # second-to-last
 362             f[head, '-STOP', 'RN'] += (not i_h &gt;= n - 2) # not last two
 363             f[head,  'STOP', 'RA'] += (i_h == n - 1)     # last word
 364             f[head, '-STOP', 'RA'] += (not i_h == n - 1) # not last
 365 </pre>
 366 vs
 367 <pre>
 368             f[head,  'STOP', 'LN'] += (i_h == 1)     # second word
 369             f[head, '-STOP', 'LN'] += (not i_h == 1) # not second
 370             f[head,  'STOP', 'LA'] += (i_h == 0)     # first word
 371             f[head, '-STOP', 'LA'] += (not i_h == 0) # not first
 372             f[head,  'STOP', 'RN'] += (i_h == n - 2)     # second-to-last
 373             f[head, '-STOP', 'RN'] += (not i_h == n - 2) # not second-to-last
 374             f[head,  'STOP', 'RA'] += (i_h == n - 1)     # last word
 375             f[head, '-STOP', 'RA'] += (not i_h == n - 1) # not last
 376 </pre>
 377 vs
 378 "all words take the same number of arguments" interpreted as
 379 <pre>
 380 for all heads:
 381     p_STOP(head, 'STOP', 'LN') = 0.3
 382     p_STOP(head, 'STOP', 'LA') = 0.5
 383     p_STOP(head, 'STOP', 'RN') = 0.4
 384     p_STOP(head, 'STOP', 'RA') = 0.7
 385 </pre>
 386 (which we easily may tweak in init_zeros())
 387 </li>
 388 </ul>
 389 </li>
 390 <li id="sec-7.4.2">P_CHOOSE<br/>
 391 Go through the corpus, counting distances between heads and
 392 arguments. In [det nn vb nn], we give
 393 <ul>
 394 <li>
 395 f<sub>CHOOSE</sub>(nn|det, R) +1/1 + C
 396 </li>
 397 <li>
 398 f<sub>CHOOSE</sub>(vb|det, R) +1/2 + C
 399 </li>
 400 <li>
 401 f<sub>CHOOSE</sub>(nn|det, R) +1/3 + C
 402 <ul>
 403 <li>
 404 If this were the full corpus, P<sub>CHOOSE</sub>(nn|det, R) would have
 405 (1+1/3+2C) / sum_a f<sub>CHOOSE</sub>(a|det, R)
 406
 407 </li>
 408 </ul>
 409 </li>
 410 </ul>
 411
 412 <p>The ROOT gets "each argument with equal probability", so in a sentence
 413 of three words, 1/3 for each (in [nn vb nn], 'nn' gets 2/3). Basically
 414 just a frequency count of the corpus&hellip;
 415 </p></li>
 416 </ul>
 417 </li>
 418 </ul>
 419 </div>
 420
 421 </div>
 422
 423 <div id="outline-container-8" class="outline-2">
 424 <h2 id="sec-8">8 [#C] Deferred</h2>
 425 <div id="text-8">
 426
 427 <p><a href="http://wiki.python.org/moin/PythonSpeed/PerformanceTips">http://wiki.python.org/moin/PythonSpeed/PerformanceTips</a> Eg., use
 428 map/reduce/filter/[i for i in [i's]]/(i for i in [i's]) instead of
 429 for-loops; use local variables for globals (global variables or or
 430 functions), etc.
 431 </p>
 432 <ul>
 433 <li id="sec-8.1"><span class="todo">TODO</span> when reestimating P_STOP etc, remove rules with p &lt; epsilon   &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
 434 </li>
 435 <li id="sec-8.2"><span class="todo">TODO</span> inner_dmv, short ranges and impossible attachment             &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
 436 If s-t &lt;= 2, there can be only one attachment below, so don't recurse
 437 with both Lattach=True and Rattach=True.
 438
 439 <p>
 440 If s-t &lt;= 1, there can be no attachment below, so only recurse with
 441 Lattach=False, Rattach=False.
 442 </p>
 443 <p>
 444 Put this in the loop under rewrite rules (could also do it in the STOP
 445 section, but that would only have an effect on very short sentences).
 446 </p></li>
 447 <li id="sec-8.3"><span class="todo">TODO</span> clean up the module files                                     &nbsp;&nbsp;&nbsp;<span class="tag">PRETTIER</span><br/>
 448 Is there better way to divide dmv and harmonic? There's a two-way
 449 dependency between the modules. Guess there could be a third file that
 450 imports both the initialization and the actual EM stuff, while a file
 451 containing constants and classes could be imported by all others:
 452 <pre>
 453  dmv.py imports dmv_EM.py imports dmv_classes.py
 454  dmv.py imports dmv_inits.py imports dmv_classes.py
 455 </pre>
 456
 457 </li>
 458 <li id="sec-8.4"><span class="todo">TOGROK</span> Some (tagged) sentences are bound to come twice             &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
 459 Eg, first sort and count, so that the corpus
 460 [['nn','vbd','det','nn'],
 461 ['vbd','nn','det','nn'],
 462 ['nn','vbd','det','nn']]
 463 becomes
 464 [(['nn','vbd','det','nn'],2),
 465 (['vbd','nn','det','nn'],1)]
 466 and then in each loop through sentences, make sure we handle the
 467 frequency correctly.
 468
 469 <p>
 470 Is there much to gain here?
 471 </p>
 472 </li>
 473 <li id="sec-8.5"><span class="todo">TOGROK</span> tags as numbers or tags as strings?                         &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
 474 Need to clean up the representation.
 475
 476 <p>
 477 Stick with tag-strings in initialization then switch to numbers for
 478 IO-algorithm perhaps? Can probably afford more string-matching in
 479 initialization..
 480 </p></li>
 481 <li id="sec-8.6"><span class="done">DONE</span> if loc_h == t, no need to try right-attachment rules &amp;v.v.    &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
 482 <span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-06-10 Tue 14:34</span><br/>
 483 (and if loc_h == s, no need to try left-attachment rules.)
 484
 485 <p>
 486 Modest speed increase (5%).
 487 </p></li>
 488 <li id="sec-8.7"><span class="done">DONE</span> io.debug parameters should not call functions                 &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
 489 <span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-06-10 Tue 12:26</span><br/>
 490 Exchanged all io.debug(str,'level') calls with statements of the form:
 491 <pre>
 492 if 'level' in io.DEBUG:
 493     print str
 494 </pre>
 495
 496 <p>
 497 and got an almost threefold speed increase on inner().
 498 </p></li>
 499 <li id="sec-8.8"><span class="done">DONE</span> inner_dmv() should disregard rules with heads not in sent     &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
 500 <span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-06-08 Sun 10:18</span><br/>
 501 If the sentence is "nn vbd det nn", we should not even look at rules
 502 where
 503 <pre>
 504  rule.head() not in "nn vbd det nn".split()
 505 </pre>
 506 This is ruled out by getting rules from g.rules(LHS, sent).
 507
 508 <p>
 509 Also, we optimize this further by saying we don't even recurse into
 510 attachment rules where
 511 <pre>
 512  rule.head() not in sent[ s :r+1]
 513  rule.head() not in sent[r+1:t+1]
 514 </pre>
 515 meaning, if we're looking at the span "vbd det", we only use
 516 attachment rules where both daughters are members of ['vbd','det']
 517 (although we don't (yet) care about removing rules that rewrite to the
 518 same tag if there are no duplicate tags in the span, etc., that would
 519 be a lot of trouble for little potential gain).
 520 </p></li>
 521 </ul>
 522 </div>
 523
 524 </div>
 525
 526 <div id="outline-container-9" class="outline-2">
 527 <h2 id="sec-9">9 Adjacency and combining it with inner()</h2>
 528 <div id="text-9">
 529
 530 <p>Each DMV_Rule now has both a probN and a probA, for
 531 adjacencies. inner() needs the correct one in each case.
 532 </p>
 533 <p>
 534 Adjacency gives a problem with duplicate words/tags, eg. in the
 535 sentence "a a b". If this has the dependency structure b-&gt;a<sub>0</sub>-&gt;a<sub>1</sub>,
 536 then b is non-adjacent to a<sub>0</sub> and should use probN (for the LRStop and
 537 the attachment of a<sub>0</sub>), while the other rules should all use
 538 probA. But within the e(0,2,b) we can't just say "oh, a has index 0
 539 so it's not adjacent to 2", since there's also an a at index 1, and
 540 there's also a dependency structure b-&gt;a<sub>1</sub>-&gt;a<sub>0</sub> for that. We want
 541 both. And in possibly much more complex versions.
 542 </p>
 543 <p>
 544 So now we call inner() for each location of a head, and on each
 545 terminal, loc_h must equal s (and t). In the recursive attachment
 546 calls, we use the locations (sentence indices) of words to the left or
 547 right of the head in calls to inner(). loc_h lets us check whether we
 548 need probN or probA.
 549 </p>
 550 <p>
 551 In each inner() call, loc_h is the location of the root of this
 552 dependency structure.
 553 </p><ul>
 554 <li id="sec-9.1">Possible alternate type of adjacency<br/>
 555 K&amp;M's adjacency is just whether or not an argument has been generated
 556 in the current direction yet. One could also make a stronger type of
 557 adjacency, where h and a are not adjacent if b is in between, eg. with
 558 the sentence "a b h" and the structure ((h-&gt;a), (a-&gt;b)), h is
 559 K&amp;M-adjacent to a, but not next to a, since b is in between. It's easy
 560 to check this type of adjacency in inner(), but it needs new rules for
 561 P_STOP reestimation.
 562 </li>
 563 </ul>
 564 </div>
 565
 566 </div>
 567
 568 <div id="outline-container-10" class="outline-2">
 569 <h2 id="sec-10">10 Expectation Maximation in IO/DMV-terms</h2>
 570 <div id="text-10">
 571
 572 <p>inner(s,t,LHS) calculates the expected number of trees headed by LHS
 573 from s to t (sentence positions). This uses the P_STOP and P_CHOOSE
 574 values, which have been conveniently distributed into CNF rules as
 575 probN and probA (non-adjacent and adjacent probabilites).
 576 </p>
 577 <p>
 578 When re-estimating, we use the expected values from inner() to get new
 579 values for P_STOP and P_CHOOSE. When we've re-estimated for the entire
 580 corpus, we distribute P_STOP and P_CHOOSE into the CNF rules again, so
 581 that in the next round we use new probN and probA to find
 582 inner-probabilites.
 583 </p>
 584 <p>
 585 The distribution of P_STOP and P_CHOOSE into CNF rules also happens in
 586 init_normalize() (here along with the creation of P_STOP and
 587 P_CHOOSE); P_STOP is used to create CNF rules where one branch of the
 588 rule is STOP, P_CHOOSE is used to create rules of the form
 589 <pre>
 590  h  -&gt; h  _a_
 591  h_ -&gt; h_ _a_
 592 </pre>
 593 </p>
 594 <p>
 595 Since "adjacency" is not captured in regular CNF rules, we need two
 596 probabilites for each rule, and inner() has to know when to use which.
 597 </p>
 598 <ul>
 599 <li id="sec-10.1"><span class="todo">TODO</span> Corpus access<br/>
 600 </li>
 601 <li id="sec-10.2"><span class="todo">TOGROK</span> sentences or rules as the "outer loop"?                     &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
 602 In regard to the E/M-step, finding P<sub>STOP</sub>, P<sub>CHOOSE</sub>.
 603
 604
 605 </li>
 606 </ul>
 607 </div>
 608
 609 </div>
 610
 611 <div id="outline-container-11" class="outline-2">
 612 <h2 id="sec-11">11 Python-stuff</h2>
 613 <div id="text-11">
 614
 615 <p>Make those debug statements steal a bit less attention in emacs:
 616 <pre>
 617 (font-lock-add-keywords
 618  'python-mode                   ; not really regexp, a bit slow
 619  '(("^\\( *\\)\\(\\if +'.+' +in +io.DEBUG. *\\(
 620 \\1    .+$\\)+\\)" 2 font-lock-preprocessor-face t)))
 621 (font-lock-add-keywords
 622  'python-mode
 623  '(("\\&lt;\\(\\(io\\.\\)?debug(.+)\\)" 1 font-lock-preprocessor-face t)))
 624 </pre>
 625 </p>
 626 <ul>
 627 <li>
 628 <a href="src/pseudo.py">pseudo.py</a>
 629 </li>
 630 <li>
 631 <a href="http://nltk.org/doc/en/structured-programming.html">http://nltk.org/doc/en/structured-programming.html</a> recursive dynamic
 632 </li>
 633 <li>
 634 <a href="http://nltk.org/doc/en/advanced-parsing.html">http://nltk.org/doc/en/advanced-parsing.html</a>
 635 </li>
 636 <li>
 637 <a href="http://jaynes.colorado.edu/PythonIdioms.html">http://jaynes.colorado.edu/PythonIdioms.html</a>
 638
 639
 640
 641 </li>
 642 </ul>
 643 </div>
 644
 645 </div>
 646
 647 <div id="outline-container-12" class="outline-2">
 648 <h2 id="sec-12">12 Git</h2>
 649 <div id="text-12">
 650
 651 <p>Repository web page: <a href="http://repo.or.cz/w/dmvccm.git">http://repo.or.cz/w/dmvccm.git</a>
 652 </p>
 653 <p>
 654 Setting up a new project:
 655 <pre>
 656  git init
 657  git add .
 658  git commit -m "first release"
 659 </pre>
 660 </p>
 661 <p>
 662 Later on: (<code>-a</code> does <code>git rm</code> and <code>git add</code> automatically)
 663 <pre>
 664  git init
 665  git commit -a -m "some subsequent release"
 666 </pre>
 667 </p>
 668 <p>
 669 Then push stuff up to the remote server:
 670 <pre>
 671  git push git+ssh://username@repo.or.cz/srv/git/dmvccm.git master
 672 </pre>
 673 </p>
 674 <p>
 675 (<code>eval `ssh-agent`</code> and <code>ssh-add</code> to avoid having to type in keyphrase all
 676 the time)
 677 </p>
 678 <p>
 679 Make a copy of the (remote) master branch:
 680 <pre>
 681  git clone git://repo.or.cz/dmvccm.git
 682 </pre>
 683 </p>
 684 <p>
 685 Make and name a new branch in this folder
 686 <pre>
 687  git checkout -b mybranch
 688 </pre>
 689 </p>
 690 <p>
 691 To save changes in <code>mybranch</code>:
 692 <pre>
 693  git commit -a
 694 </pre>
 695 </p>
 696 <p>
 697 Go back to the master branch (uncommitted changes from <code>mybranch</code> are
 698 carried over):
 699 <pre>
 700  git checkout master
 701 </pre>
 702 </p>
 703 <p>
 704 Try out:
 705 <pre>
 706  git add --interactive
 707 </pre>
 708 </p>
 709 <p>
 710 Good tutorial:
 711 <a href="http://www-cs-students.stanford.edu/~blynn//gitmagic/">http://www-cs-students.stanford.edu/~blynn//gitmagic/</a>
 712 </p></div>
 713 </div>
 714 <div id="postamble"><p class="author"> Author: Kevin Brubeck Unhammer
 715 <a href="mailto:K.BrubeckUnhammer at student uva nl ">&lt;K.BrubeckUnhammer at student uva nl &gt;</a>
 716 </p>
 717 <p class="date"> Date: 2008/06/11 16:40:21</p>
 718 </div><p class="postamble">Skrive vha. emacs + <a href='http://orgmode.org/'>org-mode</a></p></body>
 719 </html>