DMVCCM.html

   1 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
   2                "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
   3 <html xmlns="http://www.w3.org/1999/xhtml"
   4 lang="en" xml:lang="en">
   5 <head>
   6 <title>DMV/CCM &ndash; todo-list / progress</title>
   7 <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
   8 <meta name="generator" content="Org-mode"/>
   9 <meta name="generated" content="2008/06/15 17:59:26"/>
  10 <meta name="author" content="Kevin Brubeck Unhammer"/>
  11 <link rel="stylesheet" type="text/css" href="http://www.student.uib.no/~kun041/org.css">
  12 </head><body>
  13 <h1 class="title">DMV/CCM &ndash; todo-list / progress</h1>
  14
  15
  16 <div id="table-of-contents">
  17 <h2>Table of Contents</h2>
  18 <div id="text-table-of-contents">
  19 <ul>
  20 <li><a href="#sec-1">1 dmvccm report and project</a></li>
  21 <li><a href="#sec-2">2 outer probabilities</a></li>
  22 <li><a href="#sec-3">3 [#B] P_STOP and P_CHOOSE for IO/EM (reestimation)</a></li>
  23 <li><a href="#sec-4">4 Find Most Probable Parse of given test sentence</a></li>
  24 <li><a href="#sec-5">5 Combine CCM with DMV</a></li>
  25 <li><a href="#sec-6">6 Get a dependency parsed version of WSJ to test with</a></li>
  26 <li><a href="#sec-7">7 Initialization   </a></li>
  27 <li><a href="#sec-8">8 [#C] Deferred</a></li>
  28 <li><a href="#sec-9">9 Adjacency and combining it with the inside-outside algorithm</a></li>
  29 <li><a href="#sec-10">10 Expectation Maximation in IO/DMV-terms</a></li>
  30 <li><a href="#sec-11">11 Python-stuff</a></li>
  31 <li><a href="#sec-12">12 Git</a></li>
  32 </ul>
  33 </div>
  34 </div>
  35
  36 <div id="outline-container-1" class="outline-2">
  37 <h2 id="sec-1">1 dmvccm report and project</h2>
  38 <div id="text-1">
  39
  40 <p><span class="timestamp-kwd">DEADLINE: </span> <span class="timestamp">2008-06-30 Mon</span><br/>
  41 </p><ul>
  42 <li>
  43 <a href="src/dmv.py">dmv.py</a>
  44 </li>
  45 <li>
  46 <a href="src/io.py">io.py</a>
  47 </li>
  48 <li>
  49 <a href="src/harmonic.py">harmonic.py</a>
  50
  51 </li>
  52 </ul>
  53
  54 <p>(But absolute, extended, really-quite-dead-now deadline: August 31&hellip;)
  55 </p></div>
  56
  57 </div>
  58
  59 <div id="outline-container-2" class="outline-2">
  60 <h2 id="sec-2">2 <span class="done">DONE</span> outer probabilities</h2>
  61 <div id="text-2">
  62
  63 <p><span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-06-12 Thu 11:11</span><br/>
  64 <a name="outer">&nbsp;</a>
  65 There are 6 different configurations here, based on what the above
  66 (mother) node is, and what the Node for which we're computing is.
  67 Here <b>r</b> is not between <b>s</b> and <b>t</b> as in inner(), but an <i>outer</i> index. <b>loc_N</b>
  68 is the location of the Node head in the sentence, <b>loc_m</b> for the head
  69 of the mother of Node.
  70 </p>
  71 <ul>
  72 <li>
  73 mother is a RIGHT-stop:
  74 <ul>
  75 <li>
  76 outer(<b>s, t</b>, mom.LHS, <b>loc_N</b>), no inner-call
  77 </li>
  78 <li>
  79 adjacent iff <b>t</b> == <b>loc_m</b>
  80 </li>
  81 </ul>
  82 </li>
  83 <li>
  84 mother is a  LEFT-stop:
  85 <ul>
  86 <li>
  87 outer(<b>s, t</b>, mom.LHS, <b>loc_N</b>), no inner-call
  88 </li>
  89 <li>
  90 adjacent iff <b>s</b> == <b>loc_m</b>
  91
  92 </li>
  93 </ul>
  94 </li>
  95 <li>
  96 Node is on the LEFT branch (mom.L == Node)
  97 <ul>
  98 <li>
  99 and mother is a LEFT attachment:
 100 <ul>
 101 <li>
 102 <b>loc_N</b> will be in the LEFT branch, can be anything here.
 103 </li>
 104 <li>
 105 In the RIGHT, non-attached, branch we find inner(<b>t+1, r</b>, mom.R,
 106 <b>loc_m</b>) for all possible <b>loc_m</b> in the right part of the sentence.
 107 </li>
 108 <li>
 109 outer(<b>s, r</b>, mom.LHS, <b>loc_m</b>).
 110 </li>
 111 <li>
 112 adjacent iff <b>t+1</b> == <b>loc_m</b>
 113 </li>
 114 </ul>
 115 </li>
 116 <li>
 117 and mother is a RIGHT attachment:
 118 <ul>
 119 <li>
 120 <b>loc_m</b> = <b>loc_N</b>.
 121 </li>
 122 <li>
 123 In the RIGHT, attached, branch we find inner(<b>t+1, r</b>, mom.R, <b>loc_R</b>) for
 124 all possible <b>loc_R</b> in the right part of the sentence.
 125 </li>
 126 <li>
 127 outer(<b>s, r</b>, mom.LHS, <b>loc_N</b>).
 128 </li>
 129 <li>
 130 adjacent iff <b>t</b> == <b>loc_m</b>
 131
 132 </li>
 133 </ul>
 134 </li>
 135 </ul>
 136 </li>
 137 <li>
 138 Node is on the RIGHT branch (mom.R == Node)
 139 <ul>
 140 <li>
 141 and mother is a LEFT attachment:
 142 <ul>
 143 <li>
 144 <b>loc_m</b> = <b>loc_N</b>.
 145 </li>
 146 <li>
 147 In the LEFT, attached, branch we find inner(<b>r, s-1</b>, mom.L, <b>loc_L</b>) for
 148 all possible <b>loc_L</b> in the left part of the sentence.
 149 </li>
 150 <li>
 151 outer(<b>r, t</b>, mom.LHS, <b>loc_m</b>).
 152 </li>
 153 <li>
 154 adjacent iff <b>s</b> == <b>loc_m</b>
 155 </li>
 156 </ul>
 157 </li>
 158 <li>
 159 and mother is a RIGHT attachment:
 160 <ul>
 161 <li>
 162 <b>loc_N</b> will be in the RIGHT branch, can be anything here.
 163 </li>
 164 <li>
 165 In the LEFT, non-attached, branch we find inner(<b>r, s-1</b>, mom.L, <b>loc_m</b>) for
 166 all possible <b>loc_m</b> in the left part of the sentence.
 167 </li>
 168 <li>
 169 outer(<b>r, t</b>, mom.LHS, <b>loc_N</b>).
 170 </li>
 171 <li>
 172 adjacent iff <b>s-1</b> == <b>loc_m</b>
 173
 174 </li>
 175 </ul>
 176 </li>
 177 </ul>
 178 </li>
 179 </ul>
 180
 181 <p><img src="outer_attachments.jpg"/>
 182 </p>
 183 <p>
 184 <pre>
 185  in notes:   in code (constants):    in Klein thesis:
 186 -------------------------------------------------------------------
 187  _h_         SEAL                    bar over h
 188   h_         RGO_L                   right-under-left-arrow over h
 189   h          GO_R                    right-arrow over h
 190
 191              LGO_R                   left-under-right-arrow over h
 192              GO_L                    left-arrow over h
 193 </pre>
 194 </p>
 195 <p>
 196 Also, unlike in <a href="http://bibsonomy.org/bibtex/2b9f6798bb092697da7042ca3f5dee795">Lari & Young</a>, non-ROOT ('S') symbols may cover the
 197 whole sentence, but ROOT may <i>only</i> appear if it covers the whole
 198 sentence.
 199 </p>
 200 </div>
 201
 202 </div>
 203
 204 <div id="outline-container-3" class="outline-2">
 205 <h2 id="sec-3">3 <span class="todo">TODO</span> [#B] P_STOP and P_CHOOSE for IO/EM (reestimation)</h2>
 206 <div id="text-3">
 207
 208 <p><a href="src/dmv.py">dmv-P_STOP</a>
 209 Remember: The P<sub>STOP</sub> formula is upside-down (left-to-right also).
 210 (In the article..not the <a href="http://www.eecs.berkeley.edu/~klein/papers/klein_thesis.pdf">thesis</a>)
 211 </p>
 212 <ul>
 213 <li id="sec-3.1">P_STOP formulas for various dir and adj:<br/>
 214 Assuming this:
 215
 216 <table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
 217 <col align="left"></col><col align="left"></col>
 218 <tbody>
 219 <tr><td>P<sub>STOP</sub>(STOP : h,L,non_adj) =</td><td>&sum;<sub>corpus</sub> &sum;<sub>s&lt;loc(h)</sub> &sum;<sub>t&gt;=loc(h)</sub> c(s,t,_h_, &hellip;)</td></tr>
 220 <tr><td></td><td>&sum;<sub>corpus</sub> &sum;<sub>s&lt;loc(h)</sub> &sum;<sub>t&gt;=loc(h)</sub> c(s,t,h_, &hellip;)</td></tr>
 221 <tr><td></td><td></td></tr>
 222 <tr><td>P<sub>STOP</sub>(STOP : h,L,adj) =</td><td>&sum;<sub>corpus</sub> &sum;<sub>s=loc(h)</sub> &sum;<sub>t&gt;=loc(h)</sub> c(s,t,_h_, &hellip;)</td></tr>
 223 <tr><td></td><td>&sum;<sub>corpus</sub> &sum;<sub>s=loc(h)</sub> &sum;<sub>t&gt;=loc(h)</sub> c(s,t,h_, &hellip;)</td></tr>
 224 <tr><td></td><td></td></tr>
 225 <tr><td>P<sub>STOP</sub>(STOP : h,R,non_adj) =</td><td>&sum;<sub>corpus</sub> &sum;<sub>s=loc(h)</sub> &sum;<sub>t&gt;loc(h)</sub> c(s,t,h_, &hellip;)</td></tr>
 226 <tr><td></td><td>&sum;<sub>corpus</sub> &sum;<sub>s=loc(h)</sub> &sum;<sub>t&gt;loc(h)</sub> c(s,t,h, &hellip;)</td></tr>
 227 <tr><td></td><td></td></tr>
 228 <tr><td>P<sub>STOP</sub>(STOP : h,R,adj) =</td><td>&sum;<sub>corpus</sub> &sum;<sub>s=loc(h)</sub> &sum;<sub>t=loc(h)</sub> c(s,t,h_, &hellip;)</td></tr>
 229 <tr><td></td><td>&sum;<sub>corpus</sub> &sum;<sub>s=loc(h)</sub> &sum;<sub>t=loc(h)</sub> c(s,t,h, &hellip;)</td></tr>
 230 </tbody>
 231 </table>
 232
 233
 234 <p>
 235 (And P<sub>STOP</sub>(-STOP|&hellip;) = 1 - P<sub>STOP</sub>(STOP|&hellip;) )
 236 </p></li>
 237 <li id="sec-3.2">P_CHOOSE formula:<br/>
 238 Assuming this:
 239
 240 <table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
 241 <col align="left"></col><col align="left"></col>
 242 <tbody>
 243 <tr><td>P<sub>CHOOSE</sub>(a : h,R) =</td><td>&sum;<sub>corpus</sub> &sum;<sub>s=loc(h)</sub> &sum;<sub>t &gt; loc(h)</sub> &sum;<sub>loc(h) &lt; r &lt;= t</sub> c(s, r-1, h_, &hellip;) * c(r,t,_a_,&hellip;)</td></tr>
 244 <tr><td></td><td>&sum;<sub>corpus</sub> &sum;<sub>s=loc(h)</sub> &sum;<sub>t &gt; loc(h)</sub> c(s,t,h, &hellip;)</td></tr>
 245 <tr><td></td><td></td></tr>
 246 <tr><td>P<sub>CHOOSE</sub>(a : h,L) =</td><td>&sum;<sub>corpus</sub> &sum;<sub>s&lt;loc(h)</sub> &sum;<sub>t&gt;=loc(h)</sub> &sum;<sub>r&lt;loc(h)</sub> c(s,r,_a_, &hellip;) * c(r+1, t, h_, &hellip;)</td></tr>
 247 <tr><td></td><td>&sum;<sub>corpus</sub> &sum;<sub>s&lt;loc(h)</sub> &sum;<sub>t&gt;=loc(h)</sub> c(s,t,h_,&hellip;)</td></tr>
 248 </tbody>
 249 </table>
 250
 251 t &gt;= loc(h) since there are many possibilites for right-attachments
 252 below, and each of them alone gives a lower probability (through
 253 multiplication) to the upper tree (so add them all)
 254
 255 <p>
 256 The reason we have to check <i>both</i> children of the attachments is that we
 257 have to make sure they are contiguous (otherwise we would have no way
 258 of ruling out eg. h_-&gt;_b_,_b_-&gt;b_-&gt;_a_, where h_ covers <b>s</b> and <b>t</b>,_b_ is
 259 from <b>s</b> to <b>x&lt;r</b> and _ a_ is from <b>s</b> to <b>r</b>).
 260 </p></li>
 261 <li id="sec-3.3"><span class="done">DONE</span> [#A] How do we only count from completed trees?<br/>
 262 <span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-06-13 Fri 11:40</span><br/>
 263 Use c(s,t,Node); inner * outer / P_sent
 264
 265 </li>
 266 <li id="sec-3.4"><span class="done">DONE</span> [#A] c(s,t,Node)<br/>
 267 <span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-06-13 Fri 11:38</span><br/>
 268 = inner * outer / P_sent
 269
 270 <p>
 271 implemented as inner * outer / inner_sent
 272 </p>
 273 </li>
 274 </ul>
 275 </div>
 276
 277 </div>
 278
 279 <div id="outline-container-4" class="outline-2">
 280 <h2 id="sec-4">4 <span class="todo">TOGROK</span> Find Most Probable Parse of given test sentence</h2>
 281 <div id="text-4">
 282
 283 <p>We could probably do it pretty easily from the chart:
 284 </p><ul>
 285 <li>
 286 for all loc_h, select the (0,len(sent),ROOT,loc_h) that has the highest p
 287 </li>
 288 <li>
 289 for stops, just keep going
 290 </li>
 291 <li>
 292 for rewrites, for loc_dep select the one that has the highest p
 293 </li>
 294 </ul>
 295 </div>
 296
 297 </div>
 298
 299 <div id="outline-container-5" class="outline-2">
 300 <h2 id="sec-5">5 <span class="todo">TOGROK</span> Combine CCM with DMV</h2>
 301 <div id="text-5">
 302
 303 <p>Page 108 (pdf: 124) of <a href="http://www.eecs.berkeley.edu/~klein/papers/klein_thesis.pdf">Klein's thesis</a> gives info about this.
 304 </p><ul>
 305 <li id="sec-5.1"><span class="todo">TODO</span> Make inner() and outer() also allow left-first attachment<br/>
 306 Using P<sub>ORDER</sub>(<i>left-first</i> | w) etc.
 307 </li>
 308 </ul>
 309 </div>
 310
 311 </div>
 312
 313 <div id="outline-container-6" class="outline-2">
 314 <h2 id="sec-6">6 <span class="todo">TODO</span> Get a dependency parsed version of WSJ to test with</h2>
 315 <div id="text-6">
 316
 317 </div>
 318
 319 </div>
 320
 321 <div id="outline-container-7" class="outline-2">
 322 <h2 id="sec-7">7 Initialization   </h2>
 323 <div id="text-7">
 324
 325 <p><a href="/Users/kiwibird/Documents/Skole/V08/Probability/dmvccm/src/dmv.py">dmv-inits</a>
 326 </p>
 327 <p>
 328 We go through the corpus, since the probabilities are based on how far
 329 away in the sentence arguments are from their heads.
 330 </p><ul>
 331 <li id="sec-7.1"><span class="todo">TOGROK</span> CCM Initialization    <br/>
 332 P<sub>SPLIT</sub> used here&hellip; how, again?
 333 </li>
 334 <li id="sec-7.2"><span class="done">DONE</span> Separate initialization to another file?                      &nbsp;&nbsp;&nbsp;<span class="tag">PRETTIER</span><br/>
 335 <span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-06-08 Sun 12:51</span><br/>
 336 <a href="src/harmonic.py">harmonic.py</a>
 337 </li>
 338 <li id="sec-7.3"><span class="done">DONE</span> DMV Initialization probabilities<br/>
 339 (from initialization frequency)
 340 </li>
 341 <li id="sec-7.4"><span class="done">DONE</span> DMV Initialization frequencies    <br/>
 342 <span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-05-27 Tue 20:04</span><br/>
 343 <ul>
 344 <li id="sec-7.4.1">P_STOP    <br/>
 345 P<sub>STOP</sub> is not well defined by K&amp;M. One possible interpretation given
 346 the sentence [det nn vb nn] is
 347 <pre>
 348  f_{STOP}( STOP|det, L, adj) +1
 349  f_{STOP}(-STOP|det, L, adj) +0
 350  f_{STOP}( STOP|det, L, non_adj) +1
 351  f_{STOP}(-STOP|det, L, non_adj) +0
 352  f_{STOP}( STOP|det, R, adj) +0
 353  f_{STOP}(-STOP|det, R, adj) +1
 354
 355  f_{STOP}( STOP|nn, L, adj) +0
 356  f_{STOP}(-STOP|nn, L, adj) +1
 357  f_{STOP}( STOP|nn, L, non_adj) +1  # since there's at least one to the left
 358  f_{STOP}(-STOP|nn, L, non_adj) +0
 359 </pre>
 360 <ul>
 361 <li id="sec-7.4.1.1"><span class="todo">TODO</span> tweak<br/>
 362 <pre>
 363             f[head,  'STOP', 'LN'] += (i_h &lt;= 1)     # first two words
 364             f[head, '-STOP', 'LN'] += (not i_h &lt;= 1)
 365             f[head,  'STOP', 'LA'] += (i_h == 0)     # very first word
 366             f[head, '-STOP', 'LA'] += (not i_h == 0)
 367             f[head,  'STOP', 'RN'] += (i_h &gt;= n - 2) # last two words
 368             f[head, '-STOP', 'RN'] += (not i_h &gt;= n - 2)
 369             f[head,  'STOP', 'RA'] += (i_h == n - 1) # very last word
 370             f[head, '-STOP', 'RA'] += (not i_h == n - 1)
 371 </pre>
 372 vs
 373 <pre>
 374             # this one requires some additional rewriting since it
 375             # introduces divisions by zero
 376             f[head,  'STOP', 'LN'] += (i_h == 1)     # second word
 377             f[head, '-STOP', 'LN'] += (not i_h &lt;= 1) # not first two
 378             f[head,  'STOP', 'LA'] += (i_h == 0)     # first word
 379             f[head, '-STOP', 'LA'] += (not i_h == 0) # not first
 380             f[head,  'STOP', 'RN'] += (i_h == n - 2)     # second-to-last
 381             f[head, '-STOP', 'RN'] += (not i_h &gt;= n - 2) # not last two
 382             f[head,  'STOP', 'RA'] += (i_h == n - 1)     # last word
 383             f[head, '-STOP', 'RA'] += (not i_h == n - 1) # not last
 384 </pre>
 385 vs
 386 <pre>
 387             f[head,  'STOP', 'LN'] += (i_h == 1)     # second word
 388             f[head, '-STOP', 'LN'] += (not i_h == 1) # not second
 389             f[head,  'STOP', 'LA'] += (i_h == 0)     # first word
 390             f[head, '-STOP', 'LA'] += (not i_h == 0) # not first
 391             f[head,  'STOP', 'RN'] += (i_h == n - 2)     # second-to-last
 392             f[head, '-STOP', 'RN'] += (not i_h == n - 2) # not second-to-last
 393             f[head,  'STOP', 'RA'] += (i_h == n - 1)     # last word
 394             f[head, '-STOP', 'RA'] += (not i_h == n - 1) # not last
 395 </pre>
 396 vs
 397 "all words take the same number of arguments" interpreted as
 398 <pre>
 399 for all heads:
 400     p_STOP(head, 'STOP', 'LN') = 0.3
 401     p_STOP(head, 'STOP', 'LA') = 0.5
 402     p_STOP(head, 'STOP', 'RN') = 0.4
 403     p_STOP(head, 'STOP', 'RA') = 0.7
 404 </pre>
 405 (which we easily may tweak in init_zeros())
 406 </li>
 407 </ul>
 408 </li>
 409 <li id="sec-7.4.2">P_CHOOSE<br/>
 410 Go through the corpus, counting distances between heads and
 411 arguments. In [det nn vb nn], we give
 412 <ul>
 413 <li>
 414 f<sub>CHOOSE</sub>(nn|det, R) +1/1 + C
 415 </li>
 416 <li>
 417 f<sub>CHOOSE</sub>(vb|det, R) +1/2 + C
 418 </li>
 419 <li>
 420 f<sub>CHOOSE</sub>(nn|det, R) +1/3 + C
 421 <ul>
 422 <li>
 423 If this were the full corpus, P<sub>CHOOSE</sub>(nn|det, R) would have
 424 (1+1/3+2C) / sum_a f<sub>CHOOSE</sub>(a|det, R)
 425
 426 </li>
 427 </ul>
 428 </li>
 429 </ul>
 430
 431 <p>The ROOT gets "each argument with equal probability", so in a sentence
 432 of three words, 1/3 for each (in [nn vb nn], 'nn' gets 2/3). Basically
 433 just a frequency count of the corpus&hellip;
 434 </p>
 435 <p>
 436 In a sense there are no terminal probabilities, since an <i>h</i> can only
 437 rewrite to an 'h' anyway (it's just a check for whether, at this
 438 location in the sentence, we have the right POS-tag).
 439 </p></li>
 440 </ul>
 441 </li>
 442 </ul>
 443 </div>
 444
 445 </div>
 446
 447 <div id="outline-container-8" class="outline-2">
 448 <h2 id="sec-8">8 [#C] Deferred</h2>
 449 <div id="text-8">
 450
 451 <p><a href="http://wiki.python.org/moin/PythonSpeed/PerformanceTips">http://wiki.python.org/moin/PythonSpeed/PerformanceTips</a> Eg., use
 452 map/reduce/filter/[i for i in [i's]]/(i for i in [i's]) instead of
 453 for-loops; use local variables for globals (global variables or or
 454 functions), etc.
 455 </p>
 456 <ul>
 457 <li id="sec-8.1"><span class="todo">TODO</span> when reestimating P_STOP etc, remove rules with p &lt; epsilon   &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
 458 </li>
 459 <li id="sec-8.2"><span class="todo">TODO</span> inner_dmv, short ranges and impossible attachment             &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
 460 If s-t &lt;= 2, there can be only one attachment below, so don't recurse
 461 with both Lattach=True and Rattach=True.
 462
 463 <p>
 464 If s-t &lt;= 1, there can be no attachment below, so only recurse with
 465 Lattach=False, Rattach=False.
 466 </p>
 467 <p>
 468 Put this in the loop under rewrite rules (could also do it in the STOP
 469 section, but that would only have an effect on very short sentences).
 470 </p></li>
 471 <li id="sec-8.3"><span class="todo">TODO</span> clean up the module files                                     &nbsp;&nbsp;&nbsp;<span class="tag">PRETTIER</span><br/>
 472 Is there better way to divide dmv and harmonic? There's a two-way
 473 dependency between the modules. Guess there could be a third file that
 474 imports both the initialization and the actual EM stuff, while a file
 475 containing constants and classes could be imported by all others:
 476 <pre>
 477  dmv.py imports dmv_EM.py imports dmv_classes.py
 478  dmv.py imports dmv_inits.py imports dmv_classes.py
 479 </pre>
 480
 481 </li>
 482 <li id="sec-8.4"><span class="todo">TOGROK</span> Some (tagged) sentences are bound to come twice             &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
 483 Eg, first sort and count, so that the corpus
 484 [['nn','vbd','det','nn'],
 485 ['vbd','nn','det','nn'],
 486 ['nn','vbd','det','nn']]
 487 becomes
 488 [(['nn','vbd','det','nn'],2),
 489 (['vbd','nn','det','nn'],1)]
 490 and then in each loop through sentences, make sure we handle the
 491 frequency correctly.
 492
 493 <p>
 494 Is there much to gain here?
 495 </p>
 496 </li>
 497 <li id="sec-8.5"><span class="todo">TOGROK</span> tags as numbers or tags as strings?                         &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
 498 Need to clean up the representation.
 499
 500 <p>
 501 Stick with tag-strings in initialization then switch to numbers for
 502 IO-algorithm perhaps? Can probably afford more string-matching in
 503 initialization..
 504 </p></li>
 505 <li id="sec-8.6"><span class="done">DONE</span> if loc_h == t, no need to try right-attachment rules &amp;v.v.    &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
 506 <span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-06-10 Tue 14:34</span><br/>
 507 (and if loc_h == s, no need to try left-attachment rules.)
 508
 509 <p>
 510 Modest speed increase (5%).
 511 </p></li>
 512 <li id="sec-8.7"><span class="done">DONE</span> io.debug parameters should not call functions                 &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
 513 <span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-06-10 Tue 12:26</span><br/>
 514 Exchanged all io.debug(str,'level') calls with statements of the form:
 515 <pre>
 516 if 'level' in io.DEBUG:
 517     print str
 518 </pre>
 519
 520 <p>
 521 and got an almost threefold speed increase on inner().
 522 </p></li>
 523 <li id="sec-8.8"><span class="done">DONE</span> inner_dmv() should disregard rules with heads not in sent     &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
 524 <span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-06-08 Sun 10:18</span><br/>
 525 If the sentence is "nn vbd det nn", we should not even look at rules
 526 where
 527 <pre>
 528  rule.head() not in "nn vbd det nn".split()
 529 </pre>
 530 This is ruled out by getting rules from g.rules(LHS, sent).
 531
 532 <p>
 533 Also, we optimize this further by saying we don't even recurse into
 534 attachment rules where
 535 <pre>
 536  rule.head() not in sent[ s :r+1]
 537  rule.head() not in sent[r+1:t+1]
 538 </pre>
 539 meaning, if we're looking at the span "vbd det", we only use
 540 attachment rules where both daughters are members of ['vbd','det']
 541 (although we don't (yet) care about removing rules that rewrite to the
 542 same tag if there are no duplicate tags in the span, etc., that would
 543 be a lot of trouble for little potential gain).
 544 </p></li>
 545 </ul>
 546 </div>
 547
 548 </div>
 549
 550 <div id="outline-container-9" class="outline-2">
 551 <h2 id="sec-9">9 Adjacency and combining it with the inside-outside algorithm</h2>
 552 <div id="text-9">
 553
 554 <p>Each DMV_Rule has both a probN and a probA, for adjacencies. inner()
 555 and outer() needs the correct one in each case.
 556 </p>
 557 <p>
 558 In each inner() call, loc_h is the location of the head of this
 559 dependency structure. In each outer() call, it's the head of the <i>Node</i>,
 560 the structure we're looking outside of.
 561 </p>
 562 <p>
 563 We call inner() for each location of a head, and on each terminal,
 564 loc_h must equal s (and t). In the recursive attachment calls, we use
 565 the locations (sentence indices) of words to the left or right of the
 566 head in calls to inner(). <i>loc_h lets us check whether we need probN or probA</i>.
 567 </p><ul>
 568 <li id="sec-9.1">Possible alternate type of adjacency<br/>
 569 K&amp;M's adjacency is just whether or not an argument has been generated
 570 in the current direction yet. One could also make a stronger type of
 571 adjacency, where h and a are not adjacent if b is in between, eg. with
 572 the sentence "a b h" and the structure ((h-&gt;a), (a-&gt;b)), h is
 573 K&amp;M-adjacent to a, but not next to a, since b is in between. It's easy
 574 to check this type of adjacency in inner(), but it needs new rules for
 575 P_STOP reestimation.
 576 </li>
 577 </ul>
 578 </div>
 579
 580 </div>
 581
 582 <div id="outline-container-10" class="outline-2">
 583 <h2 id="sec-10">10 Expectation Maximation in IO/DMV-terms</h2>
 584 <div id="text-10">
 585
 586 <p>outer(s,t,Node) and inner(s,t,Node) calculates the expected number of
 587 trees (CNF-)headed by Node from <b>s</b> to <b>t</b> (sentence positions). This uses
 588 the P_STOP and P_CHOOSE values, which have been conveniently
 589 distributed into CNF rules as probN and probA (non-adjacent and
 590 adjacent probabilites).
 591 </p>
 592 <p>
 593 When re-estimating, we use the expected values from outer() and
 594 inner() to get new values for P_STOP and P_CHOOSE. When we've
 595 re-estimated for the entire corpus, we distribute P_STOP and P_CHOOSE
 596 into the CNF rules again, so that in the next round we use new probN
 597 and probA to find outer- and inner-probabilites.
 598 </p>
 599 <p>
 600 The distribution of P_STOP and P_CHOOSE into CNF rules also happens in
 601 init_normalize() (here along with the creation of P_STOP and
 602 P_CHOOSE); P_STOP is used to create CNF rules where one branch of the
 603 rule is STOP, P_CHOOSE is used to create rules of the form
 604 <pre>
 605  h  -&gt; h  _a_
 606  h_ -&gt; h_ _a_
 607 </pre>
 608 </p>
 609 <p>
 610 Since "adjacency" is not captured in regular CNF rules, we need two
 611 probabilites for each rule, and outer() and inner() have to know when
 612 to use which.
 613 </p>
 614
 615 </div>
 616
 617 </div>
 618
 619 <div id="outline-container-11" class="outline-2">
 620 <h2 id="sec-11">11 Python-stuff</h2>
 621 <div id="text-11">
 622
 623 <p>Make those debug statements steal a bit less attention in emacs:
 624 <pre>
 625 (font-lock-add-keywords
 626  'python-mode                   ; not really regexp, a bit slow
 627  '(("^\\( *\\)\\(\\if +'.+' +in +io.DEBUG. *\\(
 628 \\1    .+$\\)+\\)" 2 font-lock-preprocessor-face t)))
 629 (font-lock-add-keywords
 630  'python-mode
 631  '(("\\&lt;\\(\\(io\\.\\)?debug(.+)\\)" 1 font-lock-preprocessor-face t)))
 632 </pre>
 633 </p>
 634 <ul>
 635 <li>
 636 <a href="src/pseudo.py">pseudo.py</a>
 637 </li>
 638 <li>
 639 <a href="http://nltk.org/doc/en/structured-programming.html">http://nltk.org/doc/en/structured-programming.html</a> recursive dynamic
 640 </li>
 641 <li>
 642 <a href="http://nltk.org/doc/en/advanced-parsing.html">http://nltk.org/doc/en/advanced-parsing.html</a>
 643 </li>
 644 <li>
 645 <a href="http://jaynes.colorado.edu/PythonIdioms.html">http://jaynes.colorado.edu/PythonIdioms.html</a>
 646
 647
 648
 649 </li>
 650 </ul>
 651 </div>
 652
 653 </div>
 654
 655 <div id="outline-container-12" class="outline-2">
 656 <h2 id="sec-12">12 Git</h2>
 657 <div id="text-12">
 658
 659 <p>Repository web page: <a href="http://repo.or.cz/w/dmvccm.git">http://repo.or.cz/w/dmvccm.git</a>
 660 </p>
 661 <p>
 662 Setting up a new project:
 663 <pre>
 664  git init
 665  git add .
 666  git commit -m "first release"
 667 </pre>
 668 </p>
 669 <p>
 670 Later on: (<code>-a</code> does <code>git rm</code> and <code>git add</code> automatically)
 671 <pre>
 672  git init
 673  git commit -a -m "some subsequent release"
 674 </pre>
 675 </p>
 676 <p>
 677 Then push stuff up to the remote server:
 678 <pre>
 679  git push git+ssh://username@repo.or.cz/srv/git/dmvccm.git master
 680 </pre>
 681 </p>
 682 <p>
 683 (<code>eval `ssh-agent`</code> and <code>ssh-add</code> to avoid having to type in keyphrase all
 684 the time)
 685 </p>
 686 <p>
 687 Make a copy of the (remote) master branch:
 688 <pre>
 689  git clone git://repo.or.cz/dmvccm.git
 690 </pre>
 691 </p>
 692 <p>
 693 Make and name a new branch in this folder
 694 <pre>
 695  git checkout -b mybranch
 696 </pre>
 697 </p>
 698 <p>
 699 To save changes in <code>mybranch</code>:
 700 <pre>
 701  git commit -a
 702 </pre>
 703 </p>
 704 <p>
 705 Go back to the master branch (uncommitted changes from <code>mybranch</code> are
 706 carried over):
 707 <pre>
 708  git checkout master
 709 </pre>
 710 </p>
 711 <p>
 712 Try out:
 713 <pre>
 714  git add --interactive
 715 </pre>
 716 </p>
 717 <p>
 718 Good tutorial:
 719 <a href="http://www-cs-students.stanford.edu/~blynn//gitmagic/">http://www-cs-students.stanford.edu/~blynn//gitmagic/</a>
 720 </p></div>
 721 </div>
 722 <div id="postamble"><p class="author"> Author: Kevin Brubeck Unhammer
 723 <a href="mailto:K.BrubeckUnhammer at student uva nl ">&lt;K.BrubeckUnhammer at student uva nl &gt;</a>
 724 </p>
 725 <p class="date"> Date: 2008/06/15 17:59:26</p>
 726 </div><p class="postamble">Skrive vha. emacs + <a href='http://orgmode.org/'>org-mode</a></p></body>
 727 </html>