DMVCCM.html

   1 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
   2                "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
   3 <html xmlns="http://www.w3.org/1999/xhtml"
   4 lang="en" xml:lang="en">
   5 <head>
   6 <title>DMV/CCM &ndash; todo-list / progress</title>
   7 <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
   8 <meta name="generator" content="Org-mode"/>
   9 <meta name="generated" content="2008/06/06 11:55:47"/>
  10 <meta name="author" content="Kevin Brubeck Unhammer"/>
  11 <link rel="stylesheet" type="text/css" href="http://www.student.uib.no/~kun041/org.css"/>
  12 </head><body>
  13 <h1 class="title">DMV/CCM &ndash; todo-list / progress</h1>
  14
  15
  16 <div id="table-of-contents">
  17 <h2>Table of Contents</h2>
  18 <div id="text-table-of-contents">
  19 <ul>
  20 <li><a href="#sec-1">1 dmvccm report and project</a></li>
  21 <li><a href="#sec-2">2 Adjacency and combining it with inner()</a></li>
  22 <li><a href="#sec-3">3 [#A] P_STOP for IO/EM</a></li>
  23 <li><a href="#sec-4">4 P_CHOOSE for IO/EM</a></li>
  24 <li><a href="#sec-5">5 Initialization   </a></li>
  25 <li><a href="#sec-6">6 [#C] Deferred</a></li>
  26 <li><a href="#sec-7">7 Expectation Maximization in IO/DMV-terms</a></li>
  27 <li><a href="#sec-8">8 Python-stuff</a></li>
  28 <li><a href="#sec-9">9 Git</a></li>
  29 </ul>
  30 </div>
  31 </div>
  32
  33 <div id="outline-container-1" class="outline-2">
  34 <h2 id="sec-1">1 dmvccm report and project</h2>
  35 <div id="text-1">
  36
  37 <p><span class="timestamp-kwd">DEADLINE: </span> <span class="timestamp">2008-06-30 Mon</span><br/>
  38 But absolute, extended, really-quite-dead-now deadline: August
  39 31&hellip;
  40 </p><ul>
  41 <li>
  42 <a href="src/dmv.py">dmv.py</a>
  43 </li>
  44 <li>
  45 <a href="src/io.py">io.py</a>
  46 </li>
  47 <li>
  48 <a href="src/harmonic.py">harmonic.py</a>
  49 </li>
  50 </ul>
  51 </div>
  52
  53 </div>
  54
  55 <div id="outline-container-2" class="outline-2">
  56 <h2 id="sec-2">2 <span class="todo">TODO</span> Adjacency and combining it with inner()</h2>
  57 <div id="text-2">
  58
  59 <p>Each DMV_Rule now has both a probN and a probA, for
  60 adjacencies. inner() needs the correct one in each case.
  61 </p>
  62 <p>
  63 Adjacency causes a problem with duplicate words/tags, e.g. in the
  64 sentence "a a b". If this has the dependency structure b-&gt;a<sub>0</sub>-&gt;a<sub>1</sub>,
  65 then b is non-adjacent to a<sub>0</sub> and should use probN (for the LRStop and
  66 the attachment of a<sub>0</sub>), while the other rules should all use
  67 probA. But within e(0,2,b) we can't just say "oh, a has index 0
  68 so it's not adjacent to 2", since there's also an a at index 1, and
  69 there's also a dependency structure b-&gt;a<sub>1</sub>-&gt;a<sub>0</sub> for that. We want
  70 both, and longer sentences can make this much more complex.
  71 </p>
  72 <p>
  73 Ideas:
  74 </p><ul>
  75 <li>
  76 I first thought of decorating the individual words/tags in a
  77 sentence with their indices, and perhaps just duplicating the
  78 relevant rules (one for each index of the duplicate tags). But this
  79 gives an explosion in attachment rules (although a contained
  80 explosion, within the rules used in a sentence; but most sentences
  81 will have at least two NN's so it will be a problem).
  82 </li>
  83 <li>
  84 Then, I had a <i>brilliant</i> idea. Just let e(), the helper function of
  85 inner(), parametrize for an extra pair of boolean values for whether
  86 or not we've attached anything to the left or right yet ("yet"
  87 meaning "below"). So now, e() has a chart of the form [s, t, LHS,
  88 Lattach, Rattach], and of course e(s,t,LHS) is the sum of the four
  89 possible values for (Lattach,Rattach). This makes e() lots more
  90 complex and DMV-specific though, so it's been rewritten in
  91 inner_dmv() in dmv.py.
  92 </li>
  93 <li id="sec-2.1"><span class="todo">TODO</span> document this adjacency stuff better<br/>
  94 </li>
  95 <li id="sec-2.2"><span class="todo">TODO</span> test and debug my brilliant idea<br/>
  96 </li>
  97 <li id="sec-2.3"><span class="done">DONE</span> implement my brilliant idea.<br/>
  98 <span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-06-01 Sun 17:19</span><br/>
  99 <a href="src/dmv.py">e(sti) in dmv.py</a>
 100
 101 </li>
 102 <li id="sec-2.4"><span class="done">DONE</span> [#A] test inner() on sentences with duplicate words<br/>
 103 Works with e.g. the sentence "h h h"
 104
 105
 106 </li>
 107 </ul>
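<p>
A minimal sketch of the five-way chart described above, assuming the chart is
a plain dict keyed by (s, t, LHS, Lattach, Rattach). The name
<code>e_total</code> and the numbers in the toy chart are made up for
illustration; the actual code is inner_dmv() in dmv.py.
</p>

```python
from itertools import product


def e_total(chart, s, t, lhs):
    """Sum the four (Lattach, Rattach) variants stored for one span and LHS."""
    return sum(chart.get((s, t, lhs, l_att, r_att), 0.0)
               for l_att, r_att in product((False, True), repeat=2))


# Toy chart with made-up values, keyed by (s, t, LHS, Lattach, Rattach).
chart = {
    (0, 2, "b", False, False): 0.0,
    (0, 2, "b", True, False): 0.125,
    (0, 2, "b", False, True): 0.0,
    (0, 2, "b", True, True): 0.25,
}
total = e_total(chart, 0, 2, "b")
```

<p>
Callers that only care about e(s,t,LHS) can use the summed value, while the
recursion itself looks up the individual (Lattach, Rattach) entries.
</p>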
 108 </div>
 109
 110 </div>
 111
 112 <div id="outline-container-3" class="outline-2">
 113 <h2 id="sec-3">3 <span class="todo">TODO</span> [#A] P_STOP for IO/EM</h2>
 114 <div id="text-3">
 115
 116 <p><a href="src/dmv.py">dmv-P_STOP</a>
 117 Remember: The P<sub>STOP</sub> formula is upside-down (left-to-right also).
 118 (In the article, not the thesis.)
 119 </p>
 120 <p>
 121 Remember: Initialization makes some "short-cut" rules, these will also
 122 have to be updated along with the other P<sub>STOP</sub> updates:
 123 </p><ul>
 124 <li>
 125 b[(NOBAR, n<sub>h</sub>), 'h'] = 1.0       # always
 126 </li>
 127 <li>
 128 b[(RBAR, n<sub>h</sub>), 'h'] = h_.probA  # h_ is RBAR stop rule
 129 </li>
 130 <li>
 131 b[(LRBAR, n<sub>h</sub>), 'h'] = h_.probA * _h_.probA
 132
 133 </li>
 134 <li id="sec-3.1">How is the P_STOP formula different given other values for dir and adj?<br/>
 135 Assuming this:
 136 <ul>
 137 <li>
 138 P<sub>STOP</sub>(STOP|h,L,non_adj) = &sum;<sub>corpus</sub> &sum;<sub>s&lt;loc(h)</sub> &sum;<sub>t</sub>
 139 inner(s,t,(LRBAR,h)&hellip;) / &sum;<sub>corpus</sub> &sum;<sub>s&lt;loc(h)</sub> &sum;<sub>t</sub> inner(s,t,(RBAR,h)&hellip;)
 140 </li>
 141 <li>
 142 P<sub>STOP</sub>(STOP|h,L,adj) = &sum;<sub>corpus</sub> &sum;<sub>s=loc(h)</sub> &sum;<sub>t</sub>
 143 inner(s,t,(LRBAR,h)&hellip;) / &sum;<sub>corpus</sub> &sum;<sub>s=loc(h)</sub> &sum;<sub>t</sub> inner(s,t,(RBAR,h)&hellip;)
 144 </li>
 145 <li>
 146 P<sub>STOP</sub>(STOP|h,R,non_adj) = &sum;<sub>corpus</sub> &sum;<sub>s</sub> &sum;<sub>t&gt;loc(h)</sub>
 147 inner(s,t,(LRBAR,h)&hellip;) / &sum;<sub>corpus</sub> &sum;<sub>s</sub> &sum;<sub>t&gt;loc(h)</sub> inner(s,t,(RBAR,h)&hellip;)
 148 </li>
 149 <li>
 150 P<sub>STOP</sub>(STOP|h,R,adj) = &sum;<sub>corpus</sub> &sum;<sub>s</sub> &sum;<sub>t=loc(h)</sub>
 151 inner(s,t,(LRBAR,h)&hellip;) / &sum;<sub>corpus</sub> &sum;<sub>s</sub> &sum;<sub>t=loc(h)</sub> inner(s,t,(RBAR,h)&hellip;)
 152
 153
 154
 155 </li>
 156 </ul>
 157
 158 <p>(And P<sub>STOP</sub>(-STOP|&hellip;) = 1 - P<sub>STOP</sub>(STOP|&hellip;) )
 159 </p></li>
 160 </ul>
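<p>
The two left-hand formulas above could be computed roughly as follows. This is
only a sketch under several assumptions: <code>inner</code> is passed in as a
callable with a hypothetical signature (s, t, lhs, sent), every occurrence of
the head tag counts as a loc(h), and t is assumed to run from loc(h) to the
end of the sentence, which the formulas leave open.
<code>toy_inner</code> returns made-up values and only exists to make the
example runnable.
</p>

```python
def p_stop_left(corpus, head, inner, adjacent):
    """Estimate P_STOP(STOP | head, L, adj/non_adj) as a ratio of inner sums."""
    num = den = 0.0
    for sent in corpus:
        for loc, tag in enumerate(sent):
            if tag != head:
                continue
            # adjacent: s = loc(h); non-adjacent: every s to the left of loc(h)
            starts = [loc] if adjacent else range(loc)
            for s in starts:
                # Assumption: t ranges from loc(h) to the end of the sentence.
                for t in range(loc, len(sent)):
                    num += inner(s, t, ("LRBAR", head), sent)
                    den += inner(s, t, ("RBAR", head), sent)
    return num / den if den else 0.0


def toy_inner(s, t, lhs, sent):
    """Stand-in with made-up values, not the project's inner()."""
    return 1.0 if lhs[0] == "LRBAR" else 2.0


corpus = [["a", "h"]]
p_stop = p_stop_left(corpus, "h", toy_inner, adjacent=False)
p_not_stop = 1.0 - p_stop   # P_STOP(-STOP|...) = 1 - P_STOP(STOP|...)
```

<p>
The right-hand cases would mirror this with the roles of s and t swapped.
</p>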
 161 </div>
 162
 163 </div>
 164
 165 <div id="outline-container-4" class="outline-2">
 166 <h2 id="sec-4">4 <span class="todo">TODO</span> P_CHOOSE for IO/EM</h2>
 167 <div id="text-4">
 168
 169 <p>Write the formulas! Should be easy?
 170 </p></div>
 171
 172 </div>
 173
 174 <div id="outline-container-5" class="outline-2">
 175 <h2 id="sec-5">5 Initialization   </h2>
 176 <div id="text-5">
 177
 178 <p><a href="src/dmv.py">dmv-inits</a>
 179 </p>
 180 <p>
 181 We do have to go through the corpus, since the probabilities are based
 182 on how far away in the sentence arguments are from their heads.
 183 </p><ul>
 184 <li id="sec-5.1"><span class="todo">TODO</span> Separate initialization to another file?                      &nbsp;&nbsp;&nbsp;<span class="tag">PRETTIER</span><br/>
 185 (It's rather messy.)
 186 </li>
 187 <li id="sec-5.2"><span class="todo">TOGROK</span> CCM Initialization    <br/>
 188 P<sub>SPLIT</sub> used here&hellip; how, again?
 189 </li>
 190 <li id="sec-5.3"><span class="done">DONE</span> DMV Initialization probabilities<br/>
 191 (from initialization frequency)
 192 </li>
 193 <li id="sec-5.4"><span class="done">DONE</span> DMV Initialization frequencies    <br/>
 194 <span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-05-27 Tue 20:04</span><br/>
 195 <ul>
 196 <li id="sec-5.4.1">P_STOP    <br/>
 197 P<sub>STOP</sub> is not well defined by K&amp;M. One possible interpretation given
 198 the sentence [det nn vb nn] is
 199 <pre>
 200  f_{STOP}( STOP|det, L, adj) +1
 201  f_{STOP}(-STOP|det, L, adj) +0
 202  f_{STOP}( STOP|det, L, non_adj) +1
 203  f_{STOP}(-STOP|det, L, non_adj) +0
 204  f_{STOP}( STOP|det, R, adj) +0
 205  f_{STOP}(-STOP|det, R, adj) +1
 206
 207  f_{STOP}( STOP|nn, L, adj) +0
 208  f_{STOP}(-STOP|nn, L, adj) +1
 209  f_{STOP}( STOP|nn, L, non_adj) +1  # since there's at least one to the left
 210  f_{STOP}(-STOP|nn, L, non_adj) +0
 211 </pre>
 212 <ul>
 213 <li id="sec-5.4.1.1"><span class="todo">TODO</span> tweak<br/>
 214 <pre>
 215             f[head,  'STOP', 'LN'] += (i_h &lt;= 1)     # first two words
 216             f[head, '-STOP', 'LN'] += (not i_h &lt;= 1)
 217             f[head,  'STOP', 'LA'] += (i_h == 0)     # very first word
 218             f[head, '-STOP', 'LA'] += (not i_h == 0)
 219             f[head,  'STOP', 'RN'] += (i_h &gt;= n - 2) # last two words
 220             f[head, '-STOP', 'RN'] += (not i_h &gt;= n - 2)
 221             f[head,  'STOP', 'RA'] += (i_h == n - 1) # very last word
 222             f[head, '-STOP', 'RA'] += (not i_h == n - 1)
 223 </pre>
 224 vs
 225 <pre>
 226             # this one requires some additional rewriting since it
 227             # introduces divisions by zero
 228             f[head,  'STOP', 'LN'] += (i_h == 1)     # second word
 229             f[head, '-STOP', 'LN'] += (not i_h &lt;= 1) # not first two
 230             f[head,  'STOP', 'LA'] += (i_h == 0)     # first word
 231             f[head, '-STOP', 'LA'] += (not i_h == 0) # not first
 232             f[head,  'STOP', 'RN'] += (i_h == n - 2)     # second-to-last
 233             f[head, '-STOP', 'RN'] += (not i_h &gt;= n - 2) # not last two
 234             f[head,  'STOP', 'RA'] += (i_h == n - 1)     # last word
 235             f[head, '-STOP', 'RA'] += (not i_h == n - 1) # not last
 236 </pre>
 237 vs
 238 <pre>
 239             f[head,  'STOP', 'LN'] += (i_h == 1)     # second word
 240             f[head, '-STOP', 'LN'] += (not i_h == 1) # not second
 241             f[head,  'STOP', 'LA'] += (i_h == 0)     # first word
 242             f[head, '-STOP', 'LA'] += (not i_h == 0) # not first
 243             f[head,  'STOP', 'RN'] += (i_h == n - 2)     # second-to-last
 244             f[head, '-STOP', 'RN'] += (not i_h == n - 2) # not second-to-last
 245             f[head,  'STOP', 'RA'] += (i_h == n - 1)     # last word
 246             f[head, '-STOP', 'RA'] += (not i_h == n - 1) # not last
 247 </pre>
 248 vs
 249 "all words take the same number of arguments" interpreted as
 250 <pre>
 251 for all heads:
 252     p_STOP(head, 'STOP', 'LN') = 0.3
 253     p_STOP(head, 'STOP', 'LA') = 0.5
 254     p_STOP(head, 'STOP', 'RN') = 0.4
 255     p_STOP(head, 'STOP', 'RA') = 0.7
 256 </pre>
 257 (which we easily may tweak in init_zeros())
 258 </li>
 259 </ul>
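<p>
A runnable sketch of the first counting variant above, assuming the corpus is
a list of tag lists. The function names <code>stop_frequencies</code> and
<code>p_stop</code> are hypothetical and not taken from dmv.py; the keys
follow the f[head, 'STOP', 'LN'] layout used in the snippets.
</p>

```python
from collections import defaultdict


def stop_frequencies(corpus):
    """Count STOP / -STOP events per head tag (first variant)."""
    f = defaultdict(int)
    for sent in corpus:
        n = len(sent)
        for i_h, head in enumerate(sent):
            stops = {
                "LN": i_h <= 1,        # first two words
                "LA": i_h == 0,        # very first word
                "RN": i_h >= n - 2,    # last two words
                "RA": i_h == n - 1,    # very last word
            }
            for diradj, stop in stops.items():
                f[head, "STOP", diradj] += int(stop)
                f[head, "-STOP", diradj] += int(not stop)
    return f


def p_stop(f, head, diradj):
    """Turn the two counts into a probability; head must occur in the corpus."""
    stop, cont = f[head, "STOP", diradj], f[head, "-STOP", diradj]
    return stop / (stop + cont)


f = stop_frequencies([["det", "nn", "vb", "nn"]])
```

<p>
In this variant each head occurrence adds exactly one count to either STOP or
-STOP, so the denominator is never zero for a tag that occurs in the corpus.
The second variant does not have that property, which is where its divisions
by zero come from.
</p>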
 260 </li>
 261 <li id="sec-5.4.2">P_CHOOSE<br/>
 262 Go through the corpus, counting distances between heads and
 263 arguments. In [det nn vb nn], we give
 264 <ul>
 265 <li>
 266 f<sub>CHOOSE</sub>(nn|det, R) +1/1 + C
 267 </li>
 268 <li>
 269 f<sub>CHOOSE</sub>(vb|det, R) +1/2 + C
 270 </li>
 271 <li>
 272 f<sub>CHOOSE</sub>(nn|det, R) +1/3 + C
 273 <ul>
 274 <li>
 275 If this were the full corpus, P<sub>CHOOSE</sub>(nn|det, R) would be
 276 (1+1/3+2C) / &sum;<sub>a</sub> f<sub>CHOOSE</sub>(a|det, R)
 277
 278 </li>
 279 </ul>
 280 </li>
 281 </ul>
 282
 283 <p>The ROOT gets "each argument with equal probability", so in a sentence
 284 of three words, 1/3 for each (in [nn vb nn], 'nn' gets 2/3). Basically
 285 just a frequency count of the corpus&hellip;
 286 </p></li>
 287 </ul>
 288 </li>
 289 </ul>
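<p>
The P_CHOOSE counting described above (1/distance plus a constant C per
head-argument pair, and equal shares for ROOT) might look like this. It is a
sketch only: the function names are hypothetical, the value of C is not given
in these notes (0.0 below is just a placeholder), and keeping the ROOT counts
in a separate dict is an assumption made to keep the example small.
</p>

```python
from collections import defaultdict


def choose_frequencies(corpus, C):
    """Return (f_choose, f_root) harmonic counts for a tagged corpus."""
    f_choose = defaultdict(float)   # key: (argument, head, direction)
    f_root = defaultdict(float)     # key: argument
    for sent in corpus:
        for i_h, head in enumerate(sent):
            for i_a, arg in enumerate(sent):
                if i_a == i_h:
                    continue
                direction = "R" if i_a > i_h else "L"
                f_choose[arg, head, direction] += 1.0 / abs(i_a - i_h) + C
        for arg in sent:
            f_root[arg] += 1.0 / len(sent)   # ROOT: every word equally likely
    return f_choose, f_root


def p_choose(f_choose, arg, head, direction):
    """Normalise over all arguments counted for this head and direction."""
    total = sum(v for (a, h, d), v in f_choose.items()
                if h == head and d == direction)
    return f_choose[arg, head, direction] / total


f_choose, f_root = choose_frequencies([["det", "nn", "vb", "nn"]], C=0.0)
```

<p>
With C = 0 and [det nn vb nn] as the whole corpus, f_CHOOSE(nn|det, R) comes
out as 1 + 1/3, matching the (1+1/3+2C) numerator above.
</p>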
 290 </div>
 291
 292 </div>
 293
 294 <div id="outline-container-6" class="outline-2">
 295 <h2 id="sec-6">6 [#C] Deferred</h2>
 296 <div id="text-6">
 297
 298 <ul>
 299 <li id="sec-6.1"><span class="todo">TODO</span> inner_dmv() should disregard rules with heads not in sent     &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
 300 If the sentence is "nn vbd det nn", we should not even look at rules
 301 where
 302 <pre>
 303  rule.head() not in "nn vbd det nn".split()
 304 </pre>
 305 This is ruled out by getting rules from g.rules(LHS, sent).
 306
 307 <p>
 308 Also, we optimize this further by saying we don't even recurse into
 309 attachment rules where
 310 <pre>
 311  rule.head() not in sent[ s :r+1]
 312  rule.head() not in sent[r+1:t+1]
 313 </pre>
 314 meaning, if we're looking at the span "vbd det", we only use
 315 attachment rules where both daughters are members of ['vbd','det']
 316 (although we don't (yet) care about removing rules that rewrite to the
 317 same tag if there are no duplicate tags in the span, etc., that would
 318 be a lot of trouble for little potential gain).
 319 </p></li>
 320 <li id="sec-6.2"><span class="todo">TODO</span> when reestimating P_STOP etc, remove rules with p &lt; epsilon   &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
 321 </li>
 322 <li id="sec-6.3"><span class="todo">TODO</span> inner_dmv, short ranges and impossible attachment             &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
 323 If t-s &lt;= 2, there can be only one attachment below, so don't recurse
 324 with both Lattach=True and Rattach=True.
 325
 326 <p>
 327 If t-s &lt;= 1, there can be no attachment below, so only recurse with
 328 Lattach=False, Rattach=False.
 329 </p>
 330 <p>
 331 Put this in the loop under rewrite rules (could also do it in the STOP
 332 section, but that would only have an effect on very short sentences).
 333 </p></li>
 334 <li id="sec-6.4"><span class="todo">TODO</span> clean up the module files                                     &nbsp;&nbsp;&nbsp;<span class="tag">PRETTIER</span><br/>
 335 Is there a better way to divide dmv and harmonic? There's a two-way
 336 dependency between the modules. Guess there could be a third file that
 337 imports both the initialization and the actual EM stuff, while a file
 338 containing constants and classes could be imported by all others:
 339 <pre>
 340  dmv.py imports dmv_EM.py imports dmv_classes.py
 341  dmv.py imports dmv_inits.py imports dmv_classes.py
 342 </pre>
 343
 344 </li>
 345 <li id="sec-6.5"><span class="todo">TOGROK</span> Some (tagged) sentences are bound to come twice             &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
 346 E.g., first sort and count, so that the corpus
 347 [['nn','vbd','det','nn'],
 348 ['vbd','nn','det','nn'],
 349 ['nn','vbd','det','nn']]
 350 becomes
 351 [(['nn','vbd','det','nn'],2),
 352 (['vbd','nn','det','nn'],1)]
 353 and then in each loop through sentences, make sure we handle the
 354 frequency correctly.
 355
 356 <p>
 357 Is there much to gain here?
 358 </p>
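<p>
One way to do the sort-and-count step, sketched with the standard library's
<code>collections.Counter</code>. The variable names are made up for
illustration.
</p>

```python
from collections import Counter

corpus = [['nn', 'vbd', 'det', 'nn'],
          ['vbd', 'nn', 'det', 'nn'],
          ['nn', 'vbd', 'det', 'nn']]

# Lists are unhashable, so count tuples and convert back afterwards.
counts = Counter(tuple(sent) for sent in corpus)
weighted = [(list(sent), freq) for sent, freq in counts.most_common()]
```

<p>
Each loop over sentences would then multiply the expected counts for a
sentence by its frequency instead of recomputing them.
</p>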
 359 </li>
 360 <li id="sec-6.6"><span class="todo">TOGROK</span> tags as numbers or tags as strings?                         &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
 361 Need to clean up the representation.
 362
 363 <p>
 364 Stick with tag-strings in initialization, then switch to numbers for
 365 the IO-algorithm perhaps? We can probably afford more string-matching in
 366 initialization.
 367 </p></li>
 368 </ul>
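<p>
The span-based head filter from the first deferred item could be expressed as
a small predicate. This is a sketch with hypothetical names: it takes the head
tags of the two daughters directly instead of calling rule.head(), and r is
the split point as in the slices above.
</p>

```python
def usable_attachment(left_head, right_head, sent, s, r, t):
    """True if each daughter's head tag occurs in its own part of span s..t."""
    return left_head in sent[s:r + 1] and right_head in sent[r + 1:t + 1]


sent = "nn vbd det nn".split()
# The span "vbd det" is s=1, t=2, and its only split point is r=1.
ok = usable_attachment("vbd", "det", sent, 1, 1, 2)
skipped = usable_attachment("nn", "det", sent, 1, 1, 2)
```

<p>
A rule for which the predicate is false cannot contribute to the span, so
inner_dmv() would not need to recurse into it.
</p>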
 369 </div>
 370
 371 </div>
 372
 373 <div id="outline-container-7" class="outline-2">
 374 <h2 id="sec-7">7 Expectation Maximization in IO/DMV-terms</h2>
 375 <div id="text-7">
 376
 377 <p>inner(s,t,LHS) calculates the expected number of trees headed by LHS
 378 from s to t (sentence positions). This uses the P_STOP and P_CHOOSE
 379 values, which have been conveniently distributed into CNF rules as
 380 probN and probA (non-adjacent and adjacent probabilities).
 381 </p>
 382 <p>
 383 When re-estimating, we use the expected values from inner() to get new
 384 values for P_STOP and P_CHOOSE. When we've re-estimated for the entire
 385 corpus, we distribute P_STOP and P_CHOOSE into the CNF rules again, so
 386 that in the next round we use new probN and probA to find
 387 inner-probabilities.
 388 </p>
 389 <p>
 390 The distribution of P_STOP and P_CHOOSE into CNF rules also happens in
 391 init_normalize() (here along with the creation of P_STOP and
 392 P_CHOOSE); P_STOP is used to create CNF rules where one branch of the
 393 rule is STOP, P_CHOOSE is used to create rules of the form
 394 <pre>
 395  h  -&gt; h  _a_
 396  h_ -&gt; h_ _a_
 397 </pre>
 398 </p>
 399 <p>
 400 Since "adjacency" is not captured in regular CNF rules, we need two
 401 probabilities for each rule, and inner() has to know when to use which.
 402 </p>
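<p>
A minimal sketch of a rule that carries both probabilities, assuming a
stand-in class <code>Rule</code> with a hypothetical method <code>p()</code>.
The real class is DMV_Rule in dmv.py, whose interface may differ. The numbers
are made up.
</p>

```python
from dataclasses import dataclass


@dataclass
class Rule:
    lhs: str
    left: str
    right: str
    probN: float   # non-adjacent probability
    probA: float   # adjacent probability

    def p(self, adjacent):
        """Pick the probability matching the adjacency of this attachment."""
        return self.probA if adjacent else self.probN


rule = Rule(lhs="h_", left="h_", right="_a_", probN=0.25, probA=0.5)
adj_prob = rule.p(adjacent=True)
non_adj_prob = rule.p(adjacent=False)
```

<p>
Keeping the choice inside the rule means inner() only has to work out whether
an attachment is adjacent, not which field to read.
</p>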
 403 <ul>
 404 <li id="sec-7.1"><span class="todo">TODO</span> Corpus access<br/>
 405 </li>
 406 <li id="sec-7.2"><span class="todo">TOGROK</span> sentences or rules as the "outer loop"?                     &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
 407 In regard to the E/M-step, finding P<sub>STOP</sub>, P<sub>CHOOSE</sub>.
 408
 409
 410 </li>
 411 </ul>
 412 </div>
 413
 414 </div>
 415
 416 <div id="outline-container-8" class="outline-2">
 417 <h2 id="sec-8">8 Python-stuff</h2>
 418 <div id="text-8">
 419
 420 <ul>
 421 <li>
 422 <a href="src/pseudo.py">pseudo.py</a>
 423 </li>
 424 <li>
 425 <a href="http://nltk.org/doc/en/structured-programming.html">http://nltk.org/doc/en/structured-programming.html</a> recursive dynamic
 426 </li>
 427 <li>
 428 <a href="http://nltk.org/doc/en/advanced-parsing.html">http://nltk.org/doc/en/advanced-parsing.html</a>
 429 </li>
 430 <li>
 431 <a href="http://jaynes.colorado.edu/PythonIdioms.html">http://jaynes.colorado.edu/PythonIdioms.html</a>
 432
 433
 434
 435 </li>
 436 </ul>
 437 </div>
 438
 439 </div>
 440
 441 <div id="outline-container-9" class="outline-2">
 442 <h2 id="sec-9">9 Git</h2>
 443 <div id="text-9">
 444
 445 <p>Setting up a new project:
 446 <pre>
 447  git init
 448  git add .
 449  git commit -m "first release"
 450 </pre>
 451 </p>
 452 <p>
 453 Later on (<code>-a</code> automatically stages modified and deleted tracked files,
 454 but new files still need <code>git add</code>):
 455 <pre>
 456  git commit -a -m "some subsequent release"
 457 </pre>
 458 </p>
 459 <p>
 460 Then push stuff up to the remote server:
 461 <pre>
 462  git push git+ssh://username@repo.or.cz/srv/git/dmvccm.git master
 463 </pre>
 464 </p>
 465 <p>
 466 (<code>eval `ssh-agent`</code> and <code>ssh-add</code> to avoid having to type in the passphrase all
 467 the time)
 468 </p>
 469 <p>
 470 Make a local copy of the remote repository:
 471 <pre>
 472  git clone git://repo.or.cz/dmvccm.git
 473 </pre>
 474 </p>
 475 <p>
 476 Make and name a new branch in this folder:
 477 <pre>
 478  git checkout -b mybranch
 479 </pre>
 480 </p>
 481 <p>
 482 To save changes in <code>mybranch</code>:
 483 <pre>
 484  git commit -a
 485 </pre>
 486 </p>
 487 <p>
 488 Go back to the master branch (uncommitted changes from <code>mybranch</code> are
 489 carried over):
 490 <pre>
 491  git checkout master
 492 </pre>
 493 </p>
 494 <p>
 495 Try out:
 496 <pre>
 497  git add --interactive
 498 </pre>
 499 </p>
 500 <p>
 501 Good tutorial:
 502 <a href="http://www-cs-students.stanford.edu/~blynn//gitmagic/">http://www-cs-students.stanford.edu/~blynn//gitmagic/</a>
 503 </p></div>
 504 </div>
 505 <div id="postamble"><p class="author"> Author: Kevin Brubeck Unhammer
 506 <a href="mailto:K.BrubeckUnhammer at student uva nl ">&lt;K.BrubeckUnhammer at student uva nl &gt;</a>
 507 </p>
 508 <p class="date"> Date: 2008/06/06 11:55:47</p>
 509 </div><p class="postamble">Written using emacs + <a href='http://orgmode.org/'>org-mode</a></p></body>
 510 </html>