DMVCCM.html

   1 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
   2                "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
   3 <html xmlns="http://www.w3.org/1999/xhtml"
   4 lang="en" xml:lang="en">
   5 <head>
   6 <title>DMV/CCM &ndash; todo-list / progress</title>
   7 <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
   8 <meta name="generator" content="Org-mode"/>
   9 <meta name="generated" content="2008/06/04 15:01:15"/>
  10 <meta name="author" content="Kevin Brubeck Unhammer"/>
  11 <link rel="stylesheet" type="text/css" href="http://www.student.uib.no/~kun041/org.css">
  12 </head><body>
  13 <h1 class="title">DMV/CCM &ndash; todo-list / progress</h1>
  14 #+OPTIONS: ^:{} H:4 toc:3
  15
  16
  17
  18 <div id="table-of-contents">
  19 <h2>Table of Contents</h2>
  20 <div id="text-table-of-contents">
  21 <ul>
  22 <li><a href="#sec-1">1 dmvccm report and project</a></li>
  23 <li><a href="#sec-2">2 Adjacency and combining it with inner()</a></li>
  24 <li><a href="#sec-3">3 [#A] P_STOP for IO/EM</a></li>
  25 <li><a href="#sec-4">4 P_CHOOSE for IO/EM</a></li>
  26 <li><a href="#sec-5">5 Initialization   </a></li>
  27 <li><a href="#sec-6">6 [#C] Deferred</a></li>
  28 <li><a href="#sec-7">7 Expectation Maximation in IO/DMV-terms</a></li>
  29 <li><a href="#sec-8">8 Python-stuff</a></li>
  30 </ul>
  31 </div>
  32 </div>
  33
  34 <div id="outline-container-1" class="outline-2">
  35 <h2 id="sec-1">1 dmvccm report and project</h2>
  36 <div id="text-1">
  37
  38 <p><span class="timestamp-kwd">DEADLINE: </span> <span class="timestamp">2008-06-30 Mon</span><br/>
  39 But absolute, extended, really-quite-dead-now deadline: August
  40 31&hellip;
  41 </p><ul>
  42 <li>
  43 <a href="src/dmv.py">dmv.py</a>
  44 </li>
  45 <li>
  46 <a href="src/io.py">io.py</a>
  47 </li>
  48 <li>
  49 <a href="src/harmonic.py">harmonic.py</a>
  50 </li>
  51 </ul>
  52 </div>
  53
  54 </div>
  55
  56 <div id="outline-container-2" class="outline-2">
  57 <h2 id="sec-2">2 <span class="todo">TODO</span> Adjacency and combining it with inner()</h2>
  58 <div id="text-2">
  59
  60 <p>Each DMV_Rule now has both a probN and a probA, for
  61 adjacencies. inner() needs the correct one in each case.
  62 </p>
  63 <p>
  64 Adjacency gives a problem with duplicate words/tags, eg. in the
  65 sentence "a a b". If this has the dependency structure b-&gt;a<sub>0</sub>-&gt;a<sub>1</sub>,
  66 then b is non-adjacent to a<sub>0</sub> and should use probN (for the LRStop and
  67 the attachment of a<sub>0</sub>), while the other rules should all use
  68 probA. But within the e(0,2,b) we can't just say "oh, a has index 0
  69 so it's not adjacent to 2", since there's also an a at index 1, and
  70 there's also a dependency structure b-&gt;a<sub>1</sub>-&gt;a<sub>0</sub> for that. We want
  71 both. And in possibly much more complex versions.
  72 </p>
  73 <p>
  74 Ideas:
  75 </p><ul>
  76 <li>
  77 I first thought of decorating the individual words/tags in a
  78 sentence with their indices, and perhaps just duplicating the
  79 relevant rules (one for each index of the duplicate tags). But this
  80 gives an explosion in attachment rules (although a contained
  81 explosion, within the rules used in a sentence; but most sentences
  82 will have at least two NN's so it will be a problem).
  83 </li>
  84 <li>
  85 Then, I had a <i>brilliant</i> idea. Just let e(), the helper function of
  86 inner(), parametrize for an extra pair of boolean values for whether
  87 or not we've attached anything to the left or right yet ("yet"
  88 meaning "below"). So now, e() has a chart of the form [s, t, LHS,
  89 Lattach, Rattach], and of course e(s,t,LHS) is the sum of the four
  90 possible values for (Lattach,Rattach). This makes e() lots more
  91 complex and DMV-specific though, so it's been rewritten in
  92 inner_dmv() in dmv.py.
  93 </li>
  94 <li id="sec-2.1"><span class="todo">TODO</span> document this adjacency stuff better<br/>
  95 </li>
  96 <li id="sec-2.2"><span class="todo">TODO</span> test and debug my brilliant idea<br/>
  97 </li>
  98 <li id="sec-2.3"><span class="done">DONE</span> implement my brilliant idea.<br/>
  99 <span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-06-01 Sun 17:19</span><br/>
 100 <a href="src/dmv.py">e(sti) in dmv.py</a>
 101
 102 </li>
 103 <li id="sec-2.4"><span class="done">DONE</span> [#A] test inner() on sentences with duplicate words<br/>
 104 Works with eg. the sentence "h h h"
 105
 106
 107 </li>
 108 </ul>
 109 </div>
 110
 111 </div>
 112
 113 <div id="outline-container-3" class="outline-2">
 114 <h2 id="sec-3">3 <span class="todo">TODO</span> [#A] P_STOP for IO/EM</h2>
 115 <div id="text-3">
 116
 117 <p><a href="src/dmv.py">dmv-P_STOP</a>
 118 Remember: The P<sub>STOP</sub> formula is upside-down (left-to-right also).
 119 (In the article..not the thesis)
 120 </p>
 121 <p>
 122 Remember: Initialization makes some "short-cut" rules, these will also
 123 have to be updated along with the other P<sub>STOP</sub> updates:
 124 </p><ul>
 125 <li>
 126 b[(NOBAR, n<sub>h</sub>), 'h'] = 1.0       # always
 127 </li>
 128 <li>
 129 b[(RBAR, n<sub>h</sub>), 'h'] = h_.probA  # h_ is RBAR stop rule
 130 </li>
 131 <li>
 132 b[(LRBAR, n<sub>h</sub>), 'h'] = h_.probA * _ h_.probA
 133
 134 </li>
 135 <li id="sec-3.1">How is the P_STOP formula different given other values for dir and adj?<br/>
 136 (Presumably, the P<sub>STOP</sub> formula where STOP is True is just the
 137 rule-probability of _ h_ -&gt; STOP h_ or h_ -&gt; h STOP, but how does
 138 adjacency fit in here?)
 139
 140 <p>
 141 (And P<sub>STOP</sub>(-STOP|&hellip;) = 1 - P<sub>STOP</sub>(STOP|&hellip;) )
 142 </p></li>
 143 </ul>
 144 </div>
 145
 146 </div>
 147
 148 <div id="outline-container-4" class="outline-2">
 149 <h2 id="sec-4">4 <span class="todo">TODO</span> P_CHOOSE for IO/EM</h2>
 150 <div id="text-4">
 151
 152 <p>Write the formulas! should be easy?
 153 </p></div>
 154
 155 </div>
 156
 157 <div id="outline-container-5" class="outline-2">
 158 <h2 id="sec-5">5 Initialization   </h2>
 159 <div id="text-5">
 160
 161 <p><a href="/Users/kiwibird/Documents/Skole/V08/Probability/dmvccm/src/dmv.py">dmv-inits</a>
 162 </p>
 163 <p>
 164 We do have to go through the corpus, since the probabilities are based
 165 on how far away in the sentence arguments are from their heads.
 166 </p><ul>
 167 <li id="sec-5.1"><span class="todo">TODO</span> Separate initialization to another file?                      &nbsp;&nbsp;&nbsp;<span class="tag">PRETTIER</span><br/>
 168 (It's rather messy.)
 169 </li>
 170 <li id="sec-5.2"><span class="todo">TOGROK</span> CCM Initialization    <br/>
 171 P<sub>SPLIT</sub> used here&hellip; how, again?
 172 </li>
 173 <li id="sec-5.3"><span class="done">DONE</span> DMV Initialization probabilities<br/>
 174 (from initialization frequency)
 175 </li>
 176 <li id="sec-5.4"><span class="done">DONE</span> DMV Initialization frequencies    <br/>
 177 <span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-05-27 Tue 20:04</span><br/>
 178 <ul>
 179 <li id="sec-5.4.1">P_STOP    <br/>
 180 P<sub>STOP</sub> is not well defined by K&amp;M. One possible interpretation given
 181 the sentence [det nn vb nn] is
 182 <pre>
 183  f_{STOP}( STOP|det, L, adj) +1
 184  f_{STOP}(-STOP|det, L, adj) +0
 185  f_{STOP}( STOP|det, L, non_adj) +1
 186  f_{STOP}(-STOP|det, L, non_adj) +0
 187  f_{STOP}( STOP|det, R, adj) +0
 188  f_{STOP}(-STOP|det, R, adj) +1
 189
 190  f_{STOP}( STOP|nn, L, adj) +0
 191  f_{STOP}(-STOP|nn, L, adj) +1
 192  f_{STOP}( STOP|nn, L, non_adj) +1  # since there's at least one to the left
 193  f_{STOP}(-STOP|nn, L, non_adj) +0
 194 </pre>
 195 <ul>
 196 <li id="sec-5.4.1.1"><span class="todo">TODO</span> tweak<br/>
 197 <pre>
 198             f[head,  'STOP', 'LN'] += (i_h &lt;= 1)     # first two words
 199             f[head, '-STOP', 'LN'] += (not i_h &lt;= 1)
 200             f[head,  'STOP', 'LA'] += (i_h == 0)     # very first word
 201             f[head, '-STOP', 'LA'] += (not i_h == 0)
 202             f[head,  'STOP', 'RN'] += (i_h &gt;= n - 2) # last two words
 203             f[head, '-STOP', 'RN'] += (not i_h &gt;= n - 2)
 204             f[head,  'STOP', 'RA'] += (i_h == n - 1) # very last word
 205             f[head, '-STOP', 'RA'] += (not i_h == n - 1)
 206 </pre>
 207 vs
 208 <pre>
 209             # this one requires some additional rewriting since it
 210             # introduces divisions by zero
 211             f[head,  'STOP', 'LN'] += (i_h == 1)     # second word
 212             f[head, '-STOP', 'LN'] += (not i_h &lt;= 1) # not first two
 213             f[head,  'STOP', 'LA'] += (i_h == 0)     # first word
 214             f[head, '-STOP', 'LA'] += (not i_h == 0) # not first
 215             f[head,  'STOP', 'RN'] += (i_h == n - 2)     # second-to-last
 216             f[head, '-STOP', 'RN'] += (not i_h &gt;= n - 2) # not last two
 217             f[head,  'STOP', 'RA'] += (i_h == n - 1)     # last word
 218             f[head, '-STOP', 'RA'] += (not i_h == n - 1) # not last
 219 </pre>
 220 vs
 221 <pre>
 222             f[head,  'STOP', 'LN'] += (i_h == 1)     # second word
 223             f[head, '-STOP', 'LN'] += (not i_h == 1) # not second
 224             f[head,  'STOP', 'LA'] += (i_h == 0)     # first word
 225             f[head, '-STOP', 'LA'] += (not i_h == 0) # not first
 226             f[head,  'STOP', 'RN'] += (i_h == n - 2)     # second-to-last
 227             f[head, '-STOP', 'RN'] += (not i_h == n - 2) # not second-to-last
 228             f[head,  'STOP', 'RA'] += (i_h == n - 1)     # last word
 229             f[head, '-STOP', 'RA'] += (not i_h == n - 1) # not last
 230 </pre>
 231 vs
 232 "all words take the same number of arguments" interpreted as
 233 <pre>
 234 for all heads:
 235     p_STOP(head, 'STOP', 'LN') = 0.3
 236     p_STOP(head, 'STOP', 'LA') = 0.5
 237     p_STOP(head, 'STOP', 'RN') = 0.4
 238     p_STOP(head, 'STOP', 'RA') = 0.7
 239 </pre>
 240 (which we easily may tweak in init_zeros())
 241 </li>
 242 </ul>
 243 </li>
 244 <li id="sec-5.4.2">P_CHOOSE<br/>
 245 Go through the corpus, counting distances between heads and
 246 arguments. In [det nn vb nn], we give
 247 <ul>
 248 <li>
 249 f<sub>CHOOSE</sub>(nn|det, R) +1/1 + C
 250 </li>
 251 <li>
 252 f<sub>CHOOSE</sub>(vb|det, R) +1/2 + C
 253 </li>
 254 <li>
 255 f<sub>CHOOSE</sub>(nn|det, R) +1/3 + C
 256 <ul>
 257 <li>
 258 If this were the full corpus, P<sub>CHOOSE</sub>(nn|det, R) would have
 259 (1+1/3+2C) / sum_a f<sub>CHOOSE</sub>(a|det, R)
 260
 261 </li>
 262 </ul>
 263 </li>
 264 </ul>
 265
 266 <p>The ROOT gets "each argument with equal probability", so in a sentence
 267 of three words, 1/3 for each (in [nn vb nn], 'nn' gets 2/3). Basically
 268 just a frequency count of the corpus&hellip;
 269 </p></li>
 270 </ul>
 271 </li>
 272 </ul>
 273 </div>
 274
 275 </div>
 276
 277 <div id="outline-container-6" class="outline-2">
 278 <h2 id="sec-6">6 [#C] Deferred</h2>
 279 <div id="text-6">
 280
 281 <ul>
 282 <li id="sec-6.1"><span class="todo">TODO</span> inner_dmv() should disregard rules with heads not in sent     &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
 283 If the sentence is "nn vbd det nn", we should not even look at rules
 284 where
 285 <pre>
 286  rule.head() not in "nn vbd det nn".split()
 287 </pre>
 288 This is ruled out by getting rules from g.rules(LHS, sent).
 289
 290 <p>
 291 Also, we optimize this further by saying we don't even recurse into
 292 attachment rules where
 293 <pre>
 294  rule.head() not in sent[ s :r+1]
 295  rule.head() not in sent[r+1:t+1]
 296 </pre>
 297 meaning, if we're looking at the span "vbd det", we only use
 298 attachment rules where both daughters are members of ['vbd','det']
 299 (although we don't (yet) care about removing rules that rewrite to the
 300 same tag if there are no duplicate tags in the span, etc., that would
 301 be a lot of trouble for little potential gain).
 302 </p></li>
 303 <li id="sec-6.2"><span class="todo">TODO</span> when reestimating P_STOP etc, remove rules with p &lt; epsilon   &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
 304 </li>
 305 <li id="sec-6.3"><span class="todo">TODO</span> inner_dmv, short ranges and impossible attachment             &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
 306 If s-t &lt;= 2, there can be only one attachment below, so don't recurse
 307 with both Lattach=True and Rattach=True.
 308
 309 <p>
 310 If s-t &lt;= 1, there can be no attachment below, so only recurse with
 311 Lattach=False, Rattach=False.
 312 </p>
 313 <p>
 314 Put this in the loop under rewrite rules (could also do it in the STOP
 315 section, but that would only have an effect on very short sentences).
 316 </p></li>
 317 <li id="sec-6.4"><span class="todo">TODO</span> clean up the module files                                     &nbsp;&nbsp;&nbsp;<span class="tag">PRETTIER</span><br/>
 318 Is there better way to divide dmv and harmonic? There's a two-way
 319 dependency between the modules. Guess there could be a third file that
 320 imports both the initialization and the actual EM stuff, while a file
 321 containing constants and classes could be imported by all others:
 322 <pre>
 323  dmv.py imports dmv_EM.py imports dmv_classes.py
 324  dmv.py imports dmv_inits.py imports dmv_classes.py
 325 </pre>
 326
 327 </li>
 328 <li id="sec-6.5"><span class="todo">TOGROK</span> Some (tagged) sentences are bound to come twice             &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
 329 Eg, first sort and count, so that the corpus
 330 [['nn','vbd','det','nn'],
 331 ['vbd','nn','det','nn'],
 332 ['nn','vbd','det','nn']]
 333 becomes
 334 [(['nn','vbd','det','nn'],2),
 335 (['vbd','nn','det','nn'],1)]
 336 and then in each loop through sentences, make sure we handle the
 337 frequency correctly.
 338
 339 <p>
 340 Is there much to gain here?
 341 </p>
 342 </li>
 343 <li id="sec-6.6"><span class="todo">TOGROK</span> tags as numbers or tags as strings?                         &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
 344 Need to clean up the representation.
 345
 346 <p>
 347 Stick with tag-strings in initialization then switch to numbers for
 348 IO-algorithm perhaps? Can probably afford more string-matching in
 349 initialization..
 350 </p></li>
 351 </ul>
 352 </div>
 353
 354 </div>
 355
 356 <div id="outline-container-7" class="outline-2">
 357 <h2 id="sec-7">7 Expectation Maximation in IO/DMV-terms</h2>
 358 <div id="text-7">
 359
 360 <p>inner(s,t,LHS) calculates the expected number of trees headed by LHS
 361 from s to t (sentence positions). This uses the P_STOP and P_CHOOSE
 362 values, which have been conveniently distributed into CNF rules as
 363 probN and probA (non-adjacent and adjacent probabilites).
 364 </p>
 365 <p>
 366 When re-estimating, we use the expected values from inner() to get new
 367 values for P_STOP and P_CHOOSE. When we've re-estimated for the entire
 368 corpus, we distribute P_STOP and P_CHOOSE into the CNF rules again, so
 369 that in the next round we use new probN and probA to find
 370 inner-probabilites.
 371 </p>
 372 <p>
 373 The distribution of P_STOP and P_CHOOSE into CNF rules also happens in
 374 init_normalize() (here along with the creation of P_STOP and
 375 P_CHOOSE); P_STOP is used to create CNF rules where one branch of the
 376 rule is STOP, P_CHOOSE is used to create rules of the form
 377 <pre>
 378  h  -&gt; h  _a_
 379  h_ -&gt; h_ _a_
 380 </pre>
 381 </p>
 382 <p>
 383 Since "adjacency" is not captured in regular CNF rules, we need two
 384 probabilites for each rule, and inner() has to know when to use which.
 385 </p>
 386 <ul>
 387 <li id="sec-7.1"><span class="todo">TODO</span> Corpus access<br/>
 388 </li>
 389 <li id="sec-7.2"><span class="todo">TOGROK</span> sentences or rules as the "outer loop"?                     &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
 390 In regard to the E/M-step, finding P<sub>STOP</sub>, P<sub>CHOOSE</sub>.
 391
 392
 393 </li>
 394 </ul>
 395 </div>
 396
 397 </div>
 398
 399 <div id="outline-container-8" class="outline-2">
 400 <h2 id="sec-8">8 Python-stuff</h2>
 401 <div id="text-8">
 402
 403 <ul>
 404 <li>
 405 <a href="src/pseudo.py">pseudo.py</a>
 406 </li>
 407 <li>
 408 <a href="http://nltk.org/doc/en/structured-programming.html">http://nltk.org/doc/en/structured-programming.html</a> recursive dynamic
 409 </li>
 410 <li>
 411 <a href="http://nltk.org/doc/en/advanced-parsing.html">http://nltk.org/doc/en/advanced-parsing.html</a>
 412 </li>
 413 <li>
 414 <a href="http://jaynes.colorado.edu/PythonIdioms.html">http://jaynes.colorado.edu/PythonIdioms.html</a>
 415
 416
 417 </li>
 418 </ul>
 419 </div>
 420 </div>
 421 <div id="postamble"><p class="author"> Author: Kevin Brubeck Unhammer
 422 <a href="mailto:K.BrubeckUnhammer at student uva nl ">&lt;K.BrubeckUnhammer at student uva nl &gt;</a>
 423 </p>
 424 <p class="date"> Date: 2008/06/04 15:01:15</p>
 425 </div><p class="postamble">Skrive vha. emacs + <a href='http://orgmode.org/'>org-mode</a></p></body>
 426 </html>