DMVCCM.html.~19~

   1 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
   2                "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
   3 <html xmlns="http://www.w3.org/1999/xhtml"
   4 lang="en" xml:lang="en">
   5 <head>
   6 <title>DMV/CCM</title>
   7 <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
   8 <meta name="generator" content="Org-mode"/>
   9 <meta name="generated" content="2008/06/03 14:04:08"/>
  10 <meta name="author" content="Kevin Brubeck Unhammer"/>
  11 <link rel="stylesheet" type="text/css" href="http://www.student.uib.no/~kun041/org.css">
  12 </head><body>
  13 <h1 class="title">DMV/CCM</h1>
  14 <div id="table-of-contents">
  15 <h2>Table of Contents</h2>
  16 <ul>
  17 <li><a href="#sec-1">1 dmvccm</a>
  18 <ul>
  19 <li><a href="#sec-2">1.1 Adjacency and combining it with inner()</a>
  20 <ul>
  21 <li><a href="#sec-3">1.1.1 document this adjacency stuff better</a></li>
  22 <li><a href="#sec-4">1.1.2 test and debug my brilliant idea</a></li>
  23 <li><a href="#sec-5">1.1.3 implement my brilliant idea.</a></li>
  24 <li><a href="#sec-6">1.1.4 [#A] test inner() on sentences with duplicate words</a></li>
  25 </ul>
  26 </li>
  27 <li><a href="#sec-7">1.2 [#A] P_STOP for IO/EM</a>
  28 <ul>
  29 <li><a href="#sec-8">1.2.1 How is the P_STOP formula different given other values for dir and adj?</a></li>
  30 </ul>
  31 </li>
  32 <li><a href="#sec-9">1.3 P_CHOOSE for IO/EM</a></li>
  33 <li><a href="#sec-10">1.4 Initialization   </a>
  34 <ul>
  35 <li><a href="#sec-11">1.4.1 Separate initialization to another file?</a></li>
  36 <li><a href="#sec-12">1.4.2 CCM Initialization    </a></li>
  37 <li><a href="#sec-13">1.4.3 DMV Initialization probabilities</a></li>
  38 <li><a href="#sec-14">1.4.4 DMV Initialization frequencies    </a></li>
  39 </ul>
  40 </li>
  41 <li><a href="#sec-17">1.5 [#C] Deferred</a>
  42 <ul>
  43 <li><a href="#sec-18">1.5.1 when reestimating P_STOP etc, remove rules with p &lt; epsilon</a></li>
  44 <li><a href="#sec-19">1.5.2 inner_dmv, short ranges and impossible attachment</a></li>
  45 <li><a href="#sec-20">1.5.3 Some (tagged) sentences are bound to come twice</a></li>
  46 <li><a href="#sec-21">1.5.4 tags as numbers or tags as strings?</a></li>
  47 </ul>
  48 </li>
  49 <li><a href="#sec-22">1.6 Expectation Maximation in IO/DMV-terms</a>
  50 <ul>
  51 <li><a href="#sec-23">1.6.1 Corpus access</a></li>
  52 <li><a href="#sec-24">1.6.2 sentences or rules as the "outer loop"?</a></li>
  53 </ul></li>
  54 </ul>
  55 </li>
  56 <li><a href="#sec-25">2 Python-stuff</a></li>
  57 </ul>
  58 </div>
  59
  60 <div class="outline-2">
  61 <h2 id="sec-1">1 dmvccm</h2>
  62
  63 <p><span class="timestamp-kwd">DEADLINE: </span> <span class="timestamp">2008-06-30 Mon</span><br/>
  64 (But absolute, extended, really-quite-dead-now deadline: August 31&hellip;)
  65 <a href="src/dmv.py">dmv.py</a>
  66 <a href="src/io.py">io.py</a>
  67 </p>
  68 <div class="outline-3">
  69 <h3 id="sec-2">1.1 <span class="todo">TODO</span> Adjacency and combining it with inner()</h3>
  70
  71 <p>Each DMV_Rule now has both a probN and a probA, for
  72 adjacencies. inner() needs the correct one in each case.
  73 </p>
  74 <p>
  75 Adjacency gives a problem with duplicate words/tags, eg. in the
  76 sentence "a a b". If this has the dependency structure b-&gt;a<sub>0</sub>-&gt;a<sub>1</sub>,
  77 then b is non-adjacent to a<sub>0</sub> and should use probN (for the LRStop and
  78 the attachment of a<sub>0</sub>), while the other rules should all use
  79 probA. But within the e(0,2,b) we can't just say "oh, a has index 0
  80 so it's not adjacent to 2", since there's also an a at index 1, and
  81 there's also a dependency structure b-&gt;a<sub>1</sub>-&gt;a<sub>0</sub> for that. We want
  82 both. And in possibly much more complex versions.
  83 </p>
  84 <p>
  85 Ideas:
  86 </p><ul>
  87 <li>
  88 I first thought of decorating the individual words/tags in a
  89 sentence with their indices, and perhaps just duplicating the
  90 relevant rules (one for each index of the duplicate tags). But this
  91 gives an explosion in attachment rules (although a contained
  92 explosion, within the rules used in a sentence; but most sentences
  93 will have at least two NN's so it will be a problem).
  94 </li>
  95 <li>
  96 Then, I had a <i>brilliant</i> idea. Just let e(), the helper function of
  97 inner(), parametrize for an extra pair of boolean values for whether
  98 or not we've attached anything to the left or right yet ("yet"
  99 meaning "below"). So now, e() has a chart of the form [s, t, LHS,
 100 Lattach, Rattach], and of course e(s,t,LHS) is the sum of the four
 101 possible values for (Lattach,Rattach). This makes e() lots more
 102 complex and DMV-specific though, so it's been rewritten in
 103 inner_dmv() in dmv.py.
 104 </li>
 105 </ul>
 106 <div class="outline-4">
 107 <h4 id="sec-3">1.1.1 <span class="todo">TODO</span> document this adjacency stuff better</h4>
 108
 109 </div>
 110
 111 <div class="outline-4">
 112 <h4 id="sec-4">1.1.2 <span class="todo">TODO</span> test and debug my brilliant idea</h4>
 113
 114 </div>
 115
 116 <div class="outline-4">
 117 <h4 id="sec-5">1.1.3 <span class="done">DONE</span> implement my brilliant idea.</h4>
 118
 119 <p><span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-06-01 Sun 17:19</span><br/>
 120 <a href="src/dmv.py">e(sti) in dmv.py</a>
 121 </p>
 122 </div>
 123
 124 <div class="outline-4">
 125 <h4 id="sec-6">1.1.4 <span class="done">DONE</span> [#A] test inner() on sentences with duplicate words</h4>
 126
 127 <p>Works with eg. the sentence "h h h"
 128 </p>
 129
 130 </div>
 131 </div>
 132
 133 <div class="outline-3">
 134 <h3 id="sec-7">1.2 <span class="todo">TODO</span> [#A] P_STOP for IO/EM</h3>
 135
 136 <p><a href="src/dmv.py">dmv-P_STOP</a>
 137 Remember: The P<sub>STOP</sub> formula is upside-down (left-to-right also).
 138 (In the article..not the thesis)
 139 </p>
 140 <p>
 141 Remember: Initialization makes some "short-cut" rules, these will also
 142 have to be updated along with the other P<sub>STOP</sub> updates:
 143 </p><ul>
 144 <li>
 145 b[(NOBAR, n<sub>h</sub>), 'h'] = 1.0       # always
 146 </li>
 147 <li>
 148 b[(RBAR, n<sub>h</sub>), 'h'] = h_.probA  # h_ is RBAR stop rule
 149 </li>
 150 <li>
 151 b[(LRBAR, n<sub>h</sub>), 'h'] = h_.probA * _ h_.probA
 152
 153 </li>
 154 </ul>
 155 <div class="outline-4">
 156 <h4 id="sec-8">1.2.1 How is the P_STOP formula different given other values for dir and adj?</h4>
 157
 158 <p>(Presumably, the P<sub>STOP</sub> formula where STOP is True is just the
 159 rule-probability of _ h_ -&gt; STOP h_ or h_ -&gt; h STOP, but how does
 160 adjacency fit in here?)
 161 </p>
 162 <p>
 163 (And P<sub>STOP</sub>(-STOP|&hellip;) = 1 - P<sub>STOP</sub>(STOP|&hellip;) )
 164 </p></div>
 165 </div>
 166
 167 <div class="outline-3">
 168 <h3 id="sec-9">1.3 <span class="todo">TODO</span> P_CHOOSE for IO/EM</h3>
 169
 170 <p>Write the formulas! should be easy?
 171 </p></div>
 172
 173 <div class="outline-3">
 174 <h3 id="sec-10">1.4 Initialization   </h3>
 175
 176 <p><a href="/Users/kiwibird/Documents/Skole/V08/Probability/dmvccm/src/dmv.py">dmv-inits</a>
 177 </p>
 178 <p>
 179 We do have to go through the corpus, since the probabilities are based
 180 on how far away in the sentence arguments are from their heads.
 181 </p>
 182 <div class="outline-4">
 183 <h4 id="sec-11">1.4.1 <span class="todo">TODO</span> Separate initialization to another file?                     &nbsp;&nbsp;&nbsp;<span class="tag">PRETTIER</span></h4>
 184
 185 <p>(It's rather messy.)
 186 </p></div>
 187
 188 <div class="outline-4">
 189 <h4 id="sec-12">1.4.2 <span class="todo">TOGROK</span> CCM Initialization    </h4>
 190
 191 <p>P<sub>SPLIT</sub> used here&hellip; how, again?
 192 </p></div>
 193
 194 <div class="outline-4">
 195 <h4 id="sec-13">1.4.3 <span class="done">DONE</span> DMV Initialization probabilities</h4>
 196
 197 <p>(from initialization frequency)
 198 </p></div>
 199
 200 <div class="outline-4">
 201 <h4 id="sec-14">1.4.4 <span class="done">DONE</span> DMV Initialization frequencies    </h4>
 202
 203 <p><span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-05-27 Tue 20:04</span><br/>
 204 </p>
 205 <div class="outline-5">
 206 <h5 id="sec-15">1.4.4.1 P_STOP    </h5>
 207
 208 <p>P<sub>STOP</sub> is not well defined by K&amp;M. One possible interpretation given
 209 the sentence [det nn vb nn] is
 210 <pre>
 211  f_{STOP}( STOP|det, L, adj) +1
 212  f_{STOP}(-STOP|det, L, adj) +0
 213  f_{STOP}( STOP|det, L, non_adj) +1
 214  f_{STOP}(-STOP|det, L, non_adj) +0
 215  f_{STOP}( STOP|det, R, adj) +0
 216  f_{STOP}(-STOP|det, R, adj) +1
 217
 218  f_{STOP}( STOP|nn, L, adj) +0
 219  f_{STOP}(-STOP|nn, L, adj) +1
 220  f_{STOP}( STOP|nn, L, non_adj) +1  # since there's at least one to the left
 221  f_{STOP}(-STOP|nn, L, non_adj) +0
 222 </pre>
 223 </p><ul>
 224 <li><span class="todo">TODO</span> tweak<br/>
 225 <a name="pstoptweak">&nbsp;</a>
 226 <pre>
 227             f[head,  'STOP', 'LN'] += (i_h &lt;= 1)     # first two words
 228             f[head, '-STOP', 'LN'] += (not i_h &lt;= 1)
 229             f[head,  'STOP', 'LA'] += (i_h == 0)     # very first word
 230             f[head, '-STOP', 'LA'] += (not i_h == 0)
 231             f[head,  'STOP', 'RN'] += (i_h &gt;= n - 2) # last two words
 232             f[head, '-STOP', 'RN'] += (not i_h &gt;= n - 2)
 233             f[head,  'STOP', 'RA'] += (i_h == n - 1) # very last word
 234             f[head, '-STOP', 'RA'] += (not i_h == n - 1)
 235 </pre>
 236 vs
 237 <pre>
 238             f[head,  'STOP', 'LN'] += (i_h == 1)     # second word
 239             f[head, '-STOP', 'LN'] += (not i_h &lt;= 1) # not first two
 240             f[head,  'STOP', 'LA'] += (i_h == 0)     # first word
 241             f[head, '-STOP', 'LA'] += (not i_h == 0) # not first
 242             f[head,  'STOP', 'RN'] += (i_h == n - 2)     # second-to-last
 243             f[head, '-STOP', 'RN'] += (not i_h &gt;= n - 2) # not last two
 244             f[head,  'STOP', 'RA'] += (i_h == n - 1)     # last word
 245             f[head, '-STOP', 'RA'] += (not i_h == n - 1) # not last
 246 </pre>
 247 vs
 248 <pre>
 249             f[head,  'STOP', 'LN'] += (i_h == 1)     # second word
 250             f[head, '-STOP', 'LN'] += (not i_h == 1) # not second
 251             f[head,  'STOP', 'LA'] += (i_h == 0)     # first word
 252             f[head, '-STOP', 'LA'] += (not i_h == 0) # not first
 253             f[head,  'STOP', 'RN'] += (i_h == n - 2)     # second-to-last
 254             f[head, '-STOP', 'RN'] += (not i_h == n - 2) # not second-to-last
 255             f[head,  'STOP', 'RA'] += (i_h == n - 1)     # last word
 256             f[head, '-STOP', 'RA'] += (not i_h == n - 1) # not last
 257 </pre>
 258 vs
 259 "all words take the same number of arguments" interpreted as
 260 <pre>
 261 for all heads:
 262     p_STOP(head, 'STOP', 'LN') = 0.3
 263     p_STOP(head, 'STOP', 'LA') = 0.5
 264     p_STOP(head, 'STOP', 'RN') = 0.4
 265     p_STOP(head, 'STOP', 'RA') = 0.7
 266 </pre>
 267
 268 </li>
 269 </ul>
 270 </div>
 271
 272 <div class="outline-5">
 273 <h5 id="sec-16">1.4.4.2 P_CHOOSE</h5>
 274
 275 <p>Go through the corpus, counting distances between heads and
 276 arguments. In [det nn vb nn], we give
 277 </p><ul>
 278 <li>
 279 f<sub>CHOOSE</sub>(nn|det, R) +1/1 + C
 280 </li>
 281 <li>
 282 f<sub>CHOOSE</sub>(vb|det, R) +1/2 + C
 283 </li>
 284 <li>
 285 f<sub>CHOOSE</sub>(nn|det, R) +1/3 + C
 286 <ul>
 287 <li>
 288 If this were the full corpus, P<sub>CHOOSE</sub>(nn|det, R) would have
 289 (1+1/3+2C) / sum_a f<sub>CHOOSE</sub>(a|det, R)
 290
 291 </li>
 292 </ul></li>
 293 </ul>
 294 <p>The ROOT gets "each argument with equal probability", so in a sentence
 295 of three words, 1/3 for each (in [nn vb nn], 'nn' gets 2/3). Basically
 296 just a frequency count of the corpus&hellip;
 297 </p></div>
 298 </div>
 299 </div>
 300
 301 <div class="outline-3">
 302 <h3 id="sec-17">1.5 [#C] Deferred</h3>
 303
 304
 305 <div class="outline-4">
 306 <h4 id="sec-18">1.5.1 <span class="todo">TODO</span> when reestimating P_STOP etc, remove rules with p &lt; epsilon  &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span></h4>
 307
 308 </div>
 309
 310 <div class="outline-4">
 311 <h4 id="sec-19">1.5.2 <span class="todo">TODO</span> inner_dmv, short ranges and impossible attachment            &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span></h4>
 312
 313 <p>If s-t &lt;= 2, there can be only one attachment below, so don't recurse
 314 with both Lattach=True and Rattach=True.
 315 </p>
 316 <p>
 317 If s-t &lt;= 1, there can be no attachment below, so only recurse with
 318 Lattach=False, Rattach=False.
 319 </p>
 320 <p>
 321 Put this in the loop under rewrite rules (could also do it in the STOP
 322 section, but that would only have an effect on very short sentences).
 323 </p></div>
 324
 325 <div class="outline-4">
 326 <h4 id="sec-20">1.5.3 <span class="todo">TOGROK</span> Some (tagged) sentences are bound to come twice            &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span></h4>
 327
 328 <p>Eg, first sort and count, so that the corpus
 329 [['nn','vbd','det','nn'],
 330 ['vbd','nn','det','nn'],
 331 ['nn','vbd','det','nn']]
 332 becomes
 333 [(['nn','vbd','det','nn'],2),
 334 (['vbd','nn','det','nn'],1)]
 335 and then in each loop through sentences, make sure we handle the
 336 frequency correctly.
 337 </p>
 338 <p>
 339 Is there much to gain here?
 340 </p>
 341 </div>
 342
 343 <div class="outline-4">
 344 <h4 id="sec-21">1.5.4 <span class="todo">TOGROK</span> tags as numbers or tags as strings?                        &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span></h4>
 345
 346 <p>Need to clean up the representation.
 347 </p>
 348 <p>
 349 Stick with tag-strings in initialization then switch to numbers for
 350 IO-algorithm perhaps? Can probably afford more string-matching in
 351 initialization..
 352 </p></div>
 353 </div>
 354
 355 <div class="outline-3">
 356 <h3 id="sec-22">1.6 Expectation Maximation in IO/DMV-terms</h3>
 357
 358 <p>inner(s,t,LHS) calculates the expected number of trees headed by LHS
 359 from s to t (sentence positions). This uses the P_STOP and P_CHOOSE
 360 values, which have been conveniently distributed into CNF rules as
 361 probN and probA (non-adjacent and adjacent probabilites).
 362 </p>
 363 <p>
 364 When re-estimating, we use the expected values from inner() to get new
 365 values for P_STOP and P_CHOOSE. When we've re-estimated for the entire
 366 corpus, we distribute P_STOP and P_CHOOSE into the CNF rules again, so
 367 that in the next round we use new probN and probA to find
 368 inner-probabilites.
 369 </p>
 370 <p>
 371 The distribution of P_STOP and P_CHOOSE into CNF rules also happens in
 372 init_normalize() (here along with the creation of P_STOP and
 373 P_CHOOSE); P_STOP is used to create CNF rules where one branch of the
 374 rule is STOP, P_CHOOSE is used to create rules of the form
 375 <pre>
 376  h  -&gt; h  _a_
 377  h_ -&gt; h_ _a_
 378 </pre>
 379 </p>
 380 <p>
 381 Since "adjacency" is not captured in regular CNF rules, we need two
 382 probabilites for each rule, and inner() has to know when to use which.
 383 </p>
 384
 385 <div class="outline-4">
 386 <h4 id="sec-23">1.6.1 <span class="todo">TODO</span> Corpus access</h4>
 387
 388 </div>
 389
 390 <div class="outline-4">
 391 <h4 id="sec-24">1.6.2 <span class="todo">TOGROK</span> sentences or rules as the "outer loop"?                    &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span></h4>
 392
 393 <p>In regard to the E/M-step, finding P<sub>STOP</sub>, P<sub>CHOOSE</sub>.
 394 </p>
 395
 396 </div>
 397 </div>
 398 </div>
 399
 400 <div class="outline-2">
 401 <h2 id="sec-25">2 Python-stuff</h2>
 402
 403 <ul>
 404 <li>
 405 <a href="src/pseudo.py">pseudo.py</a>
 406 </li>
 407 <li>
 408 <a href="http://nltk.org/doc/en/structured-programming.html">http://nltk.org/doc/en/structured-programming.html</a> recursive dynamic
 409 </li>
 410 <li>
 411 <a href="http://nltk.org/doc/en/advanced-parsing.html">http://nltk.org/doc/en/advanced-parsing.html</a>
 412 </li>
 413 <li>
 414 <a href="http://jaynes.colorado.edu/PythonIdioms.html">http://jaynes.colorado.edu/PythonIdioms.html</a>
 415
 416
 417 </li>
 418 </ul>
 419 </div>
 420 <div id="postamble"><p class="author"> Author: Kevin Brubeck Unhammer
 421 <a href="mailto:K.BrubeckUnhammer at student uva nl ">&lt;K.BrubeckUnhammer at student uva nl &gt;</a>
 422 </p>
 423 <p class="date"> Date: 2008/06/03 14:04:08</p>
 424 </div><p class="postamble">Skrive vha. emacs + <a href='http://orgmode.org/'>org-mode</a></p></body>
 425 </html>