DMVCCM.html.~21~

   1 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
   2                "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
   3 <html xmlns="http://www.w3.org/1999/xhtml"
   4 lang="en" xml:lang="en">
   5 <head>
   6 <title>DMV/CCM</title>
   7 <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
   8 <meta name="generator" content="Org-mode"/>
   9 <meta name="generated" content="2008/06/04 14:23:23"/>
  10 <meta name="author" content="Kevin Brubeck Unhammer"/>
  11 <link rel="stylesheet" type="text/css" href="http://www.student.uib.no/~kun041/org.css"/>
  12 </head><body>
  13 <h1 class="title">DMV/CCM</h1>
  14 <div id="table-of-contents">
  15 <h2>Table of Contents</h2>
  16 <ul>
  17 <li><a href="#sec-1">1 dmvccm</a></li>
  18 <li><a href="#sec-2">2 Adjacency and combining it with inner()</a></li>
  19 <li><a href="#sec-3">3 [#A] P_STOP for IO/EM</a></li>
  20 <li><a href="#sec-4">4 P_CHOOSE for IO/EM</a></li>
  21 <li><a href="#sec-5">5 Initialization   </a></li>
  22 <li><a href="#sec-6">6 [#C] Deferred</a></li>
  23 <li><a href="#sec-7">7 Expectation Maximization in IO/DMV-terms</a></li>
  24 <li><a href="#sec-8">8 Python-stuff</a></li>
  25 </ul>
  26 </div>
  27
  28 <div class="outline-2">
  29 <h2 id="sec-1">1 dmvccm</h2>
  30
  31 <p><span class="timestamp-kwd">DEADLINE: </span> <span class="timestamp">2008-06-30 Mon</span><br/>
  32 (But absolute, extended, really-quite-dead-now deadline: August 31&hellip;)
  33 <a href="src/dmv.py">dmv.py</a>
  34 <a href="src/io.py">io.py</a>
  35 </p></div>
  36
  37 <div class="outline-2">
  38 <h2 id="sec-2">2 <span class="todo">TODO</span> Adjacency and combining it with inner()</h2>
  39
  40 <p>Each DMV_Rule now has both a probN (non-adjacent) and a probA
  41 (adjacent) probability. inner() needs the correct one in each case.
  42 </p>
  43 <p>
  44 Adjacency is a problem with duplicate words/tags, e.g. in the
  45 sentence "a a b". If this has the dependency structure b-&gt;a<sub>0</sub>-&gt;a<sub>1</sub>,
  46 then b is non-adjacent to a<sub>0</sub> and should use probN (for the LRStop and
  47 the attachment of a<sub>0</sub>), while the other rules should all use
  48 probA. But within e(0,2,b) we can't just say "oh, a has index 0
  49 so it's not adjacent to 2", since there's also an a at index 1, and
  50 there's also a dependency structure b-&gt;a<sub>1</sub>-&gt;a<sub>0</sub> for that. We want
  51 both, and possibly in much more complex configurations.
  52 </p>
  53 <p>
  54 Ideas:
  55 </p><ul>
  56 <li>
  57 I first thought of decorating the individual words/tags in a
  58 sentence with their indices, and perhaps just duplicating the
  59 relevant rules (one for each index of the duplicate tags). But this
  60 gives an explosion in attachment rules (although a contained
  61 explosion, within the rules used in a sentence; but most sentences
  62 will have at least two NN's so it will be a problem).
  63 </li>
  64 <li>
  65 Then, I had a <i>brilliant</i> idea. Just let e(), the helper function of
  66 inner(), take an extra pair of boolean parameters saying whether
  67 or not we've attached anything to the left or right yet ("yet"
  68 meaning "below"). So now, e() has a chart of the form [s, t, LHS,
  69 Lattach, Rattach], and of course e(s,t,LHS) is the sum over the four
  70 possible values of (Lattach,Rattach). This makes e() a lot more
  71 complex and DMV-specific though, so it's been rewritten in
  72 inner_dmv() in dmv.py.
  73 </li>
  74 <li><span class="todo">TODO</span> document this adjacency stuff better<br/>
  75 </li>
  76 <li><span class="todo">TODO</span> test and debug my brilliant idea<br/>
  77 </li>
  78 <li><span class="done">DONE</span> implement my brilliant idea.<br/>
  79 <span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-06-01 Sun 17:19</span><br/>
  80 <a href="src/dmv.py">e(sti) in dmv.py</a>
  81
  82 </li>
  83 <li><span class="done">DONE</span> [#A] test inner() on sentences with duplicate words<br/>
  84 Works with e.g. the sentence "h h h"
  85
  86
  87 </li>
  88 </ul>
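<p>
A minimal sketch of the chart idea above, not the actual code in dmv.py. The chart layout as a dict, the numbers and the helper names (e_total, pick_prob) are made up for illustration, and using probA only when nothing has been attached below on that side is my reading of the adjacency notes.
</p>

```python
from itertools import product


def e_total(chart, s, t, lhs):
    """e(s,t,LHS): sum of the four (Lattach, Rattach) chart entries."""
    return sum(chart.get((s, t, lhs, l_att, r_att), 0.0)
               for l_att, r_att in product((False, True), repeat=2))


def pick_prob(prob_n, prob_a, attached_below):
    """Assumption: an attachment is adjacent (probA) only if nothing has
    been attached on that side further down; otherwise use probN."""
    return prob_n if attached_below else prob_a


# Hypothetical chart entries for the span 0..2 headed by 'b'.
chart = {
    (0, 2, 'b', False, False): 0.0,
    (0, 2, 'b', True, False): 0.03,
    (0, 2, 'b', True, True): 0.01,
}
total = e_total(chart, 0, 2, 'b')  # the missing key counts as 0.0
```

<p>
Since duplicate tags are told apart by what has been attached below rather than by their index, no per-index copies of the rules are needed.
</p>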
  89 </div>
  90
  91 <div class="outline-2">
  92 <h2 id="sec-3">3 <span class="todo">TODO</span> [#A] P_STOP for IO/EM</h2>
  93
  94 <p><a href="src/dmv.py">dmv-P_STOP</a>
  95 Remember: The P<sub>STOP</sub> formula is upside-down (left-to-right also)
  96 in the article, but not in the thesis.
  97 </p>
  98 <p>
  99 Remember: Initialization makes some "short-cut" rules, these will also
 100 have to be updated along with the other P<sub>STOP</sub> updates:
 101 </p><ul>
 102 <li>
 103 b[(NOBAR, n<sub>h</sub>), 'h'] = 1.0       # always
 104 </li>
 105 <li>
 106 b[(RBAR, n<sub>h</sub>), 'h'] = h_.probA  # h_ is RBAR stop rule
 107 </li>
 108 <li>
 109 b[(LRBAR, n<sub>h</sub>), 'h'] = h_.probA * _h_.probA  # _h_ is LRBAR stop rule
 110
 111 </li>
 112 <li>How is the P_STOP formula different given other values for dir and adj?<br/>
 113 (Presumably, the P<sub>STOP</sub> formula where STOP is True is just the
 114 rule-probability of _h_ -&gt; STOP h_ or h_ -&gt; h STOP, but how does
 115 adjacency fit in here?)
 116
 117 <p>
 118 (And P<sub>STOP</sub>(-STOP|&hellip;) = 1 - P<sub>STOP</sub>(STOP|&hellip;) )
 119 </p></li>
 120 </ul>
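<p>
The short-cut rules above could be refreshed after each P<sub>STOP</sub> update along these lines. A sketch only: the numeric values of NOBAR, RBAR and LRBAR are placeholders (the real constants are in dmv.py), and update_shortcuts and p_not_stop are hypothetical helpers.
</p>

```python
# Placeholder bar levels; the real constants are defined in dmv.py.
NOBAR, RBAR, LRBAR = 0, 1, 2


def update_shortcuts(b, n_h, h_prob_a, lr_h_prob_a):
    """Refresh the three short-cut entries for the head numbered n_h.

    h_prob_a is probA of the RBAR stop rule (h_), lr_h_prob_a is probA
    of the LRBAR stop rule (_h_).
    """
    b[(NOBAR, n_h), 'h'] = 1.0                       # always
    b[(RBAR, n_h), 'h'] = h_prob_a
    b[(LRBAR, n_h), 'h'] = h_prob_a * lr_h_prob_a
    return b


def p_not_stop(p_stop):
    """P_STOP(-STOP|...) = 1 - P_STOP(STOP|...)."""
    return 1.0 - p_stop


b = update_shortcuts({}, 7, 0.5, 0.4)
```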
 121 </div>
 122
 123 <div class="outline-2">
 124 <h2 id="sec-4">4 <span class="todo">TODO</span> P_CHOOSE for IO/EM</h2>
 125
 126 <p>Write the formulas! Should be easy?
 127 </p></div>
 128
 129 <div class="outline-2">
 130 <h2 id="sec-5">5 Initialization   </h2>
 131
 132 <p><a href="src/dmv.py">dmv-inits</a>
 133 </p>
 134 <p>
 135 We do have to go through the corpus, since the probabilities are based
 136 on how far away in the sentence arguments are from their heads.
 137 </p><ul>
 138 <li><span class="todo">TODO</span> Separate initialization to another file?                      &nbsp;&nbsp;&nbsp;<span class="tag">PRETTIER</span><br/>
 139 (It's rather messy.)
 140 </li>
 141 <li><span class="todo">TOGROK</span> CCM Initialization    <br/>
 142 P<sub>SPLIT</sub> used here&hellip; how, again?
 143 </li>
 144 <li><span class="done">DONE</span> DMV Initialization probabilities<br/>
 145 (from initialization frequency)
 146 </li>
 147 <li><span class="done">DONE</span> DMV Initialization frequencies    <br/>
 148 <span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-05-27 Tue 20:04</span><br/>
 149 <ul>
 150 <li>P_STOP    <br/>
 151 P<sub>STOP</sub> is not well defined by K&amp;M. One possible interpretation given
 152 the sentence [det nn vb nn] is
 153 <pre>
 154  f_{STOP}( STOP|det, L, adj) +1
 155  f_{STOP}(-STOP|det, L, adj) +0
 156  f_{STOP}( STOP|det, L, non_adj) +1
 157  f_{STOP}(-STOP|det, L, non_adj) +0
 158  f_{STOP}( STOP|det, R, adj) +0
 159  f_{STOP}(-STOP|det, R, adj) +1
 160
 161  f_{STOP}( STOP|nn, L, adj) +0
 162  f_{STOP}(-STOP|nn, L, adj) +1
 163  f_{STOP}( STOP|nn, L, non_adj) +1  # since there's at least one to the left
 164  f_{STOP}(-STOP|nn, L, non_adj) +0
 165 </pre>
 166 <ul>
 167 <li><span class="todo">TODO</span> tweak<br/>
 168 <a name="pstoptweak">&nbsp;</a>
 169 <pre>
 170             f[head,  'STOP', 'LN'] += (i_h &lt;= 1)     # first two words
 171             f[head, '-STOP', 'LN'] += (not i_h &lt;= 1)
 172             f[head,  'STOP', 'LA'] += (i_h == 0)     # very first word
 173             f[head, '-STOP', 'LA'] += (not i_h == 0)
 174             f[head,  'STOP', 'RN'] += (i_h &gt;= n - 2) # last two words
 175             f[head, '-STOP', 'RN'] += (not i_h &gt;= n - 2)
 176             f[head,  'STOP', 'RA'] += (i_h == n - 1) # very last word
 177             f[head, '-STOP', 'RA'] += (not i_h == n - 1)
 178 </pre>
 179 vs
 180 <pre>
 181             # this one requires some additional rewriting since it
 182             # introduces divisions by zero
 183             f[head,  'STOP', 'LN'] += (i_h == 1)     # second word
 184             f[head, '-STOP', 'LN'] += (not i_h &lt;= 1) # not first two
 185             f[head,  'STOP', 'LA'] += (i_h == 0)     # first word
 186             f[head, '-STOP', 'LA'] += (not i_h == 0) # not first
 187             f[head,  'STOP', 'RN'] += (i_h == n - 2)     # second-to-last
 188             f[head, '-STOP', 'RN'] += (not i_h &gt;= n - 2) # not last two
 189             f[head,  'STOP', 'RA'] += (i_h == n - 1)     # last word
 190             f[head, '-STOP', 'RA'] += (not i_h == n - 1) # not last
 191 </pre>
 192 vs
 193 <pre>
 194             f[head,  'STOP', 'LN'] += (i_h == 1)     # second word
 195             f[head, '-STOP', 'LN'] += (not i_h == 1) # not second
 196             f[head,  'STOP', 'LA'] += (i_h == 0)     # first word
 197             f[head, '-STOP', 'LA'] += (not i_h == 0) # not first
 198             f[head,  'STOP', 'RN'] += (i_h == n - 2)     # second-to-last
 199             f[head, '-STOP', 'RN'] += (not i_h == n - 2) # not second-to-last
 200             f[head,  'STOP', 'RA'] += (i_h == n - 1)     # last word
 201             f[head, '-STOP', 'RA'] += (not i_h == n - 1) # not last
 202 </pre>
 203 vs
 204 "all words take the same number of arguments" interpreted as
 205 <pre>
 206 for all heads:
 207     p_STOP(head, 'STOP', 'LN') = 0.3
 208     p_STOP(head, 'STOP', 'LA') = 0.5
 209     p_STOP(head, 'STOP', 'RN') = 0.4
 210     p_STOP(head, 'STOP', 'RA') = 0.7
 211 </pre>
 212 (which we may easily tweak in init_zeros())
 213 </li>
 214 </ul>
 215 </li>
 216 <li>P_CHOOSE<br/>
 217 Go through the corpus, counting distances between heads and
 218 arguments. In [det nn vb nn], we give
 219 <ul>
 220 <li>
 221 f<sub>CHOOSE</sub>(nn|det, R) +1/1 + C
 222 </li>
 223 <li>
 224 f<sub>CHOOSE</sub>(vb|det, R) +1/2 + C
 225 </li>
 226 <li>
 227 f<sub>CHOOSE</sub>(nn|det, R) +1/3 + C
 228 <ul>
 229 <li>
 230 If this were the full corpus, P<sub>CHOOSE</sub>(nn|det, R) would be
 231 (1 + 1/3 + 2C) / sum_a f<sub>CHOOSE</sub>(a|det, R)
 232
 233 </li>
 234 </ul></li>
 235 </ul>
 236 <p>The ROOT gets "each argument with equal probability", so in a sentence
 237 of three words, 1/3 for each (in [nn vb nn], 'nn' gets 2/3). Basically
 238 just a frequency count of the corpus&hellip;
 239 </p></li>
 240 </ul>
 241 </li>
 242 </ul>
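<p>
The P<sub>CHOOSE</sub> counting above, as a small sketch. The function names, the dict keyed by (arg, head, dir) and the default value of C are assumptions for illustration; the real initialization is in dmv.py.
</p>

```python
from collections import defaultdict


def choose_freq(corpus, C=0.0):
    """f_CHOOSE[arg, head, dir]: add 1/distance + C per head-argument pair."""
    f = defaultdict(float)
    for sent in corpus:
        for i_h, head in enumerate(sent):
            for i_a in range(i_h):                  # arguments to the left
                f[sent[i_a], head, 'L'] += 1.0 / (i_h - i_a) + C
            for i_a in range(i_h + 1, len(sent)):   # arguments to the right
                f[sent[i_a], head, 'R'] += 1.0 / (i_a - i_h) + C
    return f


def choose_prob(f, arg, head, direction):
    """Normalize over all arguments of this head in this direction."""
    total = sum(v for (a, h, d), v in f.items()
                if h == head and d == direction)
    return f[arg, head, direction] / total


f = choose_freq([['det', 'nn', 'vb', 'nn']])
p = choose_prob(f, 'nn', 'det', 'R')  # (1 + 1/3) / (1 + 1/2 + 1/3) with C = 0
```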
 243 </div>
 244
 245 <div class="outline-2">
 246 <h2 id="sec-6">6 [#C] Deferred</h2>
 247
 248 <ul>
 249 <li><span class="todo">TODO</span> inner_dmv() should disregard rules with heads not in sent     &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
 250 If the sentence is "nn vbd det nn", we should not even look at rules
 251 where
 252 <pre>
 253  rule.head() not in "nn vbd det nn".split()
 254 </pre>
 255 This is ruled out by getting rules from g.rules(LHS, sent).
 256
 257 <p>
 258 Also, we optimize this further by saying we don't even recurse into
 259 attachment rules where
 260 <pre>
 261  rule.head() not in sent[ s :r+1]
 262  rule.head() not in sent[r+1:t+1]
 263 </pre>
 264 meaning, if we're looking at the span "vbd det", we only use
 265 attachment rules where both daughters are members of ['vbd','det']
 266 (although we don't (yet) care about removing rules that rewrite to the
 267 same tag if there are no duplicate tags in the span, etc., that would
 268 be a lot of trouble for little potential gain).
 269 </p></li>
 270 <li><span class="todo">TODO</span> when reestimating P_STOP etc, remove rules with p &lt; epsilon   &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
 271 </li>
 272 <li><span class="todo">TODO</span> inner_dmv, short ranges and impossible attachment             &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
 273 If t-s &lt;= 2, there can be only one attachment below, so don't recurse
 274 with both Lattach=True and Rattach=True.
 275
 276 <p>
 277 If t-s &lt;= 1, there can be no attachment below, so only recurse with
 278 Lattach=False, Rattach=False.
 279 </p>
 280 <p>
 281 Put this in the loop under rewrite rules (could also do it in the STOP
 282 section, but that would only have an effect on very short sentences).
 283 </p></li>
 284 <li><span class="todo">TODO</span> clean up the module files                                     &nbsp;&nbsp;&nbsp;<span class="tag">PRETTIER</span><br/>
 285 Is there a better way to divide dmv and harmonic? There's a two-way
 286 dependency between the modules. Guess there could be a third file that
 287 imports both the initialization and the actual EM stuff, while a file
 288 containing constants and classes could be imported by all others:
 289 <pre>
 290  dmv.py imports dmv_EM.py imports dmv_classes.py
 291  dmv.py imports dmv_inits.py imports dmv_classes.py
 292 </pre>
 293
 294 </li>
 295 <li><span class="todo">TOGROK</span> Some (tagged) sentences are bound to come twice             &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
 296 E.g., first sort and count, so that the corpus
 297 <pre>[['nn','vbd','det','nn'],
 298  ['vbd','nn','det','nn'],
 299  ['nn','vbd','det','nn']]</pre>
 300 becomes
 301 <pre>[(['nn','vbd','det','nn'],2),
 302  (['vbd','nn','det','nn'],1)]</pre>
 303 and then in each loop through sentences, make sure we handle the
 304 frequency correctly.
 305
 306 <p>
 307 Is there much to gain here?
 308 </p>
 309 </li>
 310 <li><span class="todo">TOGROK</span> tags as numbers or tags as strings?                         &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
 311 Need to clean up the representation.
 312
 313 <p>
 314 Stick with tag-strings in initialization, then switch to numbers for
 315 the IO-algorithm perhaps? Can probably afford more string-matching in
 316 initialization.
 317 </p></li>
 318 </ul>
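<p>
For the duplicate-sentence item, the counting could be done with a Counter instead of an explicit sort. A sketch, assuming sentences are lists of tag strings; count_sentences is a hypothetical name.
</p>

```python
from collections import Counter


def count_sentences(corpus):
    """Collapse duplicate tagged sentences into (sentence, frequency) pairs.

    Lists are not hashable, so each sentence is counted as a tuple.
    Pairs come out in order of first occurrence (Python 3.7+).
    """
    counts = Counter(tuple(sent) for sent in corpus)
    return [(list(sent), freq) for sent, freq in counts.items()]


corpus = [['nn', 'vbd', 'det', 'nn'],
          ['vbd', 'nn', 'det', 'nn'],
          ['nn', 'vbd', 'det', 'nn']]
counted = count_sentences(corpus)
```

<p>
In the loop through sentences, whatever is accumulated per sentence would then be multiplied by its frequency before being added to the totals.
</p>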
 319 </div>
 320
 321 <div class="outline-2">
 322 <h2 id="sec-7">7 Expectation Maximization in IO/DMV-terms</h2>
 323
 324 <p>inner(s,t,LHS) calculates the total probability of the trees headed by
 325 LHS from s to t (sentence positions). This uses the P_STOP and P_CHOOSE
 326 values, which have been conveniently distributed into CNF rules as
 327 probN and probA (non-adjacent and adjacent probabilities).
 328 </p>
 329 <p>
 330 When re-estimating, we use the expected values from inner() to get new
 331 values for P_STOP and P_CHOOSE. When we've re-estimated for the entire
 332 corpus, we distribute P_STOP and P_CHOOSE into the CNF rules again, so
 333 that in the next round we use new probN and probA to find
 334 inner-probabilities.
 335 </p>
 336 <p>
 337 The distribution of P_STOP and P_CHOOSE into CNF rules also happens in
 338 init_normalize() (here along with the creation of P_STOP and
 339 P_CHOOSE); P_STOP is used to create CNF rules where one branch of the
 340 rule is STOP, P_CHOOSE is used to create rules of the form
 341 <pre>
 342  h  -&gt; h  _a_
 343  h_ -&gt; h_ _a_
 344 </pre>
 345 </p>
 346 <p>
 347 Since "adjacency" is not captured in regular CNF rules, we need two
 348 probabilities for each rule, and inner() has to know when to use which.
 349 </p>
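<p>
A sketch of how P_STOP and P_CHOOSE might be folded into the two probabilities of one attachment rule. It assumes, as in K&amp;M's model, that taking an argument costs P_STOP(-STOP|h,dir,adj) * P_CHOOSE(a|h,dir). The dict layouts and attach_probs are hypothetical, and this is not necessarily what init_normalize() does.
</p>

```python
def attach_probs(p_stop, p_choose, head, arg, direction):
    """Return (probN, probA) for the rule that attaches arg to head.

    p_stop[head, direction, adj] holds P_STOP(STOP|head, direction, adj)
    with adj 'N' (non-adjacent) or 'A' (adjacent); p_choose is keyed by
    (arg, head, direction). Both layouts are assumptions.
    """
    p_arg = p_choose[arg, head, direction]
    prob_n = (1.0 - p_stop[head, direction, 'N']) * p_arg
    prob_a = (1.0 - p_stop[head, direction, 'A']) * p_arg
    return prob_n, prob_a


# Made-up values for one head and one argument.
p_stop = {('vbd', 'R', 'N'): 0.4, ('vbd', 'R', 'A'): 0.7}
p_choose = {('nn', 'vbd', 'R'): 0.5}
prob_n, prob_a = attach_probs(p_stop, p_choose, 'vbd', 'nn', 'R')
```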
 350 <ul>
 351 <li><span class="todo">TODO</span> Corpus access<br/>
 352 </li>
 353 <li><span class="todo">TOGROK</span> sentences or rules as the "outer loop"?                     &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
 354 In regard to the E/M-step, finding P<sub>STOP</sub>, P<sub>CHOOSE</sub>.
 355
 356
 357 </li>
 358 </ul>
 359 </div>
 360
 361 <div class="outline-2">
 362 <h2 id="sec-8">8 Python-stuff</h2>
 363
 364 <ul>
 365 <li>
 366 <a href="src/pseudo.py">pseudo.py</a>
 367 </li>
 368 <li>
 369 <a href="http://nltk.org/doc/en/structured-programming.html">http://nltk.org/doc/en/structured-programming.html</a> recursive dynamic
 370 </li>
 371 <li>
 372 <a href="http://nltk.org/doc/en/advanced-parsing.html">http://nltk.org/doc/en/advanced-parsing.html</a>
 373 </li>
 374 <li>
 375 <a href="http://jaynes.colorado.edu/PythonIdioms.html">http://jaynes.colorado.edu/PythonIdioms.html</a>
 376
 377
 378 </li>
 379 </ul>
 380 </div>
 381 <div id="postamble"><p class="author"> Author: Kevin Brubeck Unhammer
 382 <a href="mailto:K.BrubeckUnhammer at student uva nl ">&lt;K.BrubeckUnhammer at student uva nl &gt;</a>
 383 </p>
 384 <p class="date"> Date: 2008/06/04 14:23:23</p>
 385 </div><p class="postamble">Written using emacs + <a href='http://orgmode.org/'>org-mode</a></p></body>
 386 </html>