DMVCCM.html.~23~

   1 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
   2                "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
   3 <html xmlns="http://www.w3.org/1999/xhtml"
   4 lang="en" xml:lang="en">
   5 <head>
   6 <title>DMV/CCM &ndash; todo-list / progress</title>
   7 <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
   8 <meta name="generator" content="Org-mode"/>
   9 <meta name="generated" content="2008/06/04 14:58:40"/>
  10 <meta name="author" content="Kevin Brubeck Unhammer"/>
  11 <link rel="stylesheet" type="text/css" href="http://www.student.uib.no/~kun041/org.css">
  12 </head><body>
  13 <h1 class="title">DMV/CCM &ndash; todo-list / progress</h1>
  14 <div id="table-of-contents">
  15 <h2>Table of Contents</h2>
  16 <ul>
  17 <li><a href="#sec-1">1 dmvccm report and project</a></li>
  18 <li><a href="#sec-2">2 Adjacency and combining it with inner()</a></li>
  19 <li><a href="#sec-3">3 [#A] P_STOP for IO/EM</a></li>
  20 <li><a href="#sec-4">4 P_CHOOSE for IO/EM</a></li>
  21 <li><a href="#sec-5">5 Initialization   </a></li>
  22 <li><a href="#sec-6">6 [#C] Deferred</a></li>
  23 <li><a href="#sec-7">7 Expectation Maximation in IO/DMV-terms</a></li>
  24 <li><a href="#sec-8">8 Python-stuff</a></li>
  25 </ul>
  26 </div>
  27
  28 <div class="outline-2">
  29 <h2 id="sec-1">1 dmvccm report and project</h2>
  30
  31 <p><span class="timestamp-kwd">DEADLINE: </span> <span class="timestamp">2008-06-30 Mon</span><br/>
  32 But absolute, extended, really-quite-dead-now deadline: August
  33 31&hellip;
  34 </p><ul>
  35 <li>
  36 <a href="src/dmv.py">dmv.py</a>
  37 </li>
  38 <li>
  39 <a href="src/io.py">io.py</a>
  40 </li>
  41 <li>
  42 <a href="src/harmonic.py">harmonic.py</a>
  43 </li>
  44 </ul></div>
  45
  46 <div class="outline-2">
  47 <h2 id="sec-2">2 <span class="todo">TODO</span> Adjacency and combining it with inner()</h2>
  48
  49 <p>Each DMV_Rule now has both a probN and a probA, for
  50 adjacencies. inner() needs the correct one in each case.
  51 </p>
  52 <p>
  53 Adjacency gives a problem with duplicate words/tags, eg. in the
  54 sentence "a a b". If this has the dependency structure b-&gt;a<sub>0</sub>-&gt;a<sub>1</sub>,
  55 then b is non-adjacent to a<sub>0</sub> and should use probN (for the LRStop and
  56 the attachment of a<sub>0</sub>), while the other rules should all use
  57 probA. But within the e(0,2,b) we can't just say "oh, a has index 0
  58 so it's not adjacent to 2", since there's also an a at index 1, and
  59 there's also a dependency structure b-&gt;a<sub>1</sub>-&gt;a<sub>0</sub> for that. We want
  60 both. And in possibly much more complex versions.
  61 </p>
  62 <p>
  63 Ideas:
  64 </p><ul>
  65 <li>
  66 I first thought of decorating the individual words/tags in a
  67 sentence with their indices, and perhaps just duplicating the
  68 relevant rules (one for each index of the duplicate tags). But this
  69 gives an explosion in attachment rules (although a contained
  70 explosion, within the rules used in a sentence; but most sentences
  71 will have at least two NN's so it will be a problem).
  72 </li>
  73 <li>
  74 Then, I had a <i>brilliant</i> idea. Just let e(), the helper function of
  75 inner(), parametrize for an extra pair of boolean values for whether
  76 or not we've attached anything to the left or right yet ("yet"
  77 meaning "below"). So now, e() has a chart of the form [s, t, LHS,
  78 Lattach, Rattach], and of course e(s,t,LHS) is the sum of the four
  79 possible values for (Lattach,Rattach). This makes e() lots more
  80 complex and DMV-specific though, so it's been rewritten in
  81 inner_dmv() in dmv.py.
  82 </li>
  83 <li><span class="todo">TODO</span> document this adjacency stuff better<br/>
  84 </li>
  85 <li><span class="todo">TODO</span> test and debug my brilliant idea<br/>
  86 </li>
  87 <li><span class="done">DONE</span> implement my brilliant idea.<br/>
  88 <span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-06-01 Sun 17:19</span><br/>
  89 <a href="src/dmv.py">e(sti) in dmv.py</a>
  90
  91 </li>
  92 <li><span class="done">DONE</span> [#A] test inner() on sentences with duplicate words<br/>
  93 Works with eg. the sentence "h h h"
  94
  95
  96 </li>
  97 </ul>
  98 </div>
  99
 100 <div class="outline-2">
 101 <h2 id="sec-3">3 <span class="todo">TODO</span> [#A] P_STOP for IO/EM</h2>
 102
 103 <p><a href="src/dmv.py">dmv-P_STOP</a>
 104 Remember: The P<sub>STOP</sub> formula is upside-down (left-to-right also).
 105 (In the article..not the thesis)
 106 </p>
 107 <p>
 108 Remember: Initialization makes some "short-cut" rules, these will also
 109 have to be updated along with the other P<sub>STOP</sub> updates:
 110 </p><ul>
 111 <li>
 112 b[(NOBAR, n<sub>h</sub>), 'h'] = 1.0       # always
 113 </li>
 114 <li>
 115 b[(RBAR, n<sub>h</sub>), 'h'] = h_.probA  # h_ is RBAR stop rule
 116 </li>
 117 <li>
 118 b[(LRBAR, n<sub>h</sub>), 'h'] = h_.probA * _ h_.probA
 119
 120 </li>
 121 <li>How is the P_STOP formula different given other values for dir and adj?<br/>
 122 (Presumably, the P<sub>STOP</sub> formula where STOP is True is just the
 123 rule-probability of _ h_ -&gt; STOP h_ or h_ -&gt; h STOP, but how does
 124 adjacency fit in here?)
 125
 126 <p>
 127 (And P<sub>STOP</sub>(-STOP|&hellip;) = 1 - P<sub>STOP</sub>(STOP|&hellip;) )
 128 </p></li>
 129 </ul>
 130 </div>
 131
 132 <div class="outline-2">
 133 <h2 id="sec-4">4 <span class="todo">TODO</span> P_CHOOSE for IO/EM</h2>
 134
 135 <p>Write the formulas! should be easy?
 136 </p></div>
 137
 138 <div class="outline-2">
 139 <h2 id="sec-5">5 Initialization   </h2>
 140
 141 <p><a href="/Users/kiwibird/Documents/Skole/V08/Probability/dmvccm/src/dmv.py">dmv-inits</a>
 142 </p>
 143 <p>
 144 We do have to go through the corpus, since the probabilities are based
 145 on how far away in the sentence arguments are from their heads.
 146 </p><ul>
 147 <li><span class="todo">TODO</span> Separate initialization to another file?                      &nbsp;&nbsp;&nbsp;<span class="tag">PRETTIER</span><br/>
 148 (It's rather messy.)
 149 </li>
 150 <li><span class="todo">TOGROK</span> CCM Initialization    <br/>
 151 P<sub>SPLIT</sub> used here&hellip; how, again?
 152 </li>
 153 <li><span class="done">DONE</span> DMV Initialization probabilities<br/>
 154 (from initialization frequency)
 155 </li>
 156 <li><span class="done">DONE</span> DMV Initialization frequencies    <br/>
 157 <span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-05-27 Tue 20:04</span><br/>
 158 <ul>
 159 <li>P_STOP    <br/>
 160 P<sub>STOP</sub> is not well defined by K&amp;M. One possible interpretation given
 161 the sentence [det nn vb nn] is
 162 <pre>
 163  f_{STOP}( STOP|det, L, adj) +1
 164  f_{STOP}(-STOP|det, L, adj) +0
 165  f_{STOP}( STOP|det, L, non_adj) +1
 166  f_{STOP}(-STOP|det, L, non_adj) +0
 167  f_{STOP}( STOP|det, R, adj) +0
 168  f_{STOP}(-STOP|det, R, adj) +1
 169
 170  f_{STOP}( STOP|nn, L, adj) +0
 171  f_{STOP}(-STOP|nn, L, adj) +1
 172  f_{STOP}( STOP|nn, L, non_adj) +1  # since there's at least one to the left
 173  f_{STOP}(-STOP|nn, L, non_adj) +0
 174 </pre>
 175 <ul>
 176 <li><span class="todo">TODO</span> tweak<br/>
 177 <a name="pstoptweak">&nbsp;</a>
 178 <pre>
 179             f[head,  'STOP', 'LN'] += (i_h &lt;= 1)     # first two words
 180             f[head, '-STOP', 'LN'] += (not i_h &lt;= 1)
 181             f[head,  'STOP', 'LA'] += (i_h == 0)     # very first word
 182             f[head, '-STOP', 'LA'] += (not i_h == 0)
 183             f[head,  'STOP', 'RN'] += (i_h &gt;= n - 2) # last two words
 184             f[head, '-STOP', 'RN'] += (not i_h &gt;= n - 2)
 185             f[head,  'STOP', 'RA'] += (i_h == n - 1) # very last word
 186             f[head, '-STOP', 'RA'] += (not i_h == n - 1)
 187 </pre>
 188 vs
 189 <pre>
 190             # this one requires some additional rewriting since it
 191             # introduces divisions by zero
 192             f[head,  'STOP', 'LN'] += (i_h == 1)     # second word
 193             f[head, '-STOP', 'LN'] += (not i_h &lt;= 1) # not first two
 194             f[head,  'STOP', 'LA'] += (i_h == 0)     # first word
 195             f[head, '-STOP', 'LA'] += (not i_h == 0) # not first
 196             f[head,  'STOP', 'RN'] += (i_h == n - 2)     # second-to-last
 197             f[head, '-STOP', 'RN'] += (not i_h &gt;= n - 2) # not last two
 198             f[head,  'STOP', 'RA'] += (i_h == n - 1)     # last word
 199             f[head, '-STOP', 'RA'] += (not i_h == n - 1) # not last
 200 </pre>
 201 vs
 202 <pre>
 203             f[head,  'STOP', 'LN'] += (i_h == 1)     # second word
 204             f[head, '-STOP', 'LN'] += (not i_h == 1) # not second
 205             f[head,  'STOP', 'LA'] += (i_h == 0)     # first word
 206             f[head, '-STOP', 'LA'] += (not i_h == 0) # not first
 207             f[head,  'STOP', 'RN'] += (i_h == n - 2)     # second-to-last
 208             f[head, '-STOP', 'RN'] += (not i_h == n - 2) # not second-to-last
 209             f[head,  'STOP', 'RA'] += (i_h == n - 1)     # last word
 210             f[head, '-STOP', 'RA'] += (not i_h == n - 1) # not last
 211 </pre>
 212 vs
 213 "all words take the same number of arguments" interpreted as
 214 <pre>
 215 for all heads:
 216     p_STOP(head, 'STOP', 'LN') = 0.3
 217     p_STOP(head, 'STOP', 'LA') = 0.5
 218     p_STOP(head, 'STOP', 'RN') = 0.4
 219     p_STOP(head, 'STOP', 'RA') = 0.7
 220 </pre>
 221 (which we easily may tweak in init_zeros())
 222 </li>
 223 </ul>
 224 </li>
 225 <li>P_CHOOSE<br/>
 226 Go through the corpus, counting distances between heads and
 227 arguments. In [det nn vb nn], we give
 228 <ul>
 229 <li>
 230 f<sub>CHOOSE</sub>(nn|det, R) +1/1 + C
 231 </li>
 232 <li>
 233 f<sub>CHOOSE</sub>(vb|det, R) +1/2 + C
 234 </li>
 235 <li>
 236 f<sub>CHOOSE</sub>(nn|det, R) +1/3 + C
 237 <ul>
 238 <li>
 239 If this were the full corpus, P<sub>CHOOSE</sub>(nn|det, R) would have
 240 (1+1/3+2C) / sum_a f<sub>CHOOSE</sub>(a|det, R)
 241
 242 </li>
 243 </ul></li>
 244 </ul>
 245 <p>The ROOT gets "each argument with equal probability", so in a sentence
 246 of three words, 1/3 for each (in [nn vb nn], 'nn' gets 2/3). Basically
 247 just a frequency count of the corpus&hellip;
 248 </p></li>
 249 </ul>
 250 </li>
 251 </ul>
 252 </div>
 253
 254 <div class="outline-2">
 255 <h2 id="sec-6">6 [#C] Deferred</h2>
 256
 257 <ul>
 258 <li><span class="todo">TODO</span> inner_dmv() should disregard rules with heads not in sent     &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
 259 If the sentence is "nn vbd det nn", we should not even look at rules
 260 where
 261 <pre>
 262  rule.head() not in "nn vbd det nn".split()
 263 </pre>
 264 This is ruled out by getting rules from g.rules(LHS, sent).
 265
 266 <p>
 267 Also, we optimize this further by saying we don't even recurse into
 268 attachment rules where
 269 <pre>
 270  rule.head() not in sent[ s :r+1]
 271  rule.head() not in sent[r+1:t+1]
 272 </pre>
 273 meaning, if we're looking at the span "vbd det", we only use
 274 attachment rules where both daughters are members of ['vbd','det']
 275 (although we don't (yet) care about removing rules that rewrite to the
 276 same tag if there are no duplicate tags in the span, etc., that would
 277 be a lot of trouble for little potential gain).
 278 </p></li>
 279 <li><span class="todo">TODO</span> when reestimating P_STOP etc, remove rules with p &lt; epsilon   &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
 280 </li>
 281 <li><span class="todo">TODO</span> inner_dmv, short ranges and impossible attachment             &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
 282 If s-t &lt;= 2, there can be only one attachment below, so don't recurse
 283 with both Lattach=True and Rattach=True.
 284
 285 <p>
 286 If s-t &lt;= 1, there can be no attachment below, so only recurse with
 287 Lattach=False, Rattach=False.
 288 </p>
 289 <p>
 290 Put this in the loop under rewrite rules (could also do it in the STOP
 291 section, but that would only have an effect on very short sentences).
 292 </p></li>
 293 <li><span class="todo">TODO</span> clean up the module files                                     &nbsp;&nbsp;&nbsp;<span class="tag">PRETTIER</span><br/>
 294 Is there better way to divide dmv and harmonic? There's a two-way
 295 dependency between the modules. Guess there could be a third file that
 296 imports both the initialization and the actual EM stuff, while a file
 297 containing constants and classes could be imported by all others:
 298 <pre>
 299  dmv.py imports dmv_EM.py imports dmv_classes.py
 300  dmv.py imports dmv_inits.py imports dmv_classes.py
 301 </pre>
 302
 303 </li>
 304 <li><span class="todo">TOGROK</span> Some (tagged) sentences are bound to come twice             &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
 305 Eg, first sort and count, so that the corpus
 306 [['nn','vbd','det','nn'],
 307 ['vbd','nn','det','nn'],
 308 ['nn','vbd','det','nn']]
 309 becomes
 310 [(['nn','vbd','det','nn'],2),
 311 (['vbd','nn','det','nn'],1)]
 312 and then in each loop through sentences, make sure we handle the
 313 frequency correctly.
 314
 315 <p>
 316 Is there much to gain here?
 317 </p>
 318 </li>
 319 <li><span class="todo">TOGROK</span> tags as numbers or tags as strings?                         &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
 320 Need to clean up the representation.
 321
 322 <p>
 323 Stick with tag-strings in initialization then switch to numbers for
 324 IO-algorithm perhaps? Can probably afford more string-matching in
 325 initialization..
 326 </p></li>
 327 </ul>
 328 </div>
 329
 330 <div class="outline-2">
 331 <h2 id="sec-7">7 Expectation Maximation in IO/DMV-terms</h2>
 332
 333 <p>inner(s,t,LHS) calculates the expected number of trees headed by LHS
 334 from s to t (sentence positions). This uses the P_STOP and P_CHOOSE
 335 values, which have been conveniently distributed into CNF rules as
 336 probN and probA (non-adjacent and adjacent probabilites).
 337 </p>
 338 <p>
 339 When re-estimating, we use the expected values from inner() to get new
 340 values for P_STOP and P_CHOOSE. When we've re-estimated for the entire
 341 corpus, we distribute P_STOP and P_CHOOSE into the CNF rules again, so
 342 that in the next round we use new probN and probA to find
 343 inner-probabilites.
 344 </p>
 345 <p>
 346 The distribution of P_STOP and P_CHOOSE into CNF rules also happens in
 347 init_normalize() (here along with the creation of P_STOP and
 348 P_CHOOSE); P_STOP is used to create CNF rules where one branch of the
 349 rule is STOP, P_CHOOSE is used to create rules of the form
 350 <pre>
 351  h  -&gt; h  _a_
 352  h_ -&gt; h_ _a_
 353 </pre>
 354 </p>
 355 <p>
 356 Since "adjacency" is not captured in regular CNF rules, we need two
 357 probabilites for each rule, and inner() has to know when to use which.
 358 </p>
 359 <ul>
 360 <li><span class="todo">TODO</span> Corpus access<br/>
 361 </li>
 362 <li><span class="todo">TOGROK</span> sentences or rules as the "outer loop"?                     &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span><br/>
 363 In regard to the E/M-step, finding P<sub>STOP</sub>, P<sub>CHOOSE</sub>.
 364
 365
 366 </li>
 367 </ul>
 368 </div>
 369
 370 <div class="outline-2">
 371 <h2 id="sec-8">8 Python-stuff</h2>
 372
 373 <ul>
 374 <li>
 375 <a href="src/pseudo.py">pseudo.py</a>
 376 </li>
 377 <li>
 378 <a href="http://nltk.org/doc/en/structured-programming.html">http://nltk.org/doc/en/structured-programming.html</a> recursive dynamic
 379 </li>
 380 <li>
 381 <a href="http://nltk.org/doc/en/advanced-parsing.html">http://nltk.org/doc/en/advanced-parsing.html</a>
 382 </li>
 383 <li>
 384 <a href="http://jaynes.colorado.edu/PythonIdioms.html">http://jaynes.colorado.edu/PythonIdioms.html</a>
 385
 386
 387 </li>
 388 </ul>
 389 </div>
 390 <div id="postamble"><p class="author"> Author: Kevin Brubeck Unhammer
 391 <a href="mailto:K.BrubeckUnhammer at student uva nl ">&lt;K.BrubeckUnhammer at student uva nl &gt;</a>
 392 </p>
 393 <p class="date"> Date: 2008/06/04 14:58:40</p>
 394 </div><p class="postamble">Skrive vha. emacs + <a href='http://orgmode.org/'>org-mode</a></p></body>
 395 </html>