# -*- coding: mule-utf-8-unix -*-
#+OPTIONS: H:4 toc:3 ^:{}
#+STARTUP: overview
#+TAGS: OPTIMIZE PRETTIER
#+STARTUP: hidestars
#+TITLE: DMV/CCM
#+AUTHOR: Kevin Brubeck Unhammer
#+EMAIL: K.BrubeckUnhammer at student uva nl
#+LANGUAGE: en
#+SEQ_TODO: TOGROK TODO DONE


* dmvccm report and project
  DEADLINE: <2008-06-30 Mon>
But absolute, extended, really-quite-dead-now deadline: August 31...
- [[file:src/dmv.py][dmv.py]]
- [[file:src/io.py][io.py]]
- [[file:src/harmonic.py::harmonic%20py%20initialization%20for%20dmv][harmonic.py]]
* TODO Adjacency and combining it with inner()
Each DMV_Rule now has both a probN and a probA, the non-adjacent and
adjacent probabilities. inner() needs the correct one in each case.

Adjacency gives a problem with duplicate words/tags, e.g. in the
sentence "a a b". If this has the dependency structure b->a_{0}->a_{1},
then b is non-adjacent to a_{0} and should use probN (for the LRStop and
the attachment of a_{0}), while the other rules should all use
probA. But within e(0,2,b) we can't just say "oh, a has index 0
so it's not adjacent to 2", since there's also an a at index 1, and
there's also a dependency structure b->a_{1}->a_{0} for that. We want
both, and possibly in much more complex versions.

Ideas:
- I first thought of decorating the individual words/tags in a
  sentence with their indices, and perhaps just duplicating the
  relevant rules (one for each index of the duplicate tags). But this
  gives an explosion in attachment rules (although a contained
  explosion, within the rules used in a sentence; but most sentences
  will have at least two NN's so it will be a problem).
- Then, I had a /brilliant/ idea. Just let e(), the helper function of
  inner(), take an extra pair of boolean parameters saying whether
  or not we've attached anything to the left or right yet ("yet"
  meaning "below"). So now, e() has a chart of the form [s, t, LHS,
  Lattach, Rattach], and of course e(s,t,LHS) is the sum of the four
  possible values for (Lattach,Rattach). This makes e() a lot more
  complex and DMV-specific though, so it's been rewritten in
  inner_dmv() in dmv.py.
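
A minimal sketch of the chart idea, not the actual inner_dmv() code:
the chart is a dict keyed on (s, t, LHS, Lattach, Rattach), and
e_total() sums over the four flag combinations. The names e_total and
pick_prob and the numbers in the toy chart are made up for
illustration; pick_prob assumes that a side counts as non-adjacent as
soon as something has been attached on that side below.

```python
from itertools import product


def pick_prob(prob_n, prob_a, attached_below):
    """Use the non-adjacent probability once something is attached below."""
    return prob_n if attached_below else prob_a


def e_total(chart, s, t, lhs):
    """e(s,t,LHS): sum the chart over the four (Lattach, Rattach) values."""
    return sum(chart.get((s, t, lhs, lattach, rattach), 0.0)
               for lattach, rattach in product((False, True), repeat=2))


# Toy chart for the span 0..2 headed by 'b'; the values are invented.
chart = {
    (0, 2, 'b', False, False): 0.0,
    (0, 2, 'b', True, False): 0.03,
    (0, 2, 'b', False, True): 0.0,
    (0, 2, 'b', True, True): 0.0,
}
total = e_total(chart, 0, 2, 'b')
```

Keeping the flags in the chart key means both b->a_{0}->a_{1} and
b->a_{1}->a_{0} can contribute to the same span without decorating the
tags with indices.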
** TODO document this adjacency stuff better
** TODO test and debug my brilliant idea
** DONE implement my brilliant idea.
    CLOSED: [2008-06-01 Sun 17:19]
[[file:src/dmv.py::def%20e%20s%20t%20LHS%20Lattach%20Rattach][e(sti) in dmv.py]]

** DONE [#A] test inner() on sentences with duplicate words
Works with e.g. the sentence "h h h"


* TODO [#A] P_STOP for IO/EM
[[file:src/dmv.py::DMV%20probabilities][dmv-P_STOP]]
Remember: The P_{STOP} formula is upside-down (left-to-right also).
(In the article, not the thesis.)

Remember: Initialization makes some "short-cut" rules; these will also
have to be updated along with the other P_{STOP} updates:
- b[(NOBAR, n_{h}), 'h'] = 1.0       # always
- b[(RBAR, n_{h}), 'h'] = h_.probA  # h_ is RBAR stop rule
- b[(LRBAR, n_{h}), 'h'] = h_.probA * _ h_.probA
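
A sketch of how the short-cut entries could be recomputed after a
P_{STOP} update. The function name, its arguments and the values of the
bar-level constants are hypothetical; it assumes h_.probA is the
adjacent right-stop probability (h_ -> h STOP) and _ h_.probA the
adjacent left-stop probability (_ h_ -> STOP h_).

```python
NOBAR, RBAR, LRBAR = 0, 1, 2  # hypothetical values for the bar levels


def shortcut_rules(n_h, word, right_stop_adj, left_stop_adj):
    """Rebuild the three short-cut entries for one head."""
    b = {}
    b[(NOBAR, n_h), word] = 1.0                            # always
    b[(RBAR, n_h), word] = right_stop_adj                  # h_.probA
    b[(LRBAR, n_h), word] = right_stop_adj * left_stop_adj
    return b


# Invented probabilities, just to show the shape of the result.
b = shortcut_rules(n_h=0, word='h', right_stop_adj=0.5, left_stop_adj=0.25)
```

Calling something like this once per head after each re-estimation
round would keep the short-cuts in step with the stop rules.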

** How is the P_STOP formula different given other values for dir and adj?
(Presumably, the P_{STOP} formula where STOP is True is just the
rule-probability of _ h_ -> STOP h_ or h_ -> h STOP, but how does
adjacency fit in here?)

(And P_{STOP}(-STOP|...) = 1 - P_{STOP}(STOP|...) )
* TODO P_CHOOSE for IO/EM
Write the formulas! Should be easy?
* Initialization
[[file:~/Documents/Skole/V08/Probability/dmvccm/src/dmv.py::Initialization%20todo][dmv-inits]]

We do have to go through the corpus, since the probabilities are based
on how far away in the sentence arguments are from their heads.
** TODO Separate initialization to another file?                      :PRETTIER:
(It's rather messy.)
** TOGROK CCM Initialization
P_{SPLIT} used here... how, again?
** DONE DMV Initialization probabilities
(from initialization frequency)
** DONE DMV Initialization frequencies
   CLOSED: [2008-05-27 Tue 20:04]
*** P_STOP
P_{STOP} is not well defined by K&M. One possible interpretation given
the sentence [det nn vb nn] is
: f_{STOP}( STOP|det, L, adj) +1
: f_{STOP}(-STOP|det, L, adj) +0
: f_{STOP}( STOP|det, L, non_adj) +1
: f_{STOP}(-STOP|det, L, non_adj) +0
: f_{STOP}( STOP|det, R, adj) +0
: f_{STOP}(-STOP|det, R, adj) +1
:
: f_{STOP}( STOP|nn, L, adj) +0
: f_{STOP}(-STOP|nn, L, adj) +1
: f_{STOP}( STOP|nn, L, non_adj) +1  # since there's at least one to the left
: f_{STOP}(-STOP|nn, L, non_adj) +0
**** TODO tweak
# <<pstoptweak>>
:            f[head,  'STOP', 'LN'] += (i_h <= 1)     # first two words
:            f[head, '-STOP', 'LN'] += (not i_h <= 1)
:            f[head,  'STOP', 'LA'] += (i_h == 0)     # very first word
:            f[head, '-STOP', 'LA'] += (not i_h == 0)
:            f[head,  'STOP', 'RN'] += (i_h >= n - 2) # last two words
:            f[head, '-STOP', 'RN'] += (not i_h >= n - 2)
:            f[head,  'STOP', 'RA'] += (i_h == n - 1) # very last word
:            f[head, '-STOP', 'RA'] += (not i_h == n - 1)
vs
:            # this one requires some additional rewriting since it
:            # introduces divisions by zero
:            f[head,  'STOP', 'LN'] += (i_h == 1)     # second word
:            f[head, '-STOP', 'LN'] += (not i_h <= 1) # not first two
:            f[head,  'STOP', 'LA'] += (i_h == 0)     # first word
:            f[head, '-STOP', 'LA'] += (not i_h == 0) # not first
:            f[head,  'STOP', 'RN'] += (i_h == n - 2)     # second-to-last
:            f[head, '-STOP', 'RN'] += (not i_h >= n - 2) # not last two
:            f[head,  'STOP', 'RA'] += (i_h == n - 1)     # last word
:            f[head, '-STOP', 'RA'] += (not i_h == n - 1) # not last
vs
:            f[head,  'STOP', 'LN'] += (i_h == 1)     # second word
:            f[head, '-STOP', 'LN'] += (not i_h == 1) # not second
:            f[head,  'STOP', 'LA'] += (i_h == 0)     # first word
:            f[head, '-STOP', 'LA'] += (not i_h == 0) # not first
:            f[head,  'STOP', 'RN'] += (i_h == n - 2)     # second-to-last
:            f[head, '-STOP', 'RN'] += (not i_h == n - 2) # not second-to-last
:            f[head,  'STOP', 'RA'] += (i_h == n - 1)     # last word
:            f[head, '-STOP', 'RA'] += (not i_h == n - 1) # not last
vs
"all words take the same number of arguments" interpreted as
:for all heads:
:    p_STOP(head, 'STOP', 'LN') = 0.3
:    p_STOP(head, 'STOP', 'LA') = 0.5
:    p_STOP(head, 'STOP', 'RN') = 0.4
:    p_STOP(head, 'STOP', 'RA') = 0.7
(which we may easily tweak in init_zeros())
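
To see what the first variant does on a real sentence, here is a
self-contained sketch that wraps it in a loop over the corpus and turns
the frequencies into probabilities. The function names stop_counts and
p_stop are made up, and the normalization (STOP count over STOP plus
-STOP count) is an assumption about how the frequencies are meant to be
used.

```python
from collections import defaultdict


def stop_counts(corpus):
    """Count STOP / -STOP frequencies per head, first variant."""
    f = defaultdict(int)
    for sent in corpus:
        n = len(sent)
        for i_h, head in enumerate(sent):
            f[head,  'STOP', 'LN'] += (i_h <= 1)      # first two words
            f[head, '-STOP', 'LN'] += (not i_h <= 1)
            f[head,  'STOP', 'LA'] += (i_h == 0)      # very first word
            f[head, '-STOP', 'LA'] += (not i_h == 0)
            f[head,  'STOP', 'RN'] += (i_h >= n - 2)  # last two words
            f[head, '-STOP', 'RN'] += (not i_h >= n - 2)
            f[head,  'STOP', 'RA'] += (i_h == n - 1)  # very last word
            f[head, '-STOP', 'RA'] += (not i_h == n - 1)
    return f


def p_stop(f, head, diradj):
    """P_STOP(STOP | head, dir, adj) for a head seen in the corpus."""
    stop = f[head, 'STOP', diradj]
    total = stop + f[head, '-STOP', diradj]
    return stop / total


f = stop_counts([['det', 'nn', 'vb', 'nn']])
```

In this variant every occurrence of a head adds exactly one count to
either STOP or -STOP for each of LN, LA, RN and RA, so the denominator
is never zero for a head that occurs in the corpus. The second variant
does not have that property, which is where its divisions by zero come
from.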
*** P_CHOOSE
Go through the corpus, counting distances between heads and
arguments. In [det nn vb nn], we give
- f_{CHOOSE}(nn|det, R) +1/1 + C
- f_{CHOOSE}(vb|det, R) +1/2 + C
- f_{CHOOSE}(nn|det, R) +1/3 + C
  - If this were the full corpus, P_{CHOOSE}(nn|det, R) would be
    (1+1/3+2C) / sum_a f_{CHOOSE}(a|det, R)

The ROOT gets "each argument with equal probability", so in a sentence
of three words, 1/3 for each (in [nn vb nn], 'nn' gets 2/3). Basically
just a frequency count of the corpus...
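
A sketch of this counting scheme, assuming the increment for each
head/argument pair is 1/distance + C as in the list. The function names
are invented, and C defaults to 0 only because no value is fixed in
these notes.

```python
from collections import defaultdict


def choose_counts(corpus, C=0.0):
    """Harmonic f_CHOOSE counts plus the ROOT counts."""
    f = defaultdict(float)
    f_root = defaultdict(float)
    for sent in corpus:
        n = len(sent)
        for i_h, head in enumerate(sent):
            f_root[head] += 1.0 / n   # each word equally likely under ROOT
            for i_a, arg in enumerate(sent):
                if i_a == i_h:
                    continue
                direction = 'L' if i_a < i_h else 'R'
                f[arg, head, direction] += 1.0 / abs(i_h - i_a) + C
    return f, f_root


def p_choose(f, arg, head, direction):
    """Normalize over all arguments of this head in this direction."""
    total = sum(v for (a, h, d), v in f.items()
                if h == head and d == direction)
    return f.get((arg, head, direction), 0.0) / total


f, _ = choose_counts([['det', 'nn', 'vb', 'nn']])
_, f_root = choose_counts([['nn', 'vb', 'nn']])
```

With C = 0 and [det nn vb nn] as the whole corpus, f_{CHOOSE}(nn|det, R)
is 1 + 1/3 and the sum over arguments is 1 + 1/2 + 1/3, so
P_{CHOOSE}(nn|det, R) comes out as 8/11.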
* [#C] Deferred
** TODO inner_dmv() should disregard rules with heads not in sent     :OPTIMIZE:
If the sentence is "nn vbd det nn", we should not even look at rules
where
: rule.head() not in "nn vbd det nn".split()
This is ruled out by getting rules from g.rules(LHS, sent).

Also, we optimize this further by saying we don't even recurse into
attachment rules where
: rule.head() not in sent[ s :r+1]
: rule.head() not in sent[r+1:t+1]
meaning, if we're looking at the span "vbd det", we only use
attachment rules where both daughters are members of ['vbd','det']
(although we don't (yet) care about removing rules that rewrite to the
same tag if there are no duplicate tags in the span, etc.; that would
be a lot of trouble for little potential gain).
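
A sketch of the span filter. The Rule tuple here is a stand-in, not the
DMV_Rule class from dmv.py: it just records the head tags of the left
and right daughters, and the check assumes the left daughter has to be
found in sent[s:r+1] and the right daughter in sent[r+1:t+1].

```python
from collections import namedtuple

# Hypothetical stand-in for an attachment rule: LHS -> left right
Rule = namedtuple('Rule', 'lhs left_head right_head')


def usable_rules(rules, sent, s, r, t):
    """Keep rules whose daughters can occur on each side of the split r."""
    left_span = set(sent[s:r + 1])
    right_span = set(sent[r + 1:t + 1])
    return [rule for rule in rules
            if rule.left_head in left_span
            and rule.right_head in right_span]


sent = "nn vbd det nn".split()
rules = [Rule('vbd', 'vbd', 'det'),
         Rule('vbd', 'vbd', 'nn'),
         Rule('det', 'nn', 'det')]
# The span "vbd det" is s=1, t=2, with the only possible split at r=1.
kept = usable_rules(rules, sent, 1, 1, 2)
```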
** TODO when reestimating P_STOP etc, remove rules with p < epsilon   :OPTIMIZE:
** TODO inner_dmv, short ranges and impossible attachment             :OPTIMIZE:
If t-s <= 2, there can be only one attachment below, so don't recurse
with both Lattach=True and Rattach=True.

If t-s <= 1, there can be no attachment below, so only recurse with
Lattach=False, Rattach=False.

Put this in the loop under rewrite rules (could also do it in the STOP
section, but that would only have an effect on very short sentences).
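
A sketch of the pruning as a generator of the flag pairs worth
recursing into. It assumes s <= t are inclusive sentence positions, so
the span width is measured as t - s; the name attach_flags is made up.

```python
from itertools import product


def attach_flags(s, t):
    """Yield the (Lattach, Rattach) pairs that are possible below s..t."""
    width = t - s
    for lattach, rattach in product((False, True), repeat=2):
        if width <= 1 and (lattach or rattach):
            continue      # too short for any attachment below
        if width <= 2 and lattach and rattach:
            continue      # room for at most one attachment below
        yield lattach, rattach


short = list(attach_flags(0, 1))
medium = list(attach_flags(0, 2))
long_span = list(attach_flags(0, 3))
```

For longer spans all four pairs come back, so the pruning only pays off
near the leaves of the chart, but that is also where most of the cells
are.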
** TODO clean up the module files                                     :PRETTIER:
Is there a better way to divide dmv and harmonic? There's a two-way
dependency between the modules. Guess there could be a third file that
imports both the initialization and the actual EM stuff, while a file
containing constants and classes could be imported by all others:
: dmv.py imports dmv_EM.py imports dmv_classes.py
: dmv.py imports dmv_inits.py imports dmv_classes.py

** TOGROK Some (tagged) sentences are bound to come twice             :OPTIMIZE:
E.g., first sort and count, so that the corpus
: [['nn','vbd','det','nn'],
:  ['vbd','nn','det','nn'],
:  ['nn','vbd','det','nn']]
becomes
: [(['nn','vbd','det','nn'],2),
:  (['vbd','nn','det','nn'],1)]
and then in each loop through sentences, make sure we handle the
frequency correctly.

Is there much to gain here?
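
A sketch of the counting step with collections.Counter (the function
name is made up). Sentences are turned into tuples because lists can't
be dict keys.

```python
from collections import Counter


def count_sentences(corpus):
    """Collapse duplicate tagged sentences into (sentence, frequency)."""
    counts = Counter(tuple(sent) for sent in corpus)
    return [(list(sent), freq) for sent, freq in counts.most_common()]


corpus = [['nn', 'vbd', 'det', 'nn'],
          ['vbd', 'nn', 'det', 'nn'],
          ['nn', 'vbd', 'det', 'nn']]
counted = count_sentences(corpus)
```

Handling the frequency correctly would then mean multiplying whatever
a sentence contributes to the re-estimation by its freq, instead of
running the chart once per copy.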

** TOGROK tags as numbers or tags as strings?                         :OPTIMIZE:
Need to clean up the representation.

Stick with tag-strings in initialization, then perhaps switch to
numbers for the IO-algorithm? Can probably afford more string-matching
in initialization.
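
A sketch of what the switch could look like: number the tags in order
of first appearance and keep both lookup directions around. All names
here are invented.

```python
def number_tags(corpus):
    """Map tag strings to small integers, in order of first appearance."""
    tag_num = {}    # 'nn' -> 0
    num_tag = []    # 0 -> 'nn'
    for sent in corpus:
        for tag in sent:
            if tag not in tag_num:
                tag_num[tag] = len(num_tag)
                num_tag.append(tag)
    numbered = [[tag_num[tag] for tag in sent] for sent in corpus]
    return numbered, tag_num, num_tag


numbered, tag_num, num_tag = number_tags([['nn', 'vbd', 'det', 'nn']])
```

Initialization could keep working on the strings, and only the corpus
handed to the IO-algorithm would be the numbered one.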
* Expectation Maximization in IO/DMV-terms
inner(s,t,LHS) calculates the inner probability: the total probability
of all trees headed by LHS from s to t (sentence positions). This uses
the P_STOP and P_CHOOSE values, which have been conveniently
distributed into CNF rules as probN and probA (non-adjacent and
adjacent probabilities).

When re-estimating, we use the values from inner() to get new
values for P_STOP and P_CHOOSE. When we've re-estimated for the entire
corpus, we distribute P_STOP and P_CHOOSE into the CNF rules again, so
that in the next round we use new probN and probA to find
inner-probabilities.

The distribution of P_STOP and P_CHOOSE into CNF rules also happens in
init_normalize() (here along with the creation of P_STOP and
P_CHOOSE); P_STOP is used to create CNF rules where one branch of the
rule is STOP, P_CHOOSE is used to create rules of the form
: h  -> h  _a_
: h_ -> h_ _a_

Since "adjacency" is not captured in regular CNF rules, we need two
probabilities for each rule, and inner() has to know when to use which.
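
A sketch of the control flow only: collect per-sentence counts over the
whole corpus, normalize them, then distribute the result back into the
rules before the next round. The three callables are placeholders for
the real DMV computations (which are not written out in these notes),
and the toy versions below just do a relative-frequency count so the
loop can be run.

```python
def em(corpus, rules, expected_counts, normalize, distribute, rounds=1):
    """Run `rounds` re-estimation rounds over the whole corpus."""
    for _ in range(rounds):
        totals = {}
        for sent in corpus:
            # Per-sentence contribution, computed with the current rules.
            for key, value in expected_counts(rules, sent).items():
                totals[key] = totals.get(key, 0.0) + value
        # Only after the entire corpus: new probabilities, new rules.
        rules = distribute(normalize(totals), rules)
    return rules


# Toy placeholders, NOT the DMV formulas.
def toy_counts(rules, sent):
    counts = {}
    for tag in sent:
        counts[tag] = counts.get(tag, 0.0) + 1.0
    return counts


def toy_normalize(totals):
    z = sum(totals.values())
    return {key: value / z for key, value in totals.items()}


def toy_distribute(probs, rules):
    return dict(probs)


new_rules = em([['nn', 'vbd', 'det', 'nn']], {}, toy_counts,
               toy_normalize, toy_distribute)
```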

** TODO Corpus access
** TOGROK sentences or rules as the "outer loop"?                     :OPTIMIZE:
In regard to the E/M-step, finding P_{STOP}, P_{CHOOSE}.


* Python-stuff
- [[file:src/pseudo.py][pseudo.py]]
- http://nltk.org/doc/en/structured-programming.html recursive dynamic
- http://nltk.org/doc/en/advanced-parsing.html
- http://jaynes.colorado.edu/PythonIdioms.html