DMVCCM.org.~11~

   1 # -*- coding: mule-utf-8-unix -*-
   2 #+STARTUP: overview
   3 #+TAGS: OPTIMIZE PRETTIER
   4 #+STARTUP: hidestars
   5 #+TITLE: DMV/CCM
   6 #+AUTHOR: Kevin Brubeck Unhammer
   7 #+EMAIL: K.BrubeckUnhammer at student uva nl
   8 #+LANGUAGE: en
   9 #+SEQ_TODO: TOGROK TODO DONE
  10 #+OPTIONS: H:4 toc:3 ^:{}
  11
  12 * dmvccm
  13   DEADLINE: <2008-06-30 Mon>
  14 (But absolute, extended, really-quite-dead-now deadline: August 31...)
  15 [[file:src/dmv.py][dmv.py]]
  16 [[file:src/io.py][io.py]]
  17 ** TODO Adjacency and combining it with inner()
  18 Each DMV_Rule now has both a probN and a probA, for
  19 adjacencies. inner() needs the correct one in each case.
  20
  21 Adjacency gives a problem with duplicate words/tags, eg. in the
  22 sentence "a a b". If this has the dependency structure b->a_{0}->a_{1},
  23 then b is non-adjacent to a_{0} and should use probN (for the LRStop and
  24 the attachment of a_{0}), while the other rules should all use
  25 probA. But within the e(0,2,b) we can't just say "oh, a has index 0
  26 so it's not adjacent to 2", since there's also an a at index 1, and
  27 there's also a dependency structure b->a_{1}->a_{0} for that. We want
  28 both. And in possibly much more complex versions.
  29
  30 Ideas:
  31 - I first thought of decorating the individual words/tags in a
  32   sentence with their indices, and perhaps just duplicating the
  33   relevant rules (one for each index of the duplicate tags). But this
  34   gives an explosion in attachment rules (although a contained
  35   explosion, within the rules used in a sentence; but most sentences
  36   will have at least two NN's so it will be a problem).
  37 - Then, I had a /brilliant/ idea. Just let e(), the helper function of
  38   inner(), parametrize for an extra pair of boolean values for whether
  39   or not we've attached anything to the left or right yet ("yet"
  40   meaning "below"). So now, e() has a chart of the form [s, t, LHS,
  41   Lattach, Rattach], and of course e(s,t,LHS) is the sum of the four
  42   possible values for (Lattach,Rattach). This makes e() lots more
  43   complex and DMV-specific though, so it's been rewritten in
  44   inner_dmv() in dmv.py.
  45 *** TODO document this adjacency stuff better
  46 *** TODO test and debug my brilliant idea
  47 *** DONE implement my brilliant idea.
  48     CLOSED: [2008-06-01 Sun 17:19]
  49 [[file:src/dmv.py::def%20e%20s%20t%20LHS%20Lattach%20Rattach][e(sti) in dmv.py]]
  50
  51 *** DONE [#A] test inner() on sentences with duplicate words
  52 Works with eg. the sentence "h h h"
  53
  54
  55 ** TODO [#A] P_STOP for IO/EM
  56 [[file:src/dmv.py::DMV%20probabilities][dmv-P_STOP]]
  57 Remember: The P_{STOP} formula is upside-down (left-to-right also).
  58 (In the article..not the thesis)
  59
  60 Remember: Initialization makes some "short-cut" rules, these will also
  61 have to be updated along with the other P_{STOP} updates:
  62 - b[(NOBAR, n_{h}), 'h'] = 1.0       # always
  63 - b[(RBAR, n_{h}), 'h'] = h_.probA  # h_ is RBAR stop rule
  64 - b[(LRBAR, n_{h}), 'h'] = h_.probA * _ h_.probA
  65
  66 *** How is the P_STOP formula different given other values for dir and adj?
  67 (Presumably, the P_{STOP} formula where STOP is True is just the
  68 rule-probability of _ h_ -> STOP h_ or h_ -> h STOP, but how does
  69 adjacency fit in here?)
  70
  71 (And P_{STOP}(-STOP|...) = 1 - P_{STOP}(STOP|...) )
  72 ** TODO P_CHOOSE for IO/EM
  73 Write the formulas! should be easy?
  74 ** Initialization
  75 [[file:~/Documents/Skole/V08/Probability/dmvccm/src/dmv.py::Initialization%20todo][dmv-inits]]
  76
  77 We do have to go through the corpus, since the probabilities are based
  78 on how far away in the sentence arguments are from their heads.
  79 *** TODO Separate initialization to another file?                     :PRETTIER:
  80 (It's rather messy.)
  81 *** TOGROK CCM Initialization
  82 P_{SPLIT} used here... how, again?
  83 *** DONE DMV Initialization probabilities
  84 (from initialization frequency)
  85 *** DONE DMV Initialization frequencies
  86     CLOSED: [2008-05-27 Tue 20:04]
  87 **** P_STOP
  88 P_{STOP} is not well defined by K&M. One possible interpretation given
  89 the sentence [det nn vb nn] is
  90 : f_{STOP}( STOP|det, L, adj) +1
  91 : f_{STOP}(-STOP|det, L, adj) +0
  92 : f_{STOP}( STOP|det, L, non_adj) +1
  93 : f_{STOP}(-STOP|det, L, non_adj) +0
  94 : f_{STOP}( STOP|det, R, adj) +0
  95 : f_{STOP}(-STOP|det, R, adj) +1
  96 :
  97 : f_{STOP}( STOP|nn, L, adj) +0
  98 : f_{STOP}(-STOP|nn, L, adj) +1
  99 : f_{STOP}( STOP|nn, L, non_adj) +1  # since there's at least one to the left
 100 : f_{STOP}(-STOP|nn, L, non_adj) +0
 101 ***** TODO tweak
 102 # <<pstoptweak>>
 103 :            f[head,  'STOP', 'LN'] += (i_h <= 1)     # first two words
 104 :            f[head, '-STOP', 'LN'] += (not i_h <= 1)
 105 :            f[head,  'STOP', 'LA'] += (i_h == 0)     # very first word
 106 :            f[head, '-STOP', 'LA'] += (not i_h == 0)
 107 :            f[head,  'STOP', 'RN'] += (i_h >= n - 2) # last two words
 108 :            f[head, '-STOP', 'RN'] += (not i_h >= n - 2)
 109 :            f[head,  'STOP', 'RA'] += (i_h == n - 1) # very last word
 110 :            f[head, '-STOP', 'RA'] += (not i_h == n - 1)
 111 vs
 112 :            # this one requires some additional rewriting since it
 113 :            # introduces divisions by zero
 114 :            f[head,  'STOP', 'LN'] += (i_h == 1)     # second word
 115 :            f[head, '-STOP', 'LN'] += (not i_h <= 1) # not first two
 116 :            f[head,  'STOP', 'LA'] += (i_h == 0)     # first word
 117 :            f[head, '-STOP', 'LA'] += (not i_h == 0) # not first
 118 :            f[head,  'STOP', 'RN'] += (i_h == n - 2)     # second-to-last
 119 :            f[head, '-STOP', 'RN'] += (not i_h >= n - 2) # not last two
 120 :            f[head,  'STOP', 'RA'] += (i_h == n - 1)     # last word
 121 :            f[head, '-STOP', 'RA'] += (not i_h == n - 1) # not last
 122 vs
 123 :            f[head,  'STOP', 'LN'] += (i_h == 1)     # second word
 124 :            f[head, '-STOP', 'LN'] += (not i_h == 1) # not second
 125 :            f[head,  'STOP', 'LA'] += (i_h == 0)     # first word
 126 :            f[head, '-STOP', 'LA'] += (not i_h == 0) # not first
 127 :            f[head,  'STOP', 'RN'] += (i_h == n - 2)     # second-to-last
 128 :            f[head, '-STOP', 'RN'] += (not i_h == n - 2) # not second-to-last
 129 :            f[head,  'STOP', 'RA'] += (i_h == n - 1)     # last word
 130 :            f[head, '-STOP', 'RA'] += (not i_h == n - 1) # not last
 131 vs
 132 "all words take the same number of arguments" interpreted as
 133 :for all heads:
 134 :    p_STOP(head, 'STOP', 'LN') = 0.3
 135 :    p_STOP(head, 'STOP', 'LA') = 0.5
 136 :    p_STOP(head, 'STOP', 'RN') = 0.4
 137 :    p_STOP(head, 'STOP', 'RA') = 0.7
 138 (which we easily may tweak in init_zeros())
 139 **** P_CHOOSE
 140 Go through the corpus, counting distances between heads and
 141 arguments. In [det nn vb nn], we give
 142 - f_{CHOOSE}(nn|det, R) +1/1 + C
 143 - f_{CHOOSE}(vb|det, R) +1/2 + C
 144 - f_{CHOOSE}(nn|det, R) +1/3 + C
 145   - If this were the full corpus, P_{CHOOSE}(nn|det, R) would have
 146     (1+1/3+2C) / sum_a f_{CHOOSE}(a|det, R)
 147
 148 The ROOT gets "each argument with equal probability", so in a sentence
 149 of three words, 1/3 for each (in [nn vb nn], 'nn' gets 2/3). Basically
 150 just a frequency count of the corpus...
 151 ** [#C] Deferred
 152 *** TODO inner_dmv() should disregard rules with heads not in sent    :OPTIMIZE:
 153 If the sentence is "nn vbd det nn", we should not even look at rules
 154 where
 155 : rule.head() not in "nn vbd det nn".split()
 156 Also, we optimize this further by saying we don't even look at rules
 157 where
 158 : rule.head() not in sent[ s :r+1]
 159 : rule.head() not in sent[r+1:t+1]
 160 meaning, if we're looking at the span "vbd det", we only use
 161 attachment rules where both daughters are members of ['vbd','det']
 162 (although we don't (yet) care about removing rules that rewrite to the
 163 same tag if there are no duplicate tags in the span, etc., that would
 164 be a lot of trouble for little potential gain).
 165
 166 Since STOP rules necessarily come from attachment rules, this ensures
 167 we don't end up with pointless STOP rules either, and if we /started/
 168 with a pointless, we nevertheless stop at the first rewrite-round
 169 (although we could also make that check in the first call to e()).
 170 *** TODO when reestimating P_STOP etc, remove rules with p < epsilon  :OPTIMIZE:
 171 *** TODO inner_dmv, short ranges and impossible attachment            :OPTIMIZE:
 172 If s-t <= 2, there can be only one attachment below, so don't recurse
 173 with both Lattach=True and Rattach=True.
 174
 175 If s-t <= 1, there can be no attachment below, so only recurse with
 176 Lattach=False, Rattach=False.
 177
 178 Put this in the loop under rewrite rules (could also do it in the STOP
 179 section, but that would only have an effect on very short sentences).
 180 *** TOGROK Some (tagged) sentences are bound to come twice            :OPTIMIZE:
 181 Eg, first sort and count, so that the corpus
 182 [['nn','vbd','det','nn'],
 183  ['vbd','nn','det','nn'],
 184  ['nn','vbd','det','nn']]
 185 becomes
 186 [(['nn','vbd','det','nn'],2),
 187  (['vbd','nn','det','nn'],1)]
 188 and then in each loop through sentences, make sure we handle the
 189 frequency correctly.
 190
 191 Is there much to gain here?
 192
 193 *** TOGROK tags as numbers or tags as strings?                        :OPTIMIZE:
 194 Need to clean up the representation.
 195
 196 Stick with tag-strings in initialization then switch to numbers for
 197 IO-algorithm perhaps? Can probably afford more string-matching in
 198 initialization..
 199 ** Expectation Maximation in IO/DMV-terms
 200 inner(s,t,LHS) calculates the expected number of trees headed by LHS
 201 from s to t (sentence positions). This uses the P_STOP and P_CHOOSE
 202 values, which have been conveniently distributed into CNF rules as
 203 probN and probA (non-adjacent and adjacent probabilites).
 204
 205 When re-estimating, we use the expected values from inner() to get new
 206 values for P_STOP and P_CHOOSE. When we've re-estimated for the entire
 207 corpus, we distribute P_STOP and P_CHOOSE into the CNF rules again, so
 208 that in the next round we use new probN and probA to find
 209 inner-probabilites.
 210
 211 The distribution of P_STOP and P_CHOOSE into CNF rules also happens in
 212 init_normalize() (here along with the creation of P_STOP and
 213 P_CHOOSE); P_STOP is used to create CNF rules where one branch of the
 214 rule is STOP, P_CHOOSE is used to create rules of the form
 215 : h  -> h  _a_
 216 : h_ -> h_ _a_
 217
 218 Since "adjacency" is not captured in regular CNF rules, we need two
 219 probabilites for each rule, and inner() has to know when to use which.
 220
 221 *** TODO Corpus access
 222 *** TOGROK sentences or rules as the "outer loop"?                    :OPTIMIZE:
 223 In regard to the E/M-step, finding P_{STOP}, P_{CHOOSE}.
 224
 225
 226 * Python-stuff
 227 - [[file:src/pseudo.py][pseudo.py]]
 228 - http://nltk.org/doc/en/structured-programming.html recursive dynamic
 229 - http://nltk.org/doc/en/advanced-parsing.html
 230 - http://jaynes.colorado.edu/PythonIdioms.html
 231
 232