DMVCCM.org

   1 # -*- coding: mule-utf-8-unix -*-
   2
   3 #+STARTUP: overview
   4 #+TAGS: OPTIMIZE PRETTIER
   5 #+STARTUP: hidestars
   6 #+TITLE: DMV/CCM -- todo-list / progress
   7 #+AUTHOR: Kevin Brubeck Unhammer
   8 #+EMAIL: K.BrubeckUnhammer at student uva nl
   9 #+OPTIONS: H:4 toc:3 ^:{}
  10 #+LANGUAGE: en
  11 #+SEQ_TODO: TOGROK TODO DONE
  12
  13 * dmvccm report and project
  14   DEADLINE: <2008-06-30 Mon>
  15 But absolute, extended, really-quite-dead-now deadline: August
  16 31...
  17 - [[file:src/dmv.py][dmv.py]]
  18 - [[file:src/io.py][io.py]]
  19 - [[file:src/harmonic.py::harmonic%20py%20initialization%20for%20dmv][harmonic.py]]
  20
  21
  22 * TODO [#A] P_STOP and P_CHOOSE for IO/EM
  23 [[file:src/dmv.py::DMV%20probabilities][dmv-P_STOP]]
  24 Remember: The P_{STOP} formula is upside-down (left-to-right also).
  25 (In the article..not the thesis)
  26
  27 Remember: Initialization makes some "short-cut" rules, these will also
  28 have to be updated along with the other P_{STOP} updates:
  29 - b[(NOBAR, n_{h}), 'h'] = 1.0       # always
  30 - b[(RBAR, n_{h}), 'h'] = h_.probA  # h_ is RBAR stop rule
  31 - b[(LRBAR, n_{h}), 'h'] = h_.probA * _ h_.probA
  32
  33 ** P_STOP formulas for various dir and adj:
  34 Assuming this:
  35 - P_{STOP}(STOP|h,L,non_adj) = \sum_{corpus} \sum_{s<loc(h)} \sum_{t}
  36   inner(s,t,_h_, ...) / \sum_{corpus} \sum_{s<loc(h)} \sum_{t}
  37   inner(s,t,h_, ...)
  38 - P_{STOP}(STOP|h,L,adj) = \sum_{corpus} \sum_{s=loc(h)} \sum_{t}
  39   inner(s,t,_h_, ...) / \sum_{corpus} \sum_{s=loc(h)} \sum_{t}
  40   inner(s,t,h_, ...)
  41 - P_{STOP}(STOP|h,R,non_adj) = \sum_{corpus} \sum_{s} \sum_{t>loc(h)}
  42   inner(s,t,_h_, ...) / \sum_{corpus} \sum_{s} \sum_{t>loc(h)}
  43   inner(s,t,h_, ...)
  44 - P_{STOP}(STOP|h,R,adj) = \sum_{corpus} \sum_{s} \sum_{t=loc(h)}
  45   inner(s,t,_h_, ...) / \sum_{corpus} \sum_{s} \sum_{t=loc(h)}
  46   inner(s,t,h_, ...)
  47
  48 (And P_{STOP}(-STOP|...) = 1 - P_{STOP}(STOP|...) )
  49 ** P_CHOOSE formula:
  50 Assuming this:
  51 - P_{CHOOSE}(a|h,L) = \sum_{corpus} \sum_{s<loc(h)} \sum_{t=loc(h)} \sum_{r<t}
  52   inner(s,r,_a_, ...) / \sum_{corpus} \sum_{s<loc(h)} \sum_{t=loc(h)}
  53   inner(s,t,h_,...)
  54   - or t >= loc(h)?
  55 - P_{CHOOSE}(a|h,R) = \sum_{corpus} \sum_{s=loc(h)} \sum_{r>s}
  56   inner(s,r,_a_,...) / \sum_{corpus} \sum_{s=loc(h)} \sum_{t}
  57   inner(s,t,h, ...)
  58   - or s => loc(h)?
  59
  60
  61 * TOGROK Find Most Probable Parse of given test sentence
  62 One way of finding the MPP is to construct all complete trees, so
  63 could this be used with P_STOP to make sure we only get counts from
  64 complete trees?
  65
  66 (But then, there are probably MPP algorithms that avoid computing _all_
  67 complete trees for a sentence...)
  68 * TOGROK Combine the models
  69 * TODO Get a dependency parsed version of WSJ to test with
  70 * Initialization
  71 [[file:~/Documents/Skole/V08/Probability/dmvccm/src/dmv.py::Initialization%20todo][dmv-inits]]
  72
  73 We do have to go through the corpus, since the probabilities are based
  74 on how far away in the sentence arguments are from their heads.
  75 ** TOGROK CCM Initialization
  76 P_{SPLIT} used here... how, again?
  77 ** DONE Separate initialization to another file?                      :PRETTIER:
  78    CLOSED: [2008-06-08 Sun 12:51]
  79 [[file:src/harmonic.py::harmonic%20py%20initialization%20for%20dmv][harmonic.py]]
  80 ** DONE DMV Initialization probabilities
  81 (from initialization frequency)
  82 ** DONE DMV Initialization frequencies
  83    CLOSED: [2008-05-27 Tue 20:04]
  84 *** P_STOP
  85 P_{STOP} is not well defined by K&M. One possible interpretation given
  86 the sentence [det nn vb nn] is
  87 : f_{STOP}( STOP|det, L, adj) +1
  88 : f_{STOP}(-STOP|det, L, adj) +0
  89 : f_{STOP}( STOP|det, L, non_adj) +1
  90 : f_{STOP}(-STOP|det, L, non_adj) +0
  91 : f_{STOP}( STOP|det, R, adj) +0
  92 : f_{STOP}(-STOP|det, R, adj) +1
  93 :
  94 : f_{STOP}( STOP|nn, L, adj) +0
  95 : f_{STOP}(-STOP|nn, L, adj) +1
  96 : f_{STOP}( STOP|nn, L, non_adj) +1  # since there's at least one to the left
  97 : f_{STOP}(-STOP|nn, L, non_adj) +0
  98 **** TODO tweak
  99 # <<pstoptweak>>
 100 :            f[head,  'STOP', 'LN'] += (i_h <= 1)     # first two words
 101 :            f[head, '-STOP', 'LN'] += (not i_h <= 1)
 102 :            f[head,  'STOP', 'LA'] += (i_h == 0)     # very first word
 103 :            f[head, '-STOP', 'LA'] += (not i_h == 0)
 104 :            f[head,  'STOP', 'RN'] += (i_h >= n - 2) # last two words
 105 :            f[head, '-STOP', 'RN'] += (not i_h >= n - 2)
 106 :            f[head,  'STOP', 'RA'] += (i_h == n - 1) # very last word
 107 :            f[head, '-STOP', 'RA'] += (not i_h == n - 1)
 108 vs
 109 :            # this one requires some additional rewriting since it
 110 :            # introduces divisions by zero
 111 :            f[head,  'STOP', 'LN'] += (i_h == 1)     # second word
 112 :            f[head, '-STOP', 'LN'] += (not i_h <= 1) # not first two
 113 :            f[head,  'STOP', 'LA'] += (i_h == 0)     # first word
 114 :            f[head, '-STOP', 'LA'] += (not i_h == 0) # not first
 115 :            f[head,  'STOP', 'RN'] += (i_h == n - 2)     # second-to-last
 116 :            f[head, '-STOP', 'RN'] += (not i_h >= n - 2) # not last two
 117 :            f[head,  'STOP', 'RA'] += (i_h == n - 1)     # last word
 118 :            f[head, '-STOP', 'RA'] += (not i_h == n - 1) # not last
 119 vs
 120 :            f[head,  'STOP', 'LN'] += (i_h == 1)     # second word
 121 :            f[head, '-STOP', 'LN'] += (not i_h == 1) # not second
 122 :            f[head,  'STOP', 'LA'] += (i_h == 0)     # first word
 123 :            f[head, '-STOP', 'LA'] += (not i_h == 0) # not first
 124 :            f[head,  'STOP', 'RN'] += (i_h == n - 2)     # second-to-last
 125 :            f[head, '-STOP', 'RN'] += (not i_h == n - 2) # not second-to-last
 126 :            f[head,  'STOP', 'RA'] += (i_h == n - 1)     # last word
 127 :            f[head, '-STOP', 'RA'] += (not i_h == n - 1) # not last
 128 vs
 129 "all words take the same number of arguments" interpreted as
 130 :for all heads:
 131 :    p_STOP(head, 'STOP', 'LN') = 0.3
 132 :    p_STOP(head, 'STOP', 'LA') = 0.5
 133 :    p_STOP(head, 'STOP', 'RN') = 0.4
 134 :    p_STOP(head, 'STOP', 'RA') = 0.7
 135 (which we easily may tweak in init_zeros())
 136 *** P_CHOOSE
 137 Go through the corpus, counting distances between heads and
 138 arguments. In [det nn vb nn], we give
 139 - f_{CHOOSE}(nn|det, R) +1/1 + C
 140 - f_{CHOOSE}(vb|det, R) +1/2 + C
 141 - f_{CHOOSE}(nn|det, R) +1/3 + C
 142   - If this were the full corpus, P_{CHOOSE}(nn|det, R) would have
 143     (1+1/3+2C) / sum_a f_{CHOOSE}(a|det, R)
 144
 145 The ROOT gets "each argument with equal probability", so in a sentence
 146 of three words, 1/3 for each (in [nn vb nn], 'nn' gets 2/3). Basically
 147 just a frequency count of the corpus...
 148 * [#C] Deferred
 149 ** TODO if loc_h == t, no need to try right-attachment rules          :OPTIMIZE:
 150 ** TODO if loc_h == s, no need to try left-attachment rules           :OPTIMIZE:
 151 ** TODO when reestimating P_STOP etc, remove rules with p < epsilon   :OPTIMIZE:
 152 ** TODO inner_dmv, short ranges and impossible attachment             :OPTIMIZE:
 153 If s-t <= 2, there can be only one attachment below, so don't recurse
 154 with both Lattach=True and Rattach=True.
 155
 156 If s-t <= 1, there can be no attachment below, so only recurse with
 157 Lattach=False, Rattach=False.
 158
 159 Put this in the loop under rewrite rules (could also do it in the STOP
 160 section, but that would only have an effect on very short sentences).
 161 ** TODO clean up the module files                                     :PRETTIER:
 162 Is there better way to divide dmv and harmonic? There's a two-way
 163 dependency between the modules. Guess there could be a third file that
 164 imports both the initialization and the actual EM stuff, while a file
 165 containing constants and classes could be imported by all others:
 166 : dmv.py imports dmv_EM.py imports dmv_classes.py
 167 : dmv.py imports dmv_inits.py imports dmv_classes.py
 168
 169 ** TOGROK Some (tagged) sentences are bound to come twice             :OPTIMIZE:
 170 Eg, first sort and count, so that the corpus
 171 [['nn','vbd','det','nn'],
 172  ['vbd','nn','det','nn'],
 173  ['nn','vbd','det','nn']]
 174 becomes
 175 [(['nn','vbd','det','nn'],2),
 176  (['vbd','nn','det','nn'],1)]
 177 and then in each loop through sentences, make sure we handle the
 178 frequency correctly.
 179
 180 Is there much to gain here?
 181
 182 ** TOGROK tags as numbers or tags as strings?                         :OPTIMIZE:
 183 Need to clean up the representation.
 184
 185 Stick with tag-strings in initialization then switch to numbers for
 186 IO-algorithm perhaps? Can probably afford more string-matching in
 187 initialization..
 188 ** DONE inner_dmv() should disregard rules with heads not in sent     :OPTIMIZE:
 189    CLOSED: [2008-06-08 Sun 10:18]
 190 If the sentence is "nn vbd det nn", we should not even look at rules
 191 where
 192 : rule.head() not in "nn vbd det nn".split()
 193 This is ruled out by getting rules from g.rules(LHS, sent).
 194
 195 Also, we optimize this further by saying we don't even recurse into
 196 attachment rules where
 197 : rule.head() not in sent[ s :r+1]
 198 : rule.head() not in sent[r+1:t+1]
 199 meaning, if we're looking at the span "vbd det", we only use
 200 attachment rules where both daughters are members of ['vbd','det']
 201 (although we don't (yet) care about removing rules that rewrite to the
 202 same tag if there are no duplicate tags in the span, etc., that would
 203 be a lot of trouble for little potential gain).
 204 * Adjacency and combining it with inner()
 205 Each DMV_Rule now has both a probN and a probA, for
 206 adjacencies. inner() needs the correct one in each case.
 207
 208 Adjacency gives a problem with duplicate words/tags, eg. in the
 209 sentence "a a b". If this has the dependency structure b->a_{0}->a_{1},
 210 then b is non-adjacent to a_{0} and should use probN (for the LRStop and
 211 the attachment of a_{0}), while the other rules should all use
 212 probA. But within the e(0,2,b) we can't just say "oh, a has index 0
 213 so it's not adjacent to 2", since there's also an a at index 1, and
 214 there's also a dependency structure b->a_{1}->a_{0} for that. We want
 215 both. And in possibly much more complex versions.
 216
 217 So now we call inner() for each location of a head, and on each
 218 terminal, loc_h must equal s (and t). In the recursive attachment
 219 calls, we use the locations (sentence indices) of words to the left or
 220 right of the head in calls to inner(). loc_h lets us check whether we
 221 need probN or probA.
 222
 223 In each inner() call, loc_h is the location of the root of this
 224 dependency structure.
 225 ** Possible alternate type of adjacency
 226 K&M's adjacency is just whether or not an argument has been generated
 227 in the current direction yet. One could also make a stronger type of
 228 adjacency, where h and a are not adjacent if b is in between, eg. with
 229 the sentence "a b h" and the structure ((h->a), (a->b)), h is
 230 K&M-adjacent to a, but not next to a, since b is in between. It's easy
 231 to check this type of adjacency in inner(), but it needs new rules for
 232 P_STOP reestimation.
 233 * Expectation Maximation in IO/DMV-terms
 234 inner(s,t,LHS) calculates the expected number of trees headed by LHS
 235 from s to t (sentence positions). This uses the P_STOP and P_CHOOSE
 236 values, which have been conveniently distributed into CNF rules as
 237 probN and probA (non-adjacent and adjacent probabilites).
 238
 239 When re-estimating, we use the expected values from inner() to get new
 240 values for P_STOP and P_CHOOSE. When we've re-estimated for the entire
 241 corpus, we distribute P_STOP and P_CHOOSE into the CNF rules again, so
 242 that in the next round we use new probN and probA to find
 243 inner-probabilites.
 244
 245 The distribution of P_STOP and P_CHOOSE into CNF rules also happens in
 246 init_normalize() (here along with the creation of P_STOP and
 247 P_CHOOSE); P_STOP is used to create CNF rules where one branch of the
 248 rule is STOP, P_CHOOSE is used to create rules of the form
 249 : h  -> h  _a_
 250 : h_ -> h_ _a_
 251
 252 Since "adjacency" is not captured in regular CNF rules, we need two
 253 probabilites for each rule, and inner() has to know when to use which.
 254
 255 ** TODO Corpus access
 256 ** TOGROK sentences or rules as the "outer loop"?                     :OPTIMIZE:
 257 In regard to the E/M-step, finding P_{STOP}, P_{CHOOSE}.
 258
 259
 260 * Python-stuff
 261 - [[file:src/pseudo.py][pseudo.py]]
 262 - http://nltk.org/doc/en/structured-programming.html recursive dynamic
 263 - http://nltk.org/doc/en/advanced-parsing.html
 264 - http://jaynes.colorado.edu/PythonIdioms.html
 265
 266
 267
 268 * Git
 269 Setting up a new project:
 270 : git init
 271 : git add .
 272 : git commit -m "first release"
 273
 274 Later on: (=-a= does =git rm= and =git add= automatically)
 275 : git init
 276 : git commit -a -m "some subsequent release"
 277
 278 Then push stuff up to the remote server:
 279 : git push git+ssh://username@repo.or.cz/srv/git/dmvccm.git master
 280
 281 (=eval `ssh-agent`= and =ssh-add= to avoid having to type in keyphrase all
 282 the time)
 283
 284 Make a copy of the (remote) master branch:
 285 : git clone git://repo.or.cz/dmvccm.git
 286
 287 Make and name a new branch in this folder
 288 : git checkout -b mybranch
 289
 290 To save changes in =mybranch=:
 291 : git commit -a
 292
 293 Go back to the master branch (uncommitted changes from =mybranch= are
 294 carried over):
 295 : git checkout master
 296
 297 Try out:
 298 : git add --interactive
 299
 300 Good tutorial:
 301 http://www-cs-students.stanford.edu/~blynn//gitmagic/