DMVCCM.org

   1 # -*- coding: mule-utf-8-unix -*-
   2
   3 #+STARTUP: overview
   4 #+TAGS: OPTIMIZE PRETTIER
   5 #+STARTUP: hidestars
   6 #+TITLE: DMV/CCM -- todo-list / progress
   7 #+AUTHOR: Kevin Brubeck Unhammer
   8 #+EMAIL: K.BrubeckUnhammer at student uva nl
   9 #+OPTIONS: H:4 toc:3 ^:{}
  10 #+LANGUAGE: en
  11 #+SEQ_TODO: TOGROK TODO DONE
  12
  13 * dmvccm report and project
  14   DEADLINE: <2008-06-30 Mon>
  15 But absolute, extended, really-quite-dead-now deadline: August
  16 31...
  17 - [[file:src/dmv.py][dmv.py]]
  18 - [[file:src/io.py][io.py]]
  19 - [[file:src/harmonic.py::harmonic%20py%20initialization%20for%20dmv][harmonic.py]]
  20 * TODO Adjacency and combining it with inner()
  21 Each DMV_Rule now has both a probN and a probA, for
  22 adjacencies. inner() needs the correct one in each case.
  23
  24 Adjacency gives a problem with duplicate words/tags, eg. in the
  25 sentence "a a b". If this has the dependency structure b->a_{0}->a_{1},
  26 then b is non-adjacent to a_{0} and should use probN (for the LRStop and
  27 the attachment of a_{0}), while the other rules should all use
  28 probA. But within the e(0,2,b) we can't just say "oh, a has index 0
  29 so it's not adjacent to 2", since there's also an a at index 1, and
  30 there's also a dependency structure b->a_{1}->a_{0} for that. We want
  31 both. And in possibly much more complex versions.
  32
  33 Ideas:
  34 - I first thought of decorating the individual words/tags in a
  35   sentence with their indices, and perhaps just duplicating the
  36   relevant rules (one for each index of the duplicate tags). But this
  37   gives an explosion in attachment rules (although a contained
  38   explosion, within the rules used in a sentence; but most sentences
  39   will have at least two NN's so it will be a problem).
  40 - Then, I had a /brilliant/ idea. Just let e(), the helper function of
  41   inner(), parametrize for an extra pair of boolean values for whether
  42   or not we've attached anything to the left or right yet ("yet"
  43   meaning "below"). So now, e() has a chart of the form [s, t, LHS,
  44   Lattach, Rattach], and of course e(s,t,LHS) is the sum of the four
  45   possible values for (Lattach,Rattach). This makes e() lots more
  46   complex and DMV-specific though, so it's been rewritten in
  47   inner_dmv() in dmv.py.
  48 - But, the problem is that we still need to be able get inner
  49   probabilites for only trees where the h at location loc_h is the
  50   root of the dependency. The previous idea does not give that. So I
  51   made /another/ version of inner() that parametrizes for loc_h in the
  52   chart (instead of Lattach and Rattach). loc_h also lets us check
  53   whether we need probN or probA, in a much simpler fashion than the
  54   Lattach / Rattach version.
  55 ** TODO document this adjacency stuff better
  56 ** TODO [#A] test and debug my brilliant idea
  57 *** Works on simple 'h h' sentence, but not with P_STOP, why?
  58 ** DONE implement my brilliant idea.
  59     CLOSED: [2008-06-01 Sun 17:19]
  60 [[file:src/dmv.py::def%20e%20s%20t%20LHS%20Lattach%20Rattach][e(sti) in dmv.py]]
  61
  62 ** DONE [#A] test inner() on sentences with duplicate words
  63 Works with eg. the sentence "h h h"
  64
  65
  66 * TODO [#A] P_STOP for IO/EM
  67 [[file:src/dmv.py::DMV%20probabilities][dmv-P_STOP]]
  68 Remember: The P_{STOP} formula is upside-down (left-to-right also).
  69 (In the article..not the thesis)
  70
  71 Remember: Initialization makes some "short-cut" rules, these will also
  72 have to be updated along with the other P_{STOP} updates:
  73 - b[(NOBAR, n_{h}), 'h'] = 1.0       # always
  74 - b[(RBAR, n_{h}), 'h'] = h_.probA  # h_ is RBAR stop rule
  75 - b[(LRBAR, n_{h}), 'h'] = h_.probA * _ h_.probA
  76
  77 ** How is the P_STOP formula different given other values for dir and adj?
  78 Assuming this:
  79 - P_{STOP}(STOP|h,L,non_adj) = \sum_{corpus} \sum_{s<loc(h)} \sum_{t}
  80   inner(s,t,(LRBAR,h)...) / \sum_{corpus} \sum_{s<loc(h)} \sum_{t} inner(s,t,(RBAR,h)...)
  81 - P_{STOP}(STOP|h,L,adj) = \sum_{corpus} \sum_{s=loc(h)} \sum_{t}
  82   inner(s,t,(LRBAR,h)...) / \sum_{corpus} \sum_{s=loc(h)} \sum_{t} inner(s,t,(RBAR,h)...)
  83 - P_{STOP}(STOP|h,R,non_adj) = \sum_{corpus} \sum_{s} \sum_{t>loc(h)}
  84   inner(s,t,(LRBAR,h)...) / \sum_{corpus} \sum_{s} \sum_{t>loc(h)} inner(s,t,(RBAR,h)...)
  85 - P_{STOP}(STOP|h,R,adj) = \sum_{corpus} \sum_{s} \sum_{t=loc(h)}
  86   inner(s,t,(LRBAR,h)...) / \sum_{corpus} \sum_{s} \sum_{t=loc(h)} inner(s,t,(RBAR,h)...)
  87
  88
  89
  90 (And P_{STOP}(-STOP|...) = 1 - P_{STOP}(STOP|...) )
  91 * TODO P_CHOOSE for IO/EM
  92 Write the formulas! should be easy?
  93 * Initialization
  94 [[file:~/Documents/Skole/V08/Probability/dmvccm/src/dmv.py::Initialization%20todo][dmv-inits]]
  95
  96 We do have to go through the corpus, since the probabilities are based
  97 on how far away in the sentence arguments are from their heads.
  98 ** TODO Separate initialization to another file?                      :PRETTIER:
  99 (It's rather messy.)
 100 ** TOGROK CCM Initialization
 101 P_{SPLIT} used here... how, again?
 102 ** DONE DMV Initialization probabilities
 103 (from initialization frequency)
 104 ** DONE DMV Initialization frequencies
 105    CLOSED: [2008-05-27 Tue 20:04]
 106 *** P_STOP
 107 P_{STOP} is not well defined by K&M. One possible interpretation given
 108 the sentence [det nn vb nn] is
 109 : f_{STOP}( STOP|det, L, adj) +1
 110 : f_{STOP}(-STOP|det, L, adj) +0
 111 : f_{STOP}( STOP|det, L, non_adj) +1
 112 : f_{STOP}(-STOP|det, L, non_adj) +0
 113 : f_{STOP}( STOP|det, R, adj) +0
 114 : f_{STOP}(-STOP|det, R, adj) +1
 115 :
 116 : f_{STOP}( STOP|nn, L, adj) +0
 117 : f_{STOP}(-STOP|nn, L, adj) +1
 118 : f_{STOP}( STOP|nn, L, non_adj) +1  # since there's at least one to the left
 119 : f_{STOP}(-STOP|nn, L, non_adj) +0
 120 **** TODO tweak
 121 # <<pstoptweak>>
 122 :            f[head,  'STOP', 'LN'] += (i_h <= 1)     # first two words
 123 :            f[head, '-STOP', 'LN'] += (not i_h <= 1)
 124 :            f[head,  'STOP', 'LA'] += (i_h == 0)     # very first word
 125 :            f[head, '-STOP', 'LA'] += (not i_h == 0)
 126 :            f[head,  'STOP', 'RN'] += (i_h >= n - 2) # last two words
 127 :            f[head, '-STOP', 'RN'] += (not i_h >= n - 2)
 128 :            f[head,  'STOP', 'RA'] += (i_h == n - 1) # very last word
 129 :            f[head, '-STOP', 'RA'] += (not i_h == n - 1)
 130 vs
 131 :            # this one requires some additional rewriting since it
 132 :            # introduces divisions by zero
 133 :            f[head,  'STOP', 'LN'] += (i_h == 1)     # second word
 134 :            f[head, '-STOP', 'LN'] += (not i_h <= 1) # not first two
 135 :            f[head,  'STOP', 'LA'] += (i_h == 0)     # first word
 136 :            f[head, '-STOP', 'LA'] += (not i_h == 0) # not first
 137 :            f[head,  'STOP', 'RN'] += (i_h == n - 2)     # second-to-last
 138 :            f[head, '-STOP', 'RN'] += (not i_h >= n - 2) # not last two
 139 :            f[head,  'STOP', 'RA'] += (i_h == n - 1)     # last word
 140 :            f[head, '-STOP', 'RA'] += (not i_h == n - 1) # not last
 141 vs
 142 :            f[head,  'STOP', 'LN'] += (i_h == 1)     # second word
 143 :            f[head, '-STOP', 'LN'] += (not i_h == 1) # not second
 144 :            f[head,  'STOP', 'LA'] += (i_h == 0)     # first word
 145 :            f[head, '-STOP', 'LA'] += (not i_h == 0) # not first
 146 :            f[head,  'STOP', 'RN'] += (i_h == n - 2)     # second-to-last
 147 :            f[head, '-STOP', 'RN'] += (not i_h == n - 2) # not second-to-last
 148 :            f[head,  'STOP', 'RA'] += (i_h == n - 1)     # last word
 149 :            f[head, '-STOP', 'RA'] += (not i_h == n - 1) # not last
 150 vs
 151 "all words take the same number of arguments" interpreted as
 152 :for all heads:
 153 :    p_STOP(head, 'STOP', 'LN') = 0.3
 154 :    p_STOP(head, 'STOP', 'LA') = 0.5
 155 :    p_STOP(head, 'STOP', 'RN') = 0.4
 156 :    p_STOP(head, 'STOP', 'RA') = 0.7
 157 (which we easily may tweak in init_zeros())
 158 *** P_CHOOSE
 159 Go through the corpus, counting distances between heads and
 160 arguments. In [det nn vb nn], we give
 161 - f_{CHOOSE}(nn|det, R) +1/1 + C
 162 - f_{CHOOSE}(vb|det, R) +1/2 + C
 163 - f_{CHOOSE}(nn|det, R) +1/3 + C
 164   - If this were the full corpus, P_{CHOOSE}(nn|det, R) would have
 165     (1+1/3+2C) / sum_a f_{CHOOSE}(a|det, R)
 166
 167 The ROOT gets "each argument with equal probability", so in a sentence
 168 of three words, 1/3 for each (in [nn vb nn], 'nn' gets 2/3). Basically
 169 just a frequency count of the corpus...
 170 * [#C] Deferred
 171 ** TODO inner_dmv() should disregard rules with heads not in sent     :OPTIMIZE:
 172 If the sentence is "nn vbd det nn", we should not even look at rules
 173 where
 174 : rule.head() not in "nn vbd det nn".split()
 175 This is ruled out by getting rules from g.rules(LHS, sent).
 176
 177 Also, we optimize this further by saying we don't even recurse into
 178 attachment rules where
 179 : rule.head() not in sent[ s :r+1]
 180 : rule.head() not in sent[r+1:t+1]
 181 meaning, if we're looking at the span "vbd det", we only use
 182 attachment rules where both daughters are members of ['vbd','det']
 183 (although we don't (yet) care about removing rules that rewrite to the
 184 same tag if there are no duplicate tags in the span, etc., that would
 185 be a lot of trouble for little potential gain).
 186 ** TODO when reestimating P_STOP etc, remove rules with p < epsilon   :OPTIMIZE:
 187 ** TODO inner_dmv, short ranges and impossible attachment             :OPTIMIZE:
 188 If s-t <= 2, there can be only one attachment below, so don't recurse
 189 with both Lattach=True and Rattach=True.
 190
 191 If s-t <= 1, there can be no attachment below, so only recurse with
 192 Lattach=False, Rattach=False.
 193
 194 Put this in the loop under rewrite rules (could also do it in the STOP
 195 section, but that would only have an effect on very short sentences).
 196 ** TODO clean up the module files                                     :PRETTIER:
 197 Is there better way to divide dmv and harmonic? There's a two-way
 198 dependency between the modules. Guess there could be a third file that
 199 imports both the initialization and the actual EM stuff, while a file
 200 containing constants and classes could be imported by all others:
 201 : dmv.py imports dmv_EM.py imports dmv_classes.py
 202 : dmv.py imports dmv_inits.py imports dmv_classes.py
 203
 204 ** TOGROK Some (tagged) sentences are bound to come twice             :OPTIMIZE:
 205 Eg, first sort and count, so that the corpus
 206 [['nn','vbd','det','nn'],
 207  ['vbd','nn','det','nn'],
 208  ['nn','vbd','det','nn']]
 209 becomes
 210 [(['nn','vbd','det','nn'],2),
 211  (['vbd','nn','det','nn'],1)]
 212 and then in each loop through sentences, make sure we handle the
 213 frequency correctly.
 214
 215 Is there much to gain here?
 216
 217 ** TOGROK tags as numbers or tags as strings?                         :OPTIMIZE:
 218 Need to clean up the representation.
 219
 220 Stick with tag-strings in initialization then switch to numbers for
 221 IO-algorithm perhaps? Can probably afford more string-matching in
 222 initialization..
 223 * Expectation Maximation in IO/DMV-terms
 224 inner(s,t,LHS) calculates the expected number of trees headed by LHS
 225 from s to t (sentence positions). This uses the P_STOP and P_CHOOSE
 226 values, which have been conveniently distributed into CNF rules as
 227 probN and probA (non-adjacent and adjacent probabilites).
 228
 229 When re-estimating, we use the expected values from inner() to get new
 230 values for P_STOP and P_CHOOSE. When we've re-estimated for the entire
 231 corpus, we distribute P_STOP and P_CHOOSE into the CNF rules again, so
 232 that in the next round we use new probN and probA to find
 233 inner-probabilites.
 234
 235 The distribution of P_STOP and P_CHOOSE into CNF rules also happens in
 236 init_normalize() (here along with the creation of P_STOP and
 237 P_CHOOSE); P_STOP is used to create CNF rules where one branch of the
 238 rule is STOP, P_CHOOSE is used to create rules of the form
 239 : h  -> h  _a_
 240 : h_ -> h_ _a_
 241
 242 Since "adjacency" is not captured in regular CNF rules, we need two
 243 probabilites for each rule, and inner() has to know when to use which.
 244
 245 ** TODO Corpus access
 246 ** TOGROK sentences or rules as the "outer loop"?                     :OPTIMIZE:
 247 In regard to the E/M-step, finding P_{STOP}, P_{CHOOSE}.
 248
 249
 250 * Python-stuff
 251 - [[file:src/pseudo.py][pseudo.py]]
 252 - http://nltk.org/doc/en/structured-programming.html recursive dynamic
 253 - http://nltk.org/doc/en/advanced-parsing.html
 254 - http://jaynes.colorado.edu/PythonIdioms.html
 255
 256
 257
 258 * Git
 259 Setting up a new project:
 260 : git init
 261 : git add .
 262 : git commit -m "first release"
 263
 264 Later on: (=-a= does =git rm= and =git add= automatically)
 265 : git init
 266 : git commit -a -m "some subsequent release"
 267
 268 Then push stuff up to the remote server:
 269 : git push git+ssh://username@repo.or.cz/srv/git/dmvccm.git master
 270
 271 (=eval `ssh-agent`= and =ssh-add= to avoid having to type in keyphrase all
 272 the time)
 273
 274 Make a copy of the (remote) master branch:
 275 : git clone git://repo.or.cz/dmvccm.git
 276
 277 Make and name a new branch in this folder
 278 : git checkout -b mybranch
 279
 280 To save changes in =mybranch=:
 281 : git commit -a
 282
 283 Go back to the master branch (uncommitted changes from =mybranch= are
 284 carried over):
 285 : git checkout master
 286
 287 Try out:
 288 : git add --interactive
 289
 290 Good tutorial:
 291 http://www-cs-students.stanford.edu/~blynn//gitmagic/