DMVCCM.org

   1 # -*- coding: mule-utf-8-unix -*-
   2
   3 #+STARTUP: overview
   4 #+TAGS: OPTIMIZE PRETTIER
   5 #+STARTUP: hidestars
   6 #+TITLE: DMV/CCM -- todo-list / progress
   7 #+AUTHOR: Kevin Brubeck Unhammer
   8 #+EMAIL: K.BrubeckUnhammer at student uva nl
   9 #+OPTIONS: ^:{}  skip:t
  10 #+LANGUAGE: en
  11 #+SEQ_TODO: TOGROK TODO DONE
  12
  13 Meta-todo:
  14 - move as much as possible into [[file:src/common_dmv.py][common_dmv.py]]
  15   - and improve classes (at /least/ give them different names)
  16 - [[file:tex/formulas.tex][tex]] re-estimation formulas for dmv2cnf version
  17 - [[file:tex/formulas.tex][tex]] dmv2cnf IO formulas
  18   - draw some trees first? Or, they should be the same as L&Y apart
  19     from also having the STOP rules
  20 - fix reestimation for right attachment in [[file:src/loc_h-dmv.py::right%20attachment%20TODO%20try%20with%20p%20e%20e%20f%20instead%20of%20c%20for%20numerator][loc_h-dmv.py]] (divide by sum
  21   of \hat{a})
  22 - complete programming the dmv2cnf versions of dmv.py and harmonic.py
  23
  24
  25 :for rules where rule.LHS == x:
  26 :    if i+1 == j and (rule.LHS == \GOR{w} or rule.LHS == \GOL{w}):
  27 :        # terminals (P_{ORDER})
  28 :
  29 :    if rule.L == STOP or rule.R == STOP:
  30 :        if rule.L == STOP: ...adj(i,...)...P_{INSIDE}(rule.R, i,j)
  31 :        if rule.R == STOP: ...adj(j,...)...P_{INSIDE}(rule.L, i,j)
  32 :
  33 :    elif j > i:
  34 :        for k where i<k<j:
  35 :
  36
  37
  38 * dmvccm report and project
  39   DEADLINE: <2008-06-30 Mon>
  40 - DMV-[[http://www.student.uib.no/~kun041/dmvccm/tex/formulas.pdf][formulas.pdf]]
  41 - [[file:src/dmv.py][dmv.py]]
  42 - [[file:src/io.py][io.py]]
  43 - [[file:src/harmonic.py::harmonic%20py%20initialization%20for%20dmv][harmonic.py]]
  44
  45 (But absolute, extended, really-quite-dead-now deadline: August 31...)
  46
  47 [[http://www.student.uib.no/~kun041/dmvccm/DMVCCM_archive.html][Archived entries]] from this file.
  48 * P_STOP and P_CHOOSE for IO/EM (reestimation)
  49 [[file:src/dmv.py::DMV%20probabilities][dmv-P_STOP]]
  50 Remember: The P_{STOP} formula is upside-down (left-to-right also).
  51 (In the article..not the [[http://www.eecs.berkeley.edu/~klein/papers/klein_thesis.pdf][thesis]])
  52
  53 ** TODO L&Y formula (20) or c()-formula?
  54 For P_CHOOSE, use this formula to get a[h_,_a_,h_]:
  55 | w_sent = 1/P_sent * \sum_{r} prob(h_->_a_ h_) * e(s,r,_a_) * e(r+1,t, h_) * f(s,t,h_) |
  56 | /                                                                                     |
  57 | v_sent = 1/P_sent * e(s,t,h_) * f(s,t,h_) = c(s,t,h_)                                 |
  58
  59 then divide a[h_,_a_,h_] by sum of a[h_,_x_,h_] for all x. \\
  60 Similarly, P_CHOOSE(a|h,R) = a[h,h,_a_] / \sum_{x} a[h,h,_x_].
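
The normalisation step just described can be sketched roughly as follows. This assumes the expected counts a[...] have already been summed over the corpus into a dict; the function name =normalize_choose= and the (head, direction, arg) key layout are made up for the sketch and are not taken from dmv.py.

```python
from collections import defaultdict


def normalize_choose(a_counts):
    """Turn expected attachment counts into P_CHOOSE values.

    a_counts maps (head, direction, arg) -> expected count.
    The key layout is an assumption made for this sketch.
    """
    # Sum of a[h, x] over all arguments x, per head and direction.
    totals = defaultdict(float)
    for (head, direction, _arg), count in a_counts.items():
        totals[head, direction] += count

    # Divide each count by the total for its head and direction.
    return {
        key: count / totals[key[0], key[1]]
        for key, count in a_counts.items()
        if totals[key[0], key[1]] > 0
    }


a = {("vb", "L", "nn"): 3.0, ("vb", "L", "det"): 1.0, ("vb", "R", "nn"): 2.0}
p_choose = normalize_choose(a)
```

Heads whose total is zero are simply left out here; whether they should get a uniform or smoothed value instead is an open choice.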
  61
  62
  63 For stop rules, on the other hand, we use the following:\\
  64 PSTOP(h|left,...) =eg. c(s,t,_h_) / c(s,t,h_) for certain s,t depending on adjacency \\
  65 <=>
  66 | 1/P_sent * e(s,t,_h_) * f(s,t,_h_) |
  67 | /                                  |
  68 | 1/P_sent * e(s,t, h_) * f(s,t, h_) |
  69
  70 A direct translation of h_->STOP h_ into L&Y formula (20) would give:
  71
  72 | 1/P_sent * prob(_h_ -> STOP h_) * e(s,t,h_) * f(s,t,_h_)  |
  73 | /                                                         |
  74 | 1/P_sent * e(s,t,_h_) * f(s,t,_h_)                        |
  75
  76 But we don't want that, since what we're really after is the
  77 "upside-down" probability of stopping when "generating upwards" in the
  78 PCFG tree, so just keep using c()/c() like we've been doing.
  79
  80 In stop rules, the prob() is PSTOP, while for
  81 attachment rules, it's PCHOOSE*(1-PSTOP).
  82 ** TODO [#A] Implement P_CHOOSE formula.
  83 Earlier I was assuming this, but it has to be changed to match the configurations above:
  84
  85 | P_{CHOOSE}(a : h,R) = | \sum_{corpus} \sum_{s=loc(h)} \sum_{t > loc(h)} \sum_{loc(h) < r <= t} c(r,t,_a_)               |
  86 |                       | \sum_{corpus} \sum_{s=loc(h)} \sum_{t > loc(h)} \sum_{loc(h) < r <= t} c(s,t,h) * c(s, r-1, h_) |
  87 |                       |                                                                                                 |
  88 | P_{CHOOSE}(a : h,L) = | \sum_{corpus} \sum_{s<loc(h)} \sum_{t>=loc(h)} \sum_{r<loc(h)} c(s,r,_a_)                       |
  89 |                       | \sum_{corpus} \sum_{s<loc(h)} \sum_{t>=loc(h)} \sum_{r<loc(h)} c(s,t,h_) * c(r+1, t, h_)        |
  90 t >= loc(h) since there are many possibilities for right-attachments
  91 below, and each of them alone gives a lower probability (through
  92 multiplication) to the upper tree (so add them all)
  93
  94 The reason we have to check /both/ children of the attachments is that we
  95 have to make sure they are contiguous (otherwise we would have no way
  96 of ruling out eg. h_->_b_,_b_->b_->_a_, where h_ covers *s* and *t*, _b_ is
  97 from *s* to *x<r* and _a_ is from *s* to *r*).
  98
  99 ** DONE P_STOP formulas for various dir and adj:
 100    CLOSED: [2008-06-15 Sun 23:40]
 101 Assuming this:
 102
 103 | P_{STOP}(STOP : h,L,non_adj) = | \sum_{corpus} \sum_{s<loc(h)} \sum_{t>=loc(h)} c(s,t,_h_) |
 104 |                                | \sum_{corpus} \sum_{s<loc(h)} \sum_{t>=loc(h)} c(s,t,h_)  |
 105 |                                |                                                           |
 106 | P_{STOP}(STOP : h,L,adj) =     | \sum_{corpus} \sum_{s=loc(h)} \sum_{t>=loc(h)} c(s,t,_h_) |
 107 |                                | \sum_{corpus} \sum_{s=loc(h)} \sum_{t>=loc(h)} c(s,t,h_)  |
 108 |                                |                                                           |
 109 | P_{STOP}(STOP : h,R,non_adj) = | \sum_{corpus} \sum_{s=loc(h)} \sum_{t>loc(h)} c(s,t,h_)   |
 110 |                                | \sum_{corpus} \sum_{s=loc(h)} \sum_{t>loc(h)} c(s,t,h)    |
 111 |                                |                                                           |
 112 | P_{STOP}(STOP : h,R,adj) =     | \sum_{corpus} \sum_{s=loc(h)} \sum_{t=loc(h)} c(s,t,h_)   |
 113 |                                | \sum_{corpus} \sum_{s=loc(h)} \sum_{t=loc(h)} c(s,t,h)    |
 114
 115 (And P_{STOP}(-STOP|...) = 1 - P_{STOP}(STOP|...) )
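
A rough sketch of the c()/c() ratio for the (STOP : h,L,non_adj) case, for a single sentence only (the real thing sums numerator and denominator over the corpus before dividing). The chart layout, a dict keyed by (s, t, node), and the names =c= and =p_stop_left_nonadj= are assumptions for the sketch, not the layout used in dmv.py.

```python
def c(inner, outer, p_sent, s, t, node):
    """c(s,t,node) = e(s,t,node) * f(s,t,node) / P_sent.

    Missing chart cells count as probability 0.
    """
    return inner.get((s, t, node), 0.0) * outer.get((s, t, node), 0.0) / p_sent


def p_stop_left_nonadj(inner, outer, p_sent, loc_h, n, sealed, half_sealed):
    """Ratio of summed c() values over spans with s < loc_h <= t."""
    num = den = 0.0
    for s in range(loc_h):
        # Upper bound n is generous; cells that do not exist contribute 0.
        for t in range(loc_h, n + 1):
            num += c(inner, outer, p_sent, s, t, sealed)
            den += c(inner, outer, p_sent, s, t, half_sealed)
    return num / den if den else 0.0


# Toy charts with a single span (0, 2) for a head at position 1.
inner = {(0, 2, "_h_"): 0.2, (0, 2, "h_"): 0.5}
outer = {(0, 2, "_h_"): 0.5, (0, 2, "h_"): 0.4}
p = p_stop_left_nonadj(inner, outer, 0.1, loc_h=1, n=2,
                       sealed="_h_", half_sealed="h_")
p_not_stop = 1.0 - p
```

The other three cases differ only in the ranges for s and t and in which pair of node types goes into the numerator and denominator.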
 116 * DONE outer probabilities
 117   CLOSED: [2008-06-12 Thu 11:11]
 118 # <<outer>>
 119 See also [[http://www.student.uib.no/~kun041/dmvccm/tex/formulas.pdf][pdf of P_{OUTER}]], in the style of Klein's thesis appendix.
 120
 121 When looping through the rules which rewrite to Node, there are 6
 122 different configurations, based on what the above (mother) node is,
 123 and what the Node for which we're computing is.
 124
 125 Here *r* is not between *s* and *t* as in inner(), but an /outer/ index. *loc_N*
 126 is the location of the Node head in the sentence, *loc_m* for the head
 127 of the mother of Node.
 128
 129 + mother is a RIGHT-stop:
 130   - outer(*s, t*, mother.LHS, *loc_N*), no inner-call
 131   - adjacent iff *t* == *loc_m*
 132 + mother is a  LEFT-stop:
 133   - outer(*s, t*, mother.LHS, *loc_N*), no inner-call
 134   - adjacent iff *s* == *loc_m*
 135
 136 + Node is on the LEFT branch (mother.L == Node)
 137   * and mother is a LEFT attachment:
 138     - *loc_N* will be in the LEFT branch, can be anything here.
 139     - In the RIGHT, non-attached, branch we find inner(*t+1, r*, mother.R,
 140       *loc_m*) for all possible *loc_m* in the right part of the sentence.
 141     - outer(*s, r*, mother.LHS, *loc_m*).
 142     - adjacent iff *t+1* == *loc_m*
 143   * and mother is a RIGHT attachment:
 144     - *loc_m* = *loc_N*.
 145     - In the RIGHT, attached, branch we find inner(*t+1, r*, mother.R, *loc_R*) for
 146       all possible *loc_R* in the right part of the sentence.
 147     - outer(*s, r*, mother.LHS, *loc_N*).
 148     - adjacent iff *t* == *loc_m*
 149
 150 + Node is on the RIGHT branch (mother.R == Node)
 151   * and mother is a LEFT attachment:
 152     - *loc_m* = *loc_N*.
 153     - In the LEFT, attached, branch we find inner(*r, s-1*, mother.L, *loc_L*) for
 154       all possible *loc_L* in the left part of the sentence.
 155     - outer(*r, t*, mother.LHS, *loc_m*).
 156     - adjacent iff *s* == *loc_m*
 157   * and mother is a RIGHT attachment:
 158     - *loc_N* will be in the RIGHT branch, can be anything here.
 159     - In the LEFT, non-attached, branch we find inner(*r, s-1*, mother.L, *loc_m*) for
 160       all possible *loc_m* in the left part of the sentence.
 161     - outer(*r, t*, mother.LHS, *loc_N*).
 162     - adjacent iff *s-1* == *loc_m*
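
The "adjacent iff" conditions for the six configurations can be collected in one small lookup. This is only a sketch: the configuration names are invented here, and nothing below is copied from the actual outer() code.

```python
def outer_adjacent(config, s, t, loc_m):
    """Adjacency test for each mother/Node configuration listed above.

    s, t are the span of Node; loc_m is the location used in the
    corresponding "adjacent iff" condition. Names are hypothetical.
    """
    tests = {
        "right_stop": t == loc_m,
        "left_stop": s == loc_m,
        "node_left_mother_left_attach": t + 1 == loc_m,
        "node_left_mother_right_attach": t == loc_m,
        "node_right_mother_left_attach": s == loc_m,
        "node_right_mother_right_attach": s - 1 == loc_m,
    }
    return tests[config]
```

For example, =outer_adjacent("node_left_mother_left_attach", 0, 2, 3)= is true, since *t+1* equals *loc_m*.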
 163
 164 [[file:outer_attachments.jpg]]
 165
 166 : in notes:   in code (constants):    in Klein thesis:
 167 :-------------------------------------------------------------------
 168 : _h_         SEAL                    bar over h
 169 :  h_         RGO_L                   right-under-left-arrow over h
 170 :  h          GO_R                    right-arrow over h
 171 :
 172 :             LGO_R                   left-under-right-arrow over h
 173 :             GO_L                    left-arrow over h
 174
 175 Also, unlike in [[http://bibsonomy.org/bibtex/2b9f6798bb092697da7042ca3f5dee795][Lari & Young]], non-ROOT ('S') symbols may cover the
 176 whole sentence, but ROOT may /only/ appear if it covers the whole
 177 sentence.
 178 ** COMMENT write out tex formulas for outer
 179 [[file:tex/formulas.tex::P_%20OUTSIDE%20SEAL%20w%20i%20j%20P_%20STOP%20stop%20w%20left%20adj%20i][formulas.tex]]
 180 * TOGROK Alternative CNF for DMV
 181 Alternatively; use rules of this form:
 182 # <<dmv2cnf>>
 183 :  h      Terminal
 184 :  h[RA]  Non-Terminal, attaching for the first time to the right
 185 :  h[RN]  Non-Terminal, attaching non-adjacently to the right
 186 :  h_[RA] Non-Terminal, stopping to the right adjacently
 187 :  h_[RN] Non-Terminal, stopping to the right non-adjacently
 188 :  h_[LA] Non-Terminal, attaching for the first time to the left
 189 :  h_[LN] Non-Terminal, attaching non-adjacently to the left
 190 : _h_[LA] Non-Terminal, stopping to the left adjacently
 191 : _h_[LN] Non-Terminal, stopping to the left non-adjacently
 192
 193 :   h[RA] -> h       _a_[LA]  # adjacent right attachment must go to "terminal"
 194 :   h[RA] -> h       _a_[LN]  # adjacent right attachment must go to "terminal"
 195 :
 196 :   h[RN] -> h[RA]   _a_[LA]  # already attached to right
 197 :   h[RN] -> h[RN]   _a_[LN]
 198 :
 199 :  h_[RA] -> h       STOP     # adjacent right stop must go to "terminal"
 200 :  h_[RN] -> h[RN]   STOP     # o/w non-adjacent
 201 :  h_[RN] -> h[RA]   STOP
 202 :
 203 :  h_[LA] -> _a_[LA] h_[RA]   # adjacent left attachment must
 204 :  h_[LA] -> _a_[LN] h_[RN]   # go to mothers of stop rules
 205 :
 206 :  h_[LN] -> _a_[LA] h_[LN]   # already attached to left
 207 :  h_[LN] -> _a_[LN] h_[LA]
 208 :
 209 : _h_[LA] -> STOP    h_[RA]   # adjacent left stop goes
 210 : _h_[LA] -> STOP    h_[RN]   # straight to a right stop
 211 :
 212 : _h_[LN] -> STOP    h_[LA]   # non-adjacent left stop
 213 : _h_[LN] -> STOP    h_[LN]   # goes to a left attachment rule
 214
 215 The reestimation function still has to sum over the various
 216 possibilities of N's and A's; but it seems to be simpler than the
 217 loc_h-method altogether.
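
To get a feel for how many rules this gives per head/argument pair, the listing above can be instantiated mechanically. A sketch, where the template strings are copied from the listing and everything else (=TEMPLATES=, =rules_for=, plain strings as node names) is an assumption for illustration:

```python
# (LHS, left child, right child); {h} is the head tag, {a} the argument tag.
TEMPLATES = [
    ("{h}[RA]", "{h}", "_{a}_[LA]"),
    ("{h}[RA]", "{h}", "_{a}_[LN]"),
    ("{h}[RN]", "{h}[RA]", "_{a}_[LA]"),
    ("{h}[RN]", "{h}[RN]", "_{a}_[LN]"),
    ("{h}_[RA]", "{h}", "STOP"),
    ("{h}_[RN]", "{h}[RN]", "STOP"),
    ("{h}_[RN]", "{h}[RA]", "STOP"),
    ("{h}_[LA]", "_{a}_[LA]", "{h}_[RA]"),
    ("{h}_[LA]", "_{a}_[LN]", "{h}_[RN]"),
    ("{h}_[LN]", "_{a}_[LA]", "{h}_[LN]"),
    ("{h}_[LN]", "_{a}_[LN]", "{h}_[LA]"),
    ("_{h}_[LA]", "STOP", "{h}_[RA]"),
    ("_{h}_[LA]", "STOP", "{h}_[RN]"),
    ("_{h}_[LN]", "STOP", "{h}_[LA]"),
    ("_{h}_[LN]", "STOP", "{h}_[LN]"),
]


def rules_for(head, arg):
    """Instantiate every template for one (head, argument) tag pair."""
    return [
        tuple(part.format(h=head, a=arg) for part in template)
        for template in TEMPLATES
    ]


rules = rules_for("vb", "nn")
```

The stop rules do not mention the argument, so when looping over all arguments for a head they come out repeated; collecting the results in a set would remove the duplicates.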
 218
 219 One might reduce the number of rules a tiny bit, by having eg. unary rules
 220 : _a_ -> _a_[LA]
 221 : _a_ -> _a_[LN]
 222 etc. (although that might just make it all more confusing)
 223 ** COMMENT qtrees, tex
 224 \usepackage{qtree}
 225 \usepackage{amssymb}
 226
 227 \newcommand{\GOR}[1]{\overrightarrow{#1}}
 228 \newcommand{\RGOL}[1]{\overleftarrow{\overrightarrow{#1}}}
 229
 230 \newcommand{\SEAL}[1]{\overline{#1}}
 231
 232 \newcommand{\LGOR}[1]{\overrightarrow{\overleftarrow{#1}}}
 233 \newcommand{\GOL}[1]{\overleftarrow{#1}}
 234
 235 \Tree [.{$\RGOL{h}$} [.{$_s \SEAL{a} _t$\\
 236   Node} ] {$_{t+1} \RGOL{h} _r$\\
 237   R} ]
 238 \Tree [.{$\GOR{h}$} {$_{s} \GOR{h} _{t}$\\
 239   Node} [.{$_{t+1} \SEAL{a} _r$\\
 240   R} ] ]
 241 \Tree [.{$\RGOL{h}$} [.{$_r \SEAL{a} _{s-1}$\\
 242   L} ] {$_{s} \RGOL{h} _t$\\
 243   Node} ]
 244 \Tree [.{$\GOR{h}$} {$_{r} \GOR{h} _{s-1}$\\
 245   L} [.{$_{s} \SEAL{a} _t$\\
 246   Node} ] ]
 247
 248
 249 \Tree [.{$h\urcorner$} [.{$_s \ulcorner a\urcorner _t$\\
 250   Node} ] {$_{t+1} h\urcorner _r$\\
 251   R} ]
 252 \Tree [.{$h$} {$_{s} h _{t}$\\
 253   Node} [.{$_{t+1} \ulcorner a\urcorner _r$\\
 254   R} ] ]
 255 \Tree [.{$h\urcorner$} [.{$_r \ulcorner a\urcorner _{s-1}$\\
 256   L} ] {$_{s} h\urcorner _t$\\
 257   Node} ]
 258 \Tree [.{$h$} {$_{r} h _{s-1}$\\
 259   L} [.{$_{s} \ulcorner a\urcorner _t$\\
 260   Node} ] ]
 261
 262 * TOGROK Find Most Probable Parse of given test sentence
 263 We could probably do it pretty easily from the chart:
 264 - for all loc_h, select the (0,len(sent),ROOT,loc_h) that has the highest p
 265 - for stops, just keep going
 266 - for rewrites, for loc_dep select the one that has the highest p
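
The first step (picking the head of the whole sentence) might look like the sketch below. It assumes the chart is a dict keyed by (s, t, node, loc_h), which is a guess at a convenient layout rather than a description of the real chart. Note that if the chart holds summed inside probabilities, this picks the head with the highest total probability; a chart built with max instead of sum would be needed for a true best-tree (Viterbi) parse.

```python
def best_root(chart, sent_len, root="ROOT"):
    """Return (loc_h, p) for the most probable ROOT entry, or None."""
    candidates = {
        loc_h: p
        for (s, t, node, loc_h), p in chart.items()
        if s == 0 and t == sent_len and node == root
    }
    if not candidates:
        return None
    loc_h = max(candidates, key=candidates.get)
    return loc_h, candidates[loc_h]


chart = {
    (0, 3, "ROOT", 0): 0.01,
    (0, 3, "ROOT", 1): 0.06,
    (0, 3, "ROOT", 2): 0.02,
    (0, 2, "_vb_", 1): 0.30,  # not a full-sentence ROOT entry, ignored
}
best = best_root(chart, 3)
```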
 267 * TOGROK Combine CCM with DMV
 268 Page 108 (pdf: 124) of [[http://www.eecs.berkeley.edu/~klein/papers/klein_thesis.pdf][Klein's thesis]] gives info about this.
 269 ** TODO Make inner() and outer() also allow left-first attachment
 270 Using P_{ORDER}(/left-first/ | w) etc.
 271 * TODO Get a dependency parsed version of WSJ to test with
 272 * Initialization
 273 [[file:~/Documents/Skole/V08/Probability/dmvccm/src/dmv.py::Initialization%20todo][dmv-inits]]
 274
 275 We go through the corpus, since the probabilities are based on how far
 276 away in the sentence arguments are from their heads.
 277 ** TOGROK CCM Initialization
 278 P_{SPLIT} used here... how, again?
 279 ** DONE Separate initialization to another file?                      :PRETTIER:
 280    CLOSED: [2008-06-08 Sun 12:51]
 281 [[file:src/harmonic.py::harmonic%20py%20initialization%20for%20dmv][harmonic.py]]
 282 ** DONE DMV Initialization probabilities
 283 (from initialization frequency)
 284 ** DONE DMV Initialization frequencies
 285    CLOSED: [2008-05-27 Tue 20:04]
 286 *** P_STOP
 287 P_{STOP} is not well defined by K&M. One possible interpretation given
 288 the sentence [det nn vb nn] is
 289 : f_{STOP}( STOP|det, L, adj) +1
 290 : f_{STOP}(-STOP|det, L, adj) +0
 291 : f_{STOP}( STOP|det, L, non_adj) +1
 292 : f_{STOP}(-STOP|det, L, non_adj) +0
 293 : f_{STOP}( STOP|det, R, adj) +0
 294 : f_{STOP}(-STOP|det, R, adj) +1
 295 :
 296 : f_{STOP}( STOP|nn, L, adj) +0
 297 : f_{STOP}(-STOP|nn, L, adj) +1
 298 : f_{STOP}( STOP|nn, L, non_adj) +1  # since there's at least one to the left
 299 : f_{STOP}(-STOP|nn, L, non_adj) +0
 300 **** TODO tweak
 301 # <<pstoptweak>>
 302 :            f[head,  'STOP', 'LN'] += (i_h <= 1)     # first two words
 303 :            f[head, '-STOP', 'LN'] += (not i_h <= 1)
 304 :            f[head,  'STOP', 'LA'] += (i_h == 0)     # very first word
 305 :            f[head, '-STOP', 'LA'] += (not i_h == 0)
 306 :            f[head,  'STOP', 'RN'] += (i_h >= n - 2) # last two words
 307 :            f[head, '-STOP', 'RN'] += (not i_h >= n - 2)
 308 :            f[head,  'STOP', 'RA'] += (i_h == n - 1) # very last word
 309 :            f[head, '-STOP', 'RA'] += (not i_h == n - 1)
 310 vs
 311 :            # this one requires some additional rewriting since it
 312 :            # introduces divisions by zero
 313 :            f[head,  'STOP', 'LN'] += (i_h == 1)     # second word
 314 :            f[head, '-STOP', 'LN'] += (not i_h <= 1) # not first two
 315 :            f[head,  'STOP', 'LA'] += (i_h == 0)     # first word
 316 :            f[head, '-STOP', 'LA'] += (not i_h == 0) # not first
 317 :            f[head,  'STOP', 'RN'] += (i_h == n - 2)     # second-to-last
 318 :            f[head, '-STOP', 'RN'] += (not i_h >= n - 2) # not last two
 319 :            f[head,  'STOP', 'RA'] += (i_h == n - 1)     # last word
 320 :            f[head, '-STOP', 'RA'] += (not i_h == n - 1) # not last
 321 vs
 322 :            f[head,  'STOP', 'LN'] += (i_h == 1)     # second word
 323 :            f[head, '-STOP', 'LN'] += (not i_h == 1) # not second
 324 :            f[head,  'STOP', 'LA'] += (i_h == 0)     # first word
 325 :            f[head, '-STOP', 'LA'] += (not i_h == 0) # not first
 326 :            f[head,  'STOP', 'RN'] += (i_h == n - 2)     # second-to-last
 327 :            f[head, '-STOP', 'RN'] += (not i_h == n - 2) # not second-to-last
 328 :            f[head,  'STOP', 'RA'] += (i_h == n - 1)     # last word
 329 :            f[head, '-STOP', 'RA'] += (not i_h == n - 1) # not last
 330 vs
 331 "all words take the same number of arguments" interpreted as
 332 :for all heads:
 333 :    p_STOP(head, 'STOP', 'LN') = 0.3
 334 :    p_STOP(head, 'STOP', 'LA') = 0.5
 335 :    p_STOP(head, 'STOP', 'RN') = 0.4
 336 :    p_STOP(head, 'STOP', 'RA') = 0.7
 337 (which we easily may tweak in init_zeros())
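
For comparison with the [det nn vb nn] interpretation further up, here is the first of the variants above wrapped in a runnable function. Only the wrapper (=stop_frequencies=, the defaultdict, the loop over the sentence) is added; it is a sketch and not the code in harmonic.py.

```python
from collections import defaultdict


def stop_frequencies(sent):
    """Count STOP / -STOP events per head tag for one sentence."""
    f = defaultdict(int)
    n = len(sent)
    for i_h, head in enumerate(sent):
        # Booleans add as 0 or 1.
        f[head, 'STOP', 'LN'] += (i_h <= 1)          # first two words
        f[head, '-STOP', 'LN'] += (not i_h <= 1)
        f[head, 'STOP', 'LA'] += (i_h == 0)          # very first word
        f[head, '-STOP', 'LA'] += (not i_h == 0)
        f[head, 'STOP', 'RN'] += (i_h >= n - 2)      # last two words
        f[head, '-STOP', 'RN'] += (not i_h >= n - 2)
        f[head, 'STOP', 'RA'] += (i_h == n - 1)      # very last word
        f[head, '-STOP', 'RA'] += (not i_h == n - 1)
    return f


f = stop_frequencies(['det', 'nn', 'vb', 'nn'])
```

For 'det' this reproduces the counts listed earlier (STOP on the left, adjacent and non-adjacent; -STOP on the right, adjacent). For 'nn' the counts are summed over both occurrences.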
 338 *** P_CHOOSE
 339 Go through the corpus, counting distances between heads and
 340 arguments. In [det nn vb nn], we give
 341 - f_{CHOOSE}(nn|det, R) +1/1 + C
 342 - f_{CHOOSE}(vb|det, R) +1/2 + C
 343 - f_{CHOOSE}(nn|det, R) +1/3 + C
 344   - If this were the full corpus, P_{CHOOSE}(nn|det, R) would have
 345     (1+1/3+2C) / sum_a f_{CHOOSE}(a|det, R)
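
The distance-based counting can be sketched like this. The notes above only spell out the right-hand arguments of 'det'; treating left arguments the same way, and the names =choose_frequencies= and the (arg, head, direction) key order, are assumptions of the sketch.

```python
from collections import defaultdict


def choose_frequencies(sent, C=0.0):
    """Add 1/distance + C for every (argument, head) pair in a sentence."""
    f = defaultdict(float)
    for i_h, head in enumerate(sent):
        for i_a, arg in enumerate(sent):
            if i_a == i_h:
                continue  # a word is not its own argument
            direction = 'R' if i_a > i_h else 'L'
            f[arg, head, direction] += 1.0 / abs(i_a - i_h) + C
    return f


f = choose_frequencies(['det', 'nn', 'vb', 'nn'], C=0.1)
```

With this sentence, =f['nn', 'det', 'R']= comes out as 1 + 1/3 + 2C, matching the numerator given above.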
 346
 347 The ROOT gets "each argument with equal probability", so in a sentence
 348 of three words, 1/3 for each (in [nn vb nn], 'nn' gets 2/3). Basically
 349 just a frequency count of the corpus...
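
As a sketch of the ROOT counting (the name =root_frequencies= is invented here):

```python
from collections import Counter


def root_frequencies(corpus):
    """Give every word in a sentence an equal share of that sentence's ROOT."""
    f = Counter()
    for sent in corpus:
        for tag in sent:
            f[tag] += 1.0 / len(sent)
    return f


f_root = root_frequencies([['nn', 'vb', 'nn']])
```

Here 'nn' ends up with 2/3 and 'vb' with 1/3, as in the example.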
 350
 351 In a sense there are no terminal probabilities, since an /h/ can only
 352 rewrite to an 'h' anyway (it's just a check for whether, at this
 353 location in the sentence, we have the right POS-tag).
 354 * [#C] Deferred
 355 http://wiki.python.org/moin/PythonSpeed/PerformanceTips Eg., use
 356 map/reduce/filter/[i for i in [i's]]/(i for i in [i's]) instead of
 357 for-loops; use local variables for globals (global variables or
 358 functions), etc.
 359
 360 ** TODO when reestimating P_STOP etc, remove rules with p < epsilon   :OPTIMIZE:
 361 ** TODO inner_dmv, short ranges and impossible attachment             :OPTIMIZE:
 362 If t-s <= 2, there can be only one attachment below, so don't recurse
 363 with both Lattach=True and Rattach=True.
 364
 365 If t-s <= 1, there can be no attachment below, so only recurse with
 366 Lattach=False, Rattach=False.
 367
 368 Put this in the loop under rewrite rules (could also do it in the STOP
 369 section, but that would only have an effect on very short sentences).
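
A possible shape for that check, reading the span length as t - s with *t* exclusive (this reading, and the helper name =attachment_options=, are assumptions of the sketch):

```python
def attachment_options(s, t):
    """Return the (Lattach, Rattach) combinations worth recursing on."""
    length = t - s
    if length <= 1:
        # A single word: nothing can be attached below.
        return [(False, False)]
    options = [(False, False), (True, False), (False, True)]
    if length > 2:
        # Only spans of three or more words have room for both.
        options.append((True, True))
    return options
```

The rewrite-rule loop would then iterate over =attachment_options(s, t)= instead of over all four combinations.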
 370 ** TODO clean up the module files                                     :PRETTIER:
 371 Is there a better way to divide dmv and harmonic? There's a two-way
 372 dependency between the modules. Guess there could be a third file that
 373 imports both the initialization and the actual EM stuff, while a file
 374 containing constants and classes could be imported by all others:
 375 : dmv.py imports dmv_EM.py imports dmv_classes.py
 376 : dmv.py imports dmv_inits.py imports dmv_classes.py
 377
 378 ** TOGROK Some (tagged) sentences are bound to come twice             :OPTIMIZE:
 379 Eg, first sort and count, so that the corpus
 380 [['nn','vbd','det','nn'],
 381  ['vbd','nn','det','nn'],
 382  ['nn','vbd','det','nn']]
 383 becomes
 384 [(['nn','vbd','det','nn'],2),
 385  (['vbd','nn','det','nn'],1)]
 386 and then in each loop through sentences, make sure we handle the
 387 frequency correctly.
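
The counting itself is short with the standard library. A sketch (sentences have to become tuples first, since lists are not hashable):

```python
from collections import Counter

corpus = [['nn', 'vbd', 'det', 'nn'],
          ['vbd', 'nn', 'det', 'nn'],
          ['nn', 'vbd', 'det', 'nn']]

# Count identical tag sequences, most frequent first.
counted = Counter(tuple(sent) for sent in corpus)
weighted = [(list(sent), freq) for sent, freq in counted.most_common()]
```

Each expected count computed for a sentence would then be multiplied by its frequency before being added to the corpus totals.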
 388
 389 Is there much to gain here?
 390
 391 ** TOGROK tags as numbers or tags as strings?                         :OPTIMIZE:
 392 Need to clean up the representation.
 393
 394 Stick with tag-strings in initialization then switch to numbers for
 395 IO-algorithm perhaps? Can probably afford more string-matching in
 396 initialization..
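
One way the switch could work, as a sketch (=build_tag_index= and =encode= are hypothetical helpers, numbering tags in order of first appearance):

```python
def build_tag_index(corpus):
    """Map each tag string to a small integer, in order of first appearance."""
    tag_to_num = {}
    for sent in corpus:
        for tag in sent:
            tag_to_num.setdefault(tag, len(tag_to_num))
    return tag_to_num


def encode(corpus, tag_to_num):
    """Replace tag strings by their numbers."""
    return [[tag_to_num[tag] for tag in sent] for sent in corpus]


corpus = [['nn', 'vbd', 'det', 'nn']]
tag_to_num = build_tag_index(corpus)
numbered = encode(corpus, tag_to_num)
num_to_tag = {num: tag for tag, num in tag_to_num.items()}  # for printing
```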
 397 * Adjacency and combining it with the inside-outside algorithm
 398 Each DMV_Rule has both a probN and a probA, for adjacencies. inner()
 399 and outer() need the correct one in each case.
 400
 401 In each inner() call, loc_h is the location of the head of this
 402 dependency structure. In each outer() call, it's the head of the /Node/,
 403 the structure we're looking outside of.
 404
 405 We call inner() for each location of a head, and on each terminal,
 406 loc_h must equal s (and t). In the recursive attachment calls, we use
 407 the locations (sentence indices) of words to the left or right of the
 408 head in calls to inner(). /loc_h lets us check whether we need probN or
 409 probA/.
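
A stripped-down illustration of a rule carrying both probabilities. This is not the real DMV_Rule class; =RuleSketch= and its =p()= method are invented for the example, and how adjacency is decided from loc_h is left to the caller.

```python
from dataclasses import dataclass


@dataclass
class RuleSketch:
    """A CNF rule LHS -> L R with separate adjacent/non-adjacent weights."""
    LHS: str
    L: str
    R: str
    probN: float  # used when the head has already taken an argument here
    probA: float  # used when this would be the first (adjacent) one

    def p(self, adjacent):
        """Pick the probability matching the adjacency of this rewrite."""
        return self.probA if adjacent else self.probN


rule = RuleSketch("h_", "_a_", "h_", probN=0.1, probA=0.3)
```

inner() and outer() would compute the adjacency flag from loc_h and the span, then call something like =rule.p(adjacent)=.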
 410 ** Possible alternate type of adjacency
 411 K&M's adjacency is just whether or not an argument has been generated
 412 in the current direction yet. One could also make a stronger type of
 413 adjacency, where h and a are not adjacent if b is in between, eg. with
 414 the sentence "a b h" and the structure ((h->a), (a->b)), h is
 415 K&M-adjacent to a, but not next to a, since b is in between. It's easy
 416 to check this type of adjacency in inner(), but it needs new rules for
 417 P_STOP reestimation.
 418 * Expectation Maximization in IO/DMV-terms
 419 outer(s,t,Node) and inner(s,t,Node) calculate the expected number of
 420 trees (CNF-)headed by Node from *s* to *t* (sentence positions). This uses
 421 the P_STOP and P_CHOOSE values, which have been conveniently
 422 distributed into CNF rules as probN and probA (non-adjacent and
 423 adjacent probabilities).
 424
 425 When re-estimating, we use the expected values from outer() and
 426 inner() to get new values for P_STOP and P_CHOOSE. When we've
 427 re-estimated for the entire corpus, we distribute P_STOP and P_CHOOSE
 428 into the CNF rules again, so that in the next round we use new probN
 429 and probA to find outer- and inner-probabilities.
 430
 431 The distribution of P_STOP and P_CHOOSE into CNF rules also happens in
 432 init_normalize() (here along with the creation of P_STOP and
 433 P_CHOOSE); P_STOP is used to create CNF rules where one branch of the
 434 rule is STOP, P_CHOOSE is used to create rules of the form
 435 : h  -> h  _a_
 436 : h_ -> h_ _a_
 437
 438 Since "adjacency" is not captured in regular CNF rules, we need two
 439 probabilities for each rule, and outer() and inner() have to know when
 440 to use which.
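
As noted under the P_STOP/P_CHOOSE heading, the prob() of a stop rule is PSTOP and that of an attachment rule is PCHOOSE*(1-PSTOP). A sketch of filling in the two probabilities per rule under that reading (the function names are made up, and the real init_normalize() may organise this differently):

```python
def attachment_probs(p_choose, p_stop_adj, p_stop_nonadj):
    """Return (probN, probA) for one attachment rule such as h -> h _a_."""
    probN = p_choose * (1.0 - p_stop_nonadj)
    probA = p_choose * (1.0 - p_stop_adj)
    return probN, probA


def stop_probs(p_stop_adj, p_stop_nonadj):
    """Return (probN, probA) for a rule where one branch is STOP."""
    return p_stop_nonadj, p_stop_adj


probN, probA = attachment_probs(p_choose=0.5, p_stop_adj=0.6, p_stop_nonadj=0.2)
```

Here p_choose is P_CHOOSE(a|h,dir) and the two p_stop values are P_STOP(STOP|h,dir,adj) and P_STOP(STOP|h,dir,non_adj) for the same head and direction.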
 441
 442
 443 * Python-stuff
 444 # <<python>>
 445 Make those debug statements steal a bit less attention in emacs:
 446 :(font-lock-add-keywords
 447 : 'python-mode                   ; not really regexp, a bit slow
 448 : '(("^\\( *\\)\\(\\if +'.+' +in +io.DEBUG. *\\(
 449 :\\1    .+$\\)+\\)" 2 font-lock-preprocessor-face t)))
 450 :(font-lock-add-keywords
 451 : 'python-mode
 452 : '(("\\<\\(\\(io\\.\\)?debug(.+)\\)" 1 font-lock-preprocessor-face t)))
 453
 454 - [[file:src/pseudo.py][pseudo.py]]
 455 - http://nltk.org/doc/en/structured-programming.html recursive dynamic
 456 - http://nltk.org/doc/en/advanced-parsing.html
 457 - http://jaynes.colorado.edu/PythonIdioms.html
 458
 459
 460
 461 * Git
 462 Repository web page: http://repo.or.cz/w/dmvccm.git
 463
 464 Setting up a new project:
 465 : git init
 466 : git add .
 467 : git commit -m "first release"
 468
 469 Later on: (=-a= stages modified and deleted tracked files automatically; new files still need =git add=)
 470 : git init
 471 : git commit -a -m "some subsequent release"
 472
 473 Then push stuff up to the remote server:
 474 : git push git+ssh://username@repo.or.cz/srv/git/dmvccm.git master
 475
 476 (=eval `ssh-agent`= and =ssh-add= to avoid having to type in the passphrase all
 477 the time)
 478
 479 Make a copy of the (remote) master branch:
 480 : git clone git://repo.or.cz/dmvccm.git
 481
 482 Make and name a new branch in this folder:
 483 : git checkout -b mybranch
 484
 485 To save changes in =mybranch=:
 486 : git commit -a
 487
 488 Go back to the master branch (uncommitted changes from =mybranch= are
 489 carried over):
 490 : git checkout master
 491
 492 Try out:
 493 : git add --interactive
 494
 495 Good tutorial:
 496 http://www-cs-students.stanford.edu/~blynn//gitmagic/