# -*- coding: mule-utf-8-unix -*-

#+STARTUP: overview
#+TAGS: OPTIMIZE PRETTIER
#+STARTUP: hidestars
#+TITLE: DMV/CCM -- todo-list / progress
#+AUTHOR: Kevin Brubeck Unhammer
#+EMAIL: K.BrubeckUnhammer at student uva nl
#+OPTIONS: ^:{}  skip:t
#+LANGUAGE: en
#+SEQ_TODO: TOGROK TODO DONE

[[file:src/main.py][main.py]]
[[file:src/wsjdep.py][wsjdep.py]]
[[file:src/loc_h_dmv.py][loc_h_dmv.py]]

Trying out /not/ dividing by sum_hat_a now (in reestimate2), see how
that goes...

Meta-todo:
- debug reestimate2 which stores charts for all sentences and has
  arguments as the outer loop
  - have to fix the lack of attachment probabilities...
- fix cnf outer

[[file:DMVCCM.html][DMVCCM.html]]

* DMV/CCM report and project
- DMV-[[file:tex/formulas.pdf][formulas.pdf]]  -- /clear/ information =D
- [[file:src/main.py][main.py]] -- evaluation, corpus likelihoods
- [[file:src/wsjdep.py][wsjdep.py]] -- corpus

- [[file:src/loc_h_dmv.py][loc_h_dmv.py]] -- DMV-IO and reestimation
- [[file:src/loc_h_harmonic.py][loc_h_harmonic.py]] -- DMV initialization

- [[file:src/common_dmv.py][common_dmv.py]] -- various functions used by loc_h_dmv and others
- [[file:src/io.py][io.py]] -- non-DMV IO

- [[file:src/cnf_dmv.py][cnf_dmv.py]] -- cnf-like implementation of DMV
- [[file:src/cnf_harmonic.py][cnf_harmonic.py]] -- initialization for cnf_dmv

[[http://www.student.uib.no/~kun041/dmvccm/DMVCCM_archive.html][Archived entries]] from this file.
* Notation
: old notes:   new notes:   in tex/code (constants):    in Klein thesis:
:--------------------------------------------------------------------------------------
: _h_            _h_            SEAL                    bar over h
:  h_             h><           RGOL                    right-under-left-arrow over h
:  h              h>            GOR                     right-arrow over h
:
:               ><h             LGOR                    left-under-right-arrow over h
:                <h             GOL                     left-arrow over h
These are represented in the code as pairs =(s_h,h)=, where =h= is an
integer (POS-tag) and =s_h= \in ={SEAL,RGOL,GOR,LGOR,GOL}=.

=P_ATTACH= and =P_CHOOSE= are synonymous; I try to use the
former. Also,
: P_GO_AT(a|h,dir,adj) := P_ATTACH(a|h,dir)*(1-P_STOP(STOP|h,dir,adj))

(precalculated after each reestimation with =g.p_GO_AT = make_GO_AT(g.p_STOP,g.p_ATTACH)=)
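
A minimal sketch of that precalculation, to make the formula concrete. The
function name =make_go_at_sketch= and the dictionary key layouts
(=(a,h,dir)= for attachment, =(h,dir,adj)= for stopping) are assumptions made
for illustration; the real =make_GO_AT= in the source may store things
differently.

```python
def make_go_at_sketch(p_stop, p_attach):
    """Combine attachment and non-stop probabilities.

    Assumed (hypothetical) layouts:
      p_attach[(a, h, direction)]  -> P_ATTACH(a|h,dir)
      p_stop[(h, direction, adj)]  -> P_STOP(STOP|h,dir,adj)
    Returns p_go_at[(a, h, direction, adj)].
    """
    p_go_at = {}
    for (a, h, direction), p_att in p_attach.items():
        for adj in (True, False):
            # P_GO_AT = P_ATTACH * (1 - P_STOP)
            p_go_at[(a, h, direction, adj)] = p_att * (1.0 - p_stop[(h, direction, adj)])
    return p_go_at


p_attach = {('nn', 'vbd', 'L'): 0.5}
p_stop = {('vbd', 'L', True): 0.2, ('vbd', 'L', False): 0.6}
p_go_at = make_go_at_sketch(p_stop, p_attach)
```

Precalculating this once per reestimation round saves a multiplication and a
subtraction in every attachment step of inner() and outer().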
** COMMENT qtrees, tex
\usepackage{qtree}
\usepackage{amssymb}

\newcommand{\GOR}[1]{\overrightarrow{#1}}
\newcommand{\RGOL}[1]{\overleftarrow{\overrightarrow{#1}}}

\newcommand{\SEAL}[1]{\overline{#1}}

\newcommand{\LGOR}[1]{\overrightarrow{\overleftarrow{#1}}}
\newcommand{\GOL}[1]{\overleftarrow{#1}}

\Tree [.{$\RGOL{h}$} [.{$_s \SEAL{a} _t$\\
  Node} ] {$_{t+1} \RGOL{h} _r$\\
  R} ]
\Tree [.{$\GOR{h}$} {$_{s} \GOR{h} _{t}$\\
  Node} [.{$_{t+1} \SEAL{a} _r$\\
  R} ] ]
\Tree [.{$\RGOL{h}$} [.{$_r \SEAL{a} _{s-1}$\\
  L} ] {$_{s} \RGOL{h} _t$\\
  Node} ]
\Tree [.{$\GOR{h}$} {$_{r} \GOR{h} _{s-1}$\\
  L} [.{$_{s} \SEAL{a} _t$\\
  Node} ] ]


\Tree [.{$h\urcorner$} [.{$_s \ulcorner a\urcorner _t$\\
  Node} ] {$_{t+1} h\urcorner _r$\\
  R} ]
\Tree [.{$h$} {$_{s} h _{t}$\\
  Node} [.{$_{t+1} \ulcorner a\urcorner _r$\\
  R} ] ]
\Tree [.{$h\urcorner$} [.{$_r \ulcorner a\urcorner _{s-1}$\\
  L} ] {$_{s} h\urcorner _t$\\
  Node} ]
\Tree [.{$h$} {$_{r} h _{s-1}$\\
  L} [.{$_{s} \ulcorner a\urcorner _t$\\
  Node} ] ]
* Testing the dependency parsed WSJ
[[file:src/wsjdep.py][wsjdep.py]] uses NLTK (sort of) to get a dependency parsed version of
WSJ10 into the format used in mpp() in loc_h_dmv.py.

As a default, =WSJDepCorpusReader= looks for the file =wsj.combined.10.dep= in
=../corpus/wsjdep=.

Only =sents()=, =tagged_sents()= and =parsed_sents()= (plus a new function
=tagonly_sents()=) are implemented, the other NLTK corpus functions are
..um.. undefined...
** TODO [#A] Should =def evaluate= use add_root?
[[file:src/main.py::def%20evaluate%20g%20tagonly_sents%20parsed_sents][main.py]] evaluate
[[file:src/wsjdep.py][wsjdep.py]] add_root

(just has to count how many pairs are in there; Precision and Recall)
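
A rough sketch of the pair counting, not the actual =evaluate= in main.py.
The function name and the representation of a dependency as a
=(head_loc, arg_loc)= pair are assumptions for illustration. Whether the ROOT
pair is added to both sets (the add_root question) changes both counts, which
is why the choice matters.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for two collections of dependency pairs.

    Each pair is assumed to be (head_loc, arg_loc); any hashable
    representation works as long as both sides use the same one.
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


gold = {(2, 1), (2, 3), (1, 0)}
pred = {(2, 1), (3, 2), (1, 0)}   # one reversed attachment
precision, recall = precision_recall(pred, gold)
```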
* TODO [#C] Alternative CNF for DMV

# <<dmv2cnf>>
- [[file:src/cnf_dmv.py][cnf_dmv.py]]
- [[file:src/cnf_harmonic.py][cnf_harmonic.py]]

See section 5 of [[file:tex/formulas.pdf][formulas.pdf]].

Given a grammar with certain p_ATTACH, p_STOP and p_ROOT, we get:
:>>> print testgrammar_h()
:  h>< -->   h>  STOP   [0.30]
:  h>< -->  >h>  STOP   [0.40]
: _h_  --> STOP    h><  [1.00]
: _h_  --> STOP   <h><  [1.00]
: >h>  -->   h>   _h_   [1.00]
: >h>  -->  >h>   _h_   [1.00]
: <h>< -->  _h_    h><  [0.70]
: <h>< -->  _h_   <h><  [0.60]
:ROOT  --> STOP   _h_   [1.00]

** TODO [#A] Make and implement an equivalent grammar that's /pure/ CNF
...since I'm not sure about my unary reestimation rules (section 5 of
[[file:tex/formulas.pdf][formulas]]).

For any rule where LHS is =_h_= we also have a corresponding one with
LHS =ROOT=, only difference being that we multiply in =p_ROOT(h)=.

For any rule where LHS is =.h>=, we use adjacent probabilities for the
left child; if LHS is =<h.= we use adjacent probabilities for the right
child. Only =_h_= and =_h>_= (plus =ROOT=) get to introduce the pre-terminal
=h= (where =h=, =ROOT= and =_h_= all rewrite to the terminal
@<code>'h'@</code>), and only =_h_= and =_h>_= (plus =ROOT=) act as STOP
rules (i.e. get to multiply in =p(STOP)=).

:  h   -->  'h'         1
: _h_  -->  'h'         p(STOP|h,L,adj) * p(STOP|h,R,adj)
: ROOT -->  'h'         p(STOP|h,L,adj) * p(STOP|h,R,adj) * p_ROOT(h)
:
: _h_  -->   h    _a_   p(STOP|h,L,adj) * p(STOP|h,R,non) * p(a|h,R)*p(-STOP|h,R,adj)
: _h_  -->   h    .h>   p(STOP|h,L,adj) * p(STOP|h,R,non)
: .h>  -->  _a_   _b_   p(a|h,R)*p(-STOP|h,R,adj) * p(b|h,R)*p(-STOP|h,R,non)
: .h>  -->  _a_    h>   p(a|h,R)*p(-STOP|h,R,adj)
:  h>  -->  _a_   _b_   p(a|h,R)*p(-STOP|h,R,non) * p(b|h,R)*p(-STOP|h,R,non)
:  h>  -->  _a_    h>   p(a|h,R)*p(-STOP|h,R,non)
:
: _h_  -->  _a_    h    p(STOP|h,L,non) * p(STOP|h,R,adj) * p(a|h,L)*p(-STOP|h,L,adj)
: _h_  -->  <h.    h    p(STOP|h,L,non) * p(STOP|h,R,adj)
: <h.  -->  _b_   _a_   p(b|h,L)*p(-STOP|h,L,non) * p(a|h,L)*p(-STOP|h,L,adj)
: <h.  -->  <h    _a_                               p(a|h,L)*p(-STOP|h,L,adj)
: <h   -->  _a_   _b_   p(a|h,L)*p(-STOP|h,L,non) * p(b|h,L)*p(-STOP|h,L,non)
: <h   -->  <h    _a_   p(a|h,L)*p(-STOP|h,L,non)
:
: _h_  -->  <h.    _h>_ p(STOP|h,L,non)
: _h_  -->  _a_    _h>_ p(STOP|h,L,non) * p(a|h,L)*p(-STOP|h,L,adj)
: _h>_ -->   h     .h>  p(STOP|h,R,non)
: _h>_ -->   h     _a_  p(STOP|h,R,non) * p(a|h,R)*p(-STOP|h,R,adj)
:
: ROOT -->   h    _a_   p(STOP|h,L,adj) * p(STOP|h,R,non) * p(a|h,R)*p(-STOP|h,R,adj) * p_ROOT(h)
: ROOT -->   h    .h>   p(STOP|h,L,adj) * p(STOP|h,R,non) * p_ROOT(h)
:
: ROOT -->  _a_    h    p(STOP|h,L,non) * p(STOP|h,R,adj) * p(a|h,L)*p(-STOP|h,L,adj) * p_ROOT(h)
: ROOT -->  <h.    h    p(STOP|h,L,non) * p(STOP|h,R,adj) * p_ROOT(h)
:
: ROOT -->  <h.   _h>_  p(STOP|h,L,non) * p_ROOT(h)
: ROOT -->  _a_   _h>_  p(STOP|h,L,non) * p(a|h,L)*p(-STOP|h,L,adj) * p_ROOT(h)
:

Since we have rules rewriting =h= to =a= and =b=, we have a rule-set
numbering more than n_{tags}^{2}.
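
As a sanity check on the table above, here is a small sketch computing the
probabilities of the three terminal rules for one head. The function name and
the key layouts =p_stop[(h,dir,adj)]= and =p_root[h]= are assumptions for
illustration, not what cnf_dmv.py necessarily uses.

```python
LEFT, RIGHT = 'L', 'R'
ADJ, NON = True, False


def terminal_rule_probs(h, p_stop, p_root):
    """Probabilities of the rules rewriting h, _h_ and ROOT to the terminal 'h'.

    Assumed (hypothetical) layouts: p_stop[(h, direction, adj)], p_root[h].
    """
    terminal = "'%s'" % h
    # A head with no arguments stops adjacently on both sides.
    seal = p_stop[(h, LEFT, ADJ)] * p_stop[(h, RIGHT, ADJ)]
    return {
        (h, terminal): 1.0,
        ('_%s_' % h, terminal): seal,
        ('ROOT', terminal): seal * p_root[h],
    }


p_stop = {('h', LEFT, ADJ): 0.5, ('h', RIGHT, ADJ): 0.4}
p_root = {'h': 0.5}
rules = terminal_rule_probs('h', p_stop, p_root)
```

The binary rules follow the same pattern: each one multiplies the STOP terms
its LHS is responsible for with the attachment terms of the children it
introduces.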

** TOGROK [#A] convert L&Y-based reestimation into P_ATTACH and P_STOP values
Sum over the various rules? Or something? Must think of this.
** TODO [#C] move as much as possible into common_dmv.py
[[file:src/common_dmv.py][common_dmv.py]]
** DONE L&Y-based reestimation for cnf_dmv
   CLOSED: [2008-08-21 Thu 16:35]
** DONE dmv2cnf re-estimation formulas
   CLOSED: [2008-08-21 Thu 16:36]
** DONE inner and outer for cnf_dmv.py, also cnf_harmonic.py
* TOGROK Combine CCM with DMV

# <<comboquestions>>

Questions about the =P_COMBO= info in [[http://www.eecs.berkeley.edu/~klein/papers/klein_thesis.pdf][Klein's thesis]]:
- Page 109 (pdf: 125): We have to premultiply "all our probabilities"
  by the CCM base product /\Pi_{<i,j>}
  P_{SPAN}(\alpha(i,j,s)|false)P_{CONTEXT}(\beta(i,j,s)|false)/; which
  probabilities are included under "all"? I'm assuming this includes
  =P_ATTACH= since each time =P_ATTACH= is used, /\phi/ is multiplied in
  (pp.110-111 ibid.); but /\phi/ is not used for STOPs, so should we not
  have our CCM product multiplied in there? How about =P_ROOT=?
  (Guessing =P_ORDER= is way out of the question...)
- For the outside probabilities, is it correct to assume we multiply
  in /\phi(j,k)/ or /\phi(k,i)/ when calculating =inner(i,j...)=? (I.e., only
  for the outside part, not for the whole range.) I don't understand
  the notation in =O()= on p.103.
* TOGROK Reestimate P_ORDER ?
* Most Probable Parse
** TOGROK Find MPP with CCM
** DONE Find Most Probable Parse of given test sentence, in DMV
  CLOSED: [2008-07-23 Wed 10:56]
inner() optionally keeps track of the highest probability children of
any node in =mpptree=. Say we're looking for =inner(i,j,(s_h,h),loc_h)= in
a certain sentence, and we find some possible left and right children,
we add to =mpptree[i,j,(s_h,h),loc_h]= the triple =(p, L, R)= where =L= and
=R= are of the same form as the key (=i,j,(s_h,h),loc_h=) and =p= is the
probability of this node rewriting to =L= and =R=,
eg. =inner(L)*inner(R)*p_GO_AT= or =p_STOP= or whatever. We only add this
entry to =mpptree= if there wasn't a higher-probability entry there
before.

Then, after =inner_sent= makes an =mpptree=, we find the /relevant/
head-argument pairs by searching through the tree using a queue,
adding the =L= and =R= keys of any entry to the queue as we find them
(skipping =STOP= keys), and adding any attachment entries to a set of
triples =(head,argument,dir)=. Thus we have our most probable parse,
eg.
: set([( ROOT, (vbd,2),RIGHT),
:      ((vbd,2),(nn,1),LEFT),
:      ((vbd,2),(nn,3),RIGHT),
:      ((nn,1),(det,0),LEFT)])
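
The queue search can be sketched roughly as follows. This is a simplified
illustration, not the code in loc_h_dmv.py: the =STOP= sentinel, the string
constants, the way an attachment is recognised (both children are non-STOP and
one has a different head location) and the omission of the ROOT triple are all
assumptions.

```python
from collections import deque

SEAL, RGOL, GOR = 'SEAL', 'RGOL', 'GOR'
LEFT, RIGHT = 'LEFT', 'RIGHT'
STOP = 'STOP'  # hypothetical sentinel standing in for STOP keys


def extract_attachments(mpptree, root_key):
    """Walk mpptree breadth-first and collect (head, argument, dir) triples.

    Keys are (i, j, (s_h, h), loc_h); values are (p, L, R).
    """
    attachments = set()
    queue = deque([root_key])
    while queue:
        key = queue.popleft()
        if key not in mpptree:      # terminal node: nothing below it
            continue
        _p, left, right = mpptree[key]
        children = [c for c in (left, right) if c != STOP]
        if len(children) == 2:
            # Attachment: the child headed somewhere else is the argument.
            loc_h = key[3]
            head = (key[2][1], loc_h)
            if left[3] != loc_h:
                arg, direction = left, LEFT
            else:
                arg, direction = right, RIGHT
            attachments.add((head, (arg[2][1], arg[3]), direction))
        queue.extend(children)      # STOP keys are skipped
    return attachments


# Toy chart for the two-word sentence "nn vbd", with vbd heading nn.
toy = {
    (0, 2, (SEAL, 'vbd'), 1): (1.0, STOP, (0, 2, (RGOL, 'vbd'), 1)),
    (0, 2, (RGOL, 'vbd'), 1): (1.0, (0, 1, (SEAL, 'nn'), 0), (1, 2, (RGOL, 'vbd'), 1)),
    (1, 2, (RGOL, 'vbd'), 1): (1.0, (1, 2, (GOR, 'vbd'), 1), STOP),
    (0, 1, (SEAL, 'nn'), 0): (1.0, STOP, (0, 1, (RGOL, 'nn'), 0)),
    (0, 1, (RGOL, 'nn'), 0): (1.0, (0, 1, (GOR, 'nn'), 0), STOP),
}
parse = extract_attachments(toy, (0, 2, (SEAL, 'vbd'), 1))
```

Because each =mpptree= entry already holds only the best-scoring children, the
search never has to compare alternatives; it only follows pointers.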
* Initialization
[[file:~/Documents/Skole/V08/Probability/dmvccm/src/dmv.py::Initialization%20todo][dmv-inits]]

We go through the corpus, since the probabilities are based on how far
away in the sentence arguments are from their heads.
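
To illustrate why a pass over the corpus is needed, here is a sketch of a
distance-based attachment initialization. The plain 1/distance weighting is an
assumption suggested by the "harmonic" file name; the actual constants,
smoothing and the treatment of =P_STOP= in loc_h_harmonic.py may well differ.
The function name and the key layout =(a,h,dir)= are likewise hypothetical.

```python
from collections import defaultdict


def harmonic_attach(corpus):
    """Initial P_ATTACH(a|h,dir) from 1/distance weights, normalised per (h, dir)."""
    weights = defaultdict(float)
    totals = defaultdict(float)
    for sent in corpus:
        for loc_h, h in enumerate(sent):
            for loc_a, a in enumerate(sent):
                if loc_a == loc_h:
                    continue
                direction = 'L' if loc_a < loc_h else 'R'
                w = 1.0 / abs(loc_a - loc_h)   # closer arguments weigh more
                weights[(a, h, direction)] += w
                totals[(h, direction)] += w
    return {(a, h, d): w / totals[(h, d)] for (a, h, d), w in weights.items()}


p_attach = harmonic_attach([['det', 'nn', 'vbd']])
```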
** TOGROK CCM Initialization
P_{SPLIT} used here... how, again?
* [#C] Deferred
http://wiki.python.org/moin/PythonSpeed/PerformanceTips Eg., use
map/reduce/filter/[i for i in [i's]]/(i for i in [i's]) instead of
for-loops; use local variables for globals (global variables or
functions), etc.
** TODO Clean up reestimation code                                    :PRETTIER:
** TODO [#A] compare speed of w_left/right(...) and w(LEFT/RIGHT, ...) :OPTIMIZE:
** TODO when reestimating P_STOP etc, remove rules with p < epsilon   :OPTIMIZE:
** TODO inner_dmv, short ranges and impossible attachment             :OPTIMIZE:
If s-t <= 2, there can be only one attachment below, so don't recurse
with both Lattach=True and Rattach=True.

If s-t <= 1, there can be no attachment below, so only recurse with
Lattach=False, Rattach=False.

Put this in the loop under rewrite rules (could also do it in the STOP
section, but that would only have an effect on very short sentences).
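
The pruning could look something like this sketch. The helper name
=attach_options= is made up for illustration, and =span_len= stands for the
quantity written as s-t in the note above.

```python
def attach_options(span_len):
    """Return the (Lattach, Rattach) combinations worth recursing into."""
    if span_len <= 1:
        # No room for any attachment below.
        return [(False, False)]
    if span_len <= 2:
        # Room for at most one attachment, so never both sides at once.
        return [(False, False), (True, False), (False, True)]
    return [(False, False), (True, False), (False, True), (True, True)]


for lattach, rattach in attach_options(2):
    pass  # recurse into inner() with these flags
```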
** TODO clean up the module files                                     :PRETTIER:
Is there a better way to divide dmv and harmonic? There's a two-way
dependency between the modules. Guess there could be a third file that
imports both the initialization and the actual EM stuff, while a file
containing constants and classes could be imported by all others:
: dmv.py imports dmv_EM.py imports dmv_classes.py
: dmv.py imports dmv_inits.py imports dmv_classes.py

** TOGROK Some (tagged) sentences are bound to come twice             :OPTIMIZE:
Eg, first sort and count, so that the corpus
: [['nn','vbd','det','nn'],
:  ['vbd','nn','det','nn'],
:  ['nn','vbd','det','nn']]
becomes
: [(['nn','vbd','det','nn'],2),
:  (['vbd','nn','det','nn'],1)]
and then in each loop through sentences, make sure we handle the
frequency correctly.

Is there much to gain here?
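
A possible way to do the counting, using =collections.Counter= from the
standard library. The function name is hypothetical; sentences are turned into
tuples only because lists cannot be dictionary keys.

```python
from collections import Counter


def count_sentences(corpus):
    """Collapse duplicate tag sequences into (sentence, frequency) pairs."""
    counts = Counter(tuple(sent) for sent in corpus)
    # most_common() lists the most frequent sentences first.
    return [(list(sent), n) for sent, n in counts.most_common()]


corpus = [['nn', 'vbd', 'det', 'nn'],
          ['vbd', 'nn', 'det', 'nn'],
          ['nn', 'vbd', 'det', 'nn']]
counted = count_sentences(corpus)
```

In the reestimation loop, the expected counts from a sentence would then be
multiplied by its frequency, and the corpus log-likelihood would add
frequency times the sentence log-probability.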

** TOGROK tags as numbers or tags as strings?                         :OPTIMIZE:
Need to clean up the representation.

Stick with tag-strings in initialization then switch to numbers for
IO-algorithm perhaps? Can probably afford more string-matching in
initialization..
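
One way the switch could work, as a sketch: build a two-way mapping once after
initialization and convert the corpus to integers before running the
IO-algorithm. The names =make_tag_index=, =tag2num= and =num2tag= are
hypothetical.

```python
def make_tag_index(corpus):
    """Map each tag string to a small integer, in order of first appearance."""
    tag2num = {}
    for sent in corpus:
        for tag in sent:
            tag2num.setdefault(tag, len(tag2num))
    num2tag = {n: t for t, n in tag2num.items()}
    return tag2num, num2tag


corpus = [['nn', 'vbd', 'det', 'nn']]
tag2num, num2tag = make_tag_index(corpus)
numeric_corpus = [[tag2num[t] for t in sent] for sent in corpus]
```

Keeping =num2tag= around makes it cheap to print grammars and parses with
readable tags again.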
* Adjacency and combining it with the inside-outside algorithm
Each DMV_Rule has both a probN and a probA, for adjacencies. inner()
and outer() need the correct one in each case.

In each inner() call, loc_h is the location of the head of this
dependency structure. In each outer() call, it's the head of the /Node/,
the structure we're looking outside of.

We call inner() for each location of a head, and on each terminal,
loc_h must equal =i= (and =loc_h+1= equal =j=). In the recursive attachment
calls, we use the locations (sentence indices) of words to the left or
right of the head in calls to inner(). /loc_h lets us check whether we
need probN or probA/.
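
A sketch of how loc_h can decide between the two probabilities, assuming spans
are half-open =[i,j)= (consistent with a terminal having =i = loc_h= and
=j = loc_h+1=). The helper names are made up; the actual check in
loc_h_dmv.py may be written differently.

```python
LEFT, RIGHT = 'L', 'R'


def is_adjacent(i, j, loc_h, direction):
    """True if the head constituent [i, j) has no argument yet in `direction`.

    With the head at loc_h, nothing has been attached on the right iff the
    span ends right after the head, and nothing on the left iff it starts
    at the head.
    """
    if direction == RIGHT:
        return j == loc_h + 1
    return i == loc_h


def pick_prob(prob_a, prob_n, i, j, loc_h, direction):
    """Choose the adjacent or non-adjacent probability of a rule."""
    return prob_a if is_adjacent(i, j, loc_h, direction) else prob_n


# Head at position 2 whose constituent currently covers [1, 3):
# one left argument already attached, none on the right.
p_left = pick_prob(0.7, 0.3, 1, 3, 2, LEFT)
p_right = pick_prob(0.7, 0.3, 1, 3, 2, RIGHT)
```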
** Possible alternate type of adjacency
K&M's adjacency is just whether or not an argument has been generated
in the current direction yet. One could also make a stronger type of
adjacency, where h and a are not adjacent if b is in between, eg. with
the sentence "a b h" and the structure ((h->a), (a->b)), h is
K&M-adjacent to a, but not next to a, since b is in between. It's easy
to check this type of adjacency in inner(), but it needs new rules for
P_STOP reestimation.
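
The difference between the two notions can be stated in a few lines. Both
function names are hypothetical, and the sketch only covers the check itself,
not the new reestimation rules it would require.

```python
def km_adjacent(n_args_so_far):
    """K&M adjacency: no argument generated yet in this direction."""
    return n_args_so_far == 0


def strongly_adjacent(loc_h, loc_a):
    """Stronger adjacency: head and argument are next to each other."""
    return abs(loc_h - loc_a) == 1


# Sentence "a b h" with structure ((h->a), (a->b)): a=0, b=1, h=2.
# a is the first (and only) left argument of h ...
first_arg = km_adjacent(0)
# ... but b sits between them.
neighbours = strongly_adjacent(2, 0)
```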
* Python-stuff
# <<python>>
Make those debug statements steal a bit less attention in emacs:
:(font-lock-add-keywords
: 'python-mode                   ; not really regexp, a bit slow
: '(("^\\( *\\)\\(\\if +'.+' +in +io.DEBUG. *\\(
:\\1    .+$\\)+\\)" 2 font-lock-preprocessor-face t)))
:(font-lock-add-keywords
: 'python-mode
: '(("\\<\\(\\(io\\.\\)?debug(.+)\\)" 1 font-lock-preprocessor-face t)))

- [[file:src/pseudo.py][pseudo.py]]
- http://nltk.org/doc/en/structured-programming.html recursive dynamic
- http://nltk.org/doc/en/advanced-parsing.html
- http://jaynes.colorado.edu/PythonIdioms.html

* Git
Repository web page: http://repo.or.cz/w/dmvccm.git

Setting up a new project:
: git init
: git add .
: git commit -m "first release"

Later on (=-a= automatically stages modified and deleted tracked files;
new files still need =git add=):
: git commit -a -m "some subsequent release"

Then push stuff up to the remote server:
: git push git+ssh://username@repo.or.cz/srv/git/dmvccm.git master

(=eval `ssh-agent`= and =ssh-add= to avoid having to type in the passphrase
all the time)

Make a copy of the (remote) master branch:
: git clone git://repo.or.cz/dmvccm.git

Make and name a new branch in this folder:
: git checkout -b mybranch

To save changes in =mybranch=:
: git commit -a

Go back to the master branch (uncommitted changes from =mybranch= are
carried over):
: git checkout master

Try out:
: git add --interactive

Good tutorial:
http://www-cs-students.stanford.edu/~blynn//gitmagic/