DMVCCM.org

   1 # -*- coding: mule-utf-8-unix -*-
   2
   3 #+STARTUP: overview
   4 #+TAGS: OPTIMIZE PRETTIER
   5 #+STARTUP: hidestars
   6 #+TITLE: DMV/CCM -- todo-list / progress
   7 #+AUTHOR: Kevin Brubeck Unhammer
   8 #+EMAIL: K.BrubeckUnhammer at student uva nl
   9 #+OPTIONS: ^:{}  skip:t
  10 #+LANGUAGE: en
  11 #+SEQ_TODO: TOGROK TODO DONE
  12
  13 [[file:src/main.py][main.py]]
  14 [[file:src/wsjdep.py][wsjdep.py]]
  15 [[file:src/loc_h_dmv.py][loc_h_dmv.py]]
  16
  17 Meta-todo:
  18 - fix stop and attachment formulas so they divide before summing
  19   - both in formulas.tex and in code
  20   - and check with other values for P_ORDER
  21
  22 [[file:DMVCCM.html][DMVCCM.html]]
  23
  24 * DMV/CCM report and project
  25   DEADLINE: <2008-09-21 Sun>
  26 - DMV-[[file:tex/formulas.pdf][formulas.pdf]]  -- /clear/ information =D
  27 - [[file:src/main.py][main.py]] -- evaluation, corpus likelihoods
  28 - [[file:src/wsjdep.py][wsjdep.py]] -- corpus
  29
  30 - [[file:src/loc_h_dmv.py][loc_h_dmv.py]] -- DMV-IO and reestimation
  31 - [[file:src/loc_h_harmonic.py][loc_h_harmonic.py]] -- DMV initialization
  32
  33 - [[file:src/common_dmv.py][common_dmv.py]] -- various functions used by loc_h_dmv and others
  34 - [[file:src/io.py][io.py]] -- non-DMV IO
  35
  36 - [[file:src/cnf_dmv.py][cnf_dmv.py]] -- cnf-like implementation of DMV
  37 - [[file:src/cnf_harmonic.py][cnf_harmonic.py]] -- initialization for cnf_dmv
  38
  39 [[http://www.student.uib.no/~kun041/dmvccm/DMVCCM_archive.html][Archived entries]] from this file.
  40 * Notation
  41 : old notes:   new notes:   in tex/code (constants):    in Klein thesis:
  42 :--------------------------------------------------------------------------------------
  43 : _h_            _h_            SEAL                    bar over h
  44 :  h_             h><           RGOL                    right-under-left-arrow over h
  45 :  h              h>            GOR                     right-arrow over h
  46 :
  47 :               ><h             LGOR                    left-under-right-arrow over h
  48 :                <h             GOL                     left-arrow over h
  49 These are represented in the code as pairs =(s_h,h)=, where =h= is an
  50 integer (POS-tag) and =s_h= \in ={SEAL,RGOL,GOR,LGOR,GOL}=.
  51
  52 =P_ATTACH= and =P_CHOOSE= are synonymous; I try to use the
  53 former. Also,
  54 : P_GO_AT(a|h,dir,adj) := P_ATTACH(a|h,dir)*(1-P_STOP(STOP|h,dir,adj))
  55
  56 (precalculated after each reestimation with =g.p_GO_AT = make_GO_AT(g.p_STOP,g.p_ATTACH)=)
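
A minimal sketch of what this precalculation could look like. The dict
key layouts and the name =make_go_at_sketch= are assumptions made for
illustration; the real =make_GO_AT= in loc_h_dmv.py may store things
differently.

```python
# Hypothetical key layouts (assumptions, not taken from loc_h_dmv.py):
#   p_STOP[(h, dir, adj)]  = P_STOP(STOP | h, dir, adj)
#   p_ATTACH[(a, h, dir)]  = P_ATTACH(a | h, dir)
LEFT, RIGHT = "L", "R"
ADJ, NON = True, False


def make_go_at_sketch(p_STOP, p_ATTACH):
    """P_GO_AT(a|h,dir,adj) = P_ATTACH(a|h,dir) * (1 - P_STOP(STOP|h,dir,adj))."""
    p_GO_AT = {}
    for (a, h, direction), p_att in p_ATTACH.items():
        for adj in (ADJ, NON):
            p_stop = p_STOP[(h, direction, adj)]
            p_GO_AT[(a, h, direction, adj)] = p_att * (1.0 - p_stop)
    return p_GO_AT


# Toy values: head tag 0, argument tag 1, attaching to the right.
p_STOP = {(0, RIGHT, ADJ): 0.25, (0, RIGHT, NON): 0.5}
p_ATTACH = {(1, 0, RIGHT): 0.5}
go_at = make_go_at_sketch(p_STOP, p_ATTACH)
```

Precomputing the product once per reestimation keeps the multiplication
out of the inner loops of inner() and outer().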
  57 ** COMMENT qtrees, tex
  58 \usepackage{qtree}
  59 \usepackage{amssymb}
  60
  61 \newcommand{\GOR}[1]{\overrightarrow{#1}}
  62 \newcommand{\RGOL}[1]{\overleftarrow{\overrightarrow{#1}}}
  63
  64 \newcommand{\SEAL}[1]{\overline{#1}}
  65
  66 \newcommand{\LGOR}[1]{\overrightarrow{\overleftarrow{#1}}}
  67 \newcommand{\GOL}[1]{\overleftarrow{#1}}
  68
  69 \Tree [.{$\RGOL{h}$} [.{$_s \SEAL{a} _t$\\
  70   Node} ] {$_{t+1} \RGOL{h} _r$\\
  71   R} ]
  72 \Tree [.{$\GOR{h}$} {$_{s} \GOR{h} _{t}$\\
  73   Node} [.{$_{t+1} \SEAL{a} _r$\\
  74   R} ] ]
  75 \Tree [.{$\RGOL{h}$} [.{$_r \SEAL{a} _{s-1}$\\
  76   L} ] {$_{s} \RGOL{h} _t$\\
  77   Node} ]
  78 \Tree [.{$\GOR{h}$} {$_{r} \GOR{h} _{s-1}$\\
  79   L} [.{$_{s} \SEAL{a} _t$\\
  80   Node} ] ]
  81
  82
  83 \Tree [.{$h\urcorner$} [.{$_s \ulcorner a\urcorner _t$\\
  84   Node} ] {$_{t+1} h\urcorner _r$\\
  85   R} ]
  86 \Tree [.{$h$} {$_{s} h _{t}$\\
  87   Node} [.{$_{t+1} \ulcorner a\urcorner _r$\\
  88   R} ] ]
  89 \Tree [.{$h\urcorner$} [.{$_r \ulcorner a\urcorner _{s-1}$\\
  90   L} ] {$_{s} h\urcorner _t$\\
  91   Node} ]
  92 \Tree [.{$h$} {$_{r} h _{s-1}$\\
  93   L} [.{$_{s} \ulcorner a\urcorner _t$\\
  94   Node} ] ]
  95 * Testing the dependency parsed WSJ
  96 [[file:src/wsjdep.py][wsjdep.py]] uses NLTK (sort of) to get a dependency parsed version of
  97 WSJ10 into the format used in mpp() in loc_h_dmv.py.
  98
  99 As a default, =WSJDepCorpusReader= looks for the file =wsj.combined.10.dep= in
 100 =../corpus/wsjdep=.
 101
 102 Only =sents()=, =tagged_sents()= and =parsed_sents()= (plus a new function
 103 =tagonly_sents()=) are implemented; the other NLTK corpus functions are
 104 left undefined.
 105 ** TODO [#A] Should =def evaluate= use add_root?
 106 [[file:src/main.py::def%20evaluate%20g%20tagonly_sents%20parsed_sents][main.py]] evaluate
 107 [[file:src/wsjdep.py][wsjdep.py]] add_root
 108
 109 (just has to count how many pairs are in there; Precision and Recall)
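
A rough sketch of the pair counting, assuming both the predicted and
the gold parse are given as sets of (head index, argument index)
pairs. The function name and the pair format are assumptions; this is
not the actual =evaluate= in main.py. Whether the ROOT pair is included
(the add_root question) changes both sets by one pair per sentence.

```python
def precision_recall(predicted_pairs, gold_pairs):
    """Precision and recall over sets of (head, argument) pairs."""
    predicted, gold = set(predicted_pairs), set(gold_pairs)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy sentence of four words; pairs are (head index, argument index).
gold = {(2, 1), (2, 3), (1, 0)}
pred = {(2, 1), (2, 3), (3, 0)}
p, r = precision_recall(pred, gold)
```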
 110 * TOGROK Combine CCM with DMV
 111
 112 # <<comboquestions>>
 113
 114 Questions about the =P_COMBO= info in [[http://www.eecs.berkeley.edu/~klein/papers/klein_thesis.pdf][Klein's thesis]]:
 115 - Page 109 (pdf: 125): We have to premultiply "all our probabilities"
 116   by the CCM base product /\Pi_{<i,j>}
 117   P_{SPAN}(\alpha(i,j,s)|false)P_{CONTEXT}(\beta(i,j,s)|false)/; which
 118   probabilities are included under "all"? I'm assuming this includes
 119   =P_ATTACH= since each time =P_ATTACH= is used, /\phi/ is multiplied in
 120   (pp.110-111 ibid.); but /\phi/ is not used for STOPs, so should we not
 121   have our CCM product multiplied in there? How about =P_ROOT=?
 122   (Guessing =P_ORDER= is way out of the question...)
 123 - For the outside probabilities, is it correct to assume we multiply
 124   in /\phi(j,k)/ or /\phi(k,i)/ when calculating =inner(i,j...)=? (I.e., only
 125   for the outside part, not for the whole range.) I don't understand
 126   the notation in =O()= on p.103.
 127 * TOGROK Reestimate P_ORDER ?
 128 * Most Probable Parse
 129 ** TOGROK Find MPP with CCM
 130 ** DONE Find Most Probable Parse of given test sentence, in DMV
 131   CLOSED: [2008-07-23 Wed 10:56]
 132 inner() optionally keeps track of the highest probability children of
 133 any node in =mpptree=. Say we're looking for =inner(i,j,(s_h,h),loc_h)= in
 134 a certain sentence, and we find some possible left and right children,
 135 we add to =mpptree[i,j,(s_h,h),loc_h]= the triple =(p, L, R)= where =L= and
 136 =R= are of the same form as the key (=i,j,(s_h,h),loc_h=) and =p= is the
 137 probability of this node rewriting to =L= and =R=,
 138 eg. =inner(L)*inner(R)*p_GO_AT= or =p_STOP= or whatever. We only add this
 139 entry to =mpptree= if there wasn't a higher-probability entry there
 140 before.
 141
 142 Then, after =inner_sent= makes an =mpptree=, we find the /relevant/
 143 head-argument pairs by searching through the tree using a queue,
 144 adding the =L= and =R= keys of any entry to the queue as we find them
 145 (skipping =STOP= keys), and adding any attachment entries to a set of
 146 triples =(head,argument,dir)=. Thus we have our most probable parse,
 147 eg.
 148 : set([( ROOT, (vbd,2),RIGHT),
 149 :      ((vbd,2),(nn,1),LEFT),
 150 :      ((vbd,2),(nn,3),RIGHT),
 151 :      ((nn,1),(det,0),LEFT)])
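
The queue search can be sketched roughly as follows. This is a
simplified illustration, not the code in loc_h_dmv.py: it assumes STOP
children are marked with the placeholder string ="STOP"=, that
terminals have no entry in =mpptree=, and that an attachment is any
entry with two non-STOP children. ROOT is left out.

```python
from collections import deque

STOP = "STOP"  # assumed placeholder used where a child is a STOP


def attachments_from_mpptree(mpptree, top_key):
    """Collect (head, argument, dir) triples by walking mpptree breadth-first."""
    parse = set()
    queue = deque([top_key])
    while queue:
        key = queue.popleft()
        if key not in mpptree:          # terminal: no children recorded
            continue
        _p, left, right = mpptree[key]
        children = [k for k in (left, right) if k != STOP]
        if len(children) == 2:          # two real children: an attachment
            loc_h = key[3]
            head, arg = (left, right) if left[3] == loc_h else (right, left)
            direction = "LEFT" if arg[3] < head[3] else "RIGHT"
            parse.add(((head[2][1], head[3]), (arg[2][1], arg[3]), direction))
        queue.extend(children)          # STOP children are never enqueued
    return parse


# Toy mpptree for the tag sequence "nn vbd", with vbd heading nn.
# Keys have the form (i, j, (s_h, h), loc_h); values are (p, L, R).
gor_nn = (0, 1, ("GOR", "nn"), 0)
rgol_nn = (0, 1, ("RGOL", "nn"), 0)
seal_nn = (0, 1, ("SEAL", "nn"), 0)
gor_vbd = (1, 2, ("GOR", "vbd"), 1)
rgol_vbd = (1, 2, ("RGOL", "vbd"), 1)
rgol_top = (0, 2, ("RGOL", "vbd"), 1)
seal_top = (0, 2, ("SEAL", "vbd"), 1)

mpptree = {
    rgol_nn: (0.9, gor_nn, STOP),
    seal_nn: (0.8, STOP, rgol_nn),
    rgol_vbd: (0.9, gor_vbd, STOP),
    rgol_top: (0.2, seal_nn, rgol_vbd),
    seal_top: (0.5, STOP, rgol_top),
}
parse = attachments_from_mpptree(mpptree, seal_top)
```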
 152 * Initialization
 153 [[file:~/Documents/Skole/V08/Probability/dmvccm/src/dmv.py::Initialization%20todo][dmv-inits]]
 154
 155 We go through the corpus, since the probabilities are based on how far
 156 away in the sentence arguments are from their heads.
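
One way to picture distance-based initialization: give every possible
head-argument pair in a sentence a weight of 1/distance, then
normalize per head and direction. This is only an illustration of the
idea; the weighting, any smoothing constant and the normalization in
loc_h_harmonic.py may well differ, and the function name and key
layout here are made up.

```python
from collections import defaultdict


def harmonic_attach_sketch(tagged_sents):
    """Initial P_ATTACH(a|h,dir) from 1/distance weights summed over a corpus."""
    weights = defaultdict(float)  # (a, h, dir) -> summed 1/distance
    totals = defaultdict(float)   # (h, dir)    -> normalizer
    for sent in tagged_sents:
        for loc_h, h in enumerate(sent):
            for loc_a, a in enumerate(sent):
                if loc_a == loc_h:
                    continue
                direction = "L" if loc_a < loc_h else "R"
                w = 1.0 / abs(loc_h - loc_a)
                weights[(a, h, direction)] += w
                totals[(h, direction)] += w
    return {k: w / totals[(k[1], k[2])] for k, w in weights.items()}


p_attach = harmonic_attach_sketch([["det", "nn", "vbd"]])
```

With this toy corpus, =nn= (distance 1) gets twice the weight of =det=
(distance 2) as a left argument of =vbd=.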
 157 ** TOGROK CCM Initialization
 158 P_{SPLIT} used here... how, again?
 159 * TODO [#C] Alternative CNF for DMV
 160
 161 # <<dmv2cnf>>
 162 - [[file:src/cnf_dmv.py][cnf_dmv.py]]
 163 - [[file:src/cnf_harmonic.py][cnf_harmonic.py]]
 164
 165 See section 5 of [[file:tex/formulas.pdf][formulas.pdf]].
 166
 167 Given a grammar with certain p_ATTACH, p_STOP and p_ROOT, we get:
 168 :>>> print testgrammar_h()
 169 :  h>< -->   h>  STOP   [0.30]
 170 :  h>< -->  >h>  STOP   [0.40]
 171 : _h_  --> STOP    h><  [1.00]
 172 : _h_  --> STOP   <h><  [1.00]
 173 : >h>  -->   h>   _h_   [1.00]
 174 : >h>  -->  >h>   _h_   [1.00]
 175 : <h>< -->  _h_    h><  [0.70]
 176 : <h>< -->  _h_   <h><  [0.60]
 177 :ROOT  --> STOP   _h_   [1.00]
 178
 179 ** TODO [#A] Make and implement an equivalent grammar that's /pure/ CNF
 180 ...since I'm not sure about my unary reestimation rules (section 5 of
 181 [[file:tex/formulas.pdf][formulas]]).
 182
 183 For any rule where LHS is =_h_= we also have a corresponding one with
 184 LHS =ROOT=, only difference being that we multiply in =p_ROOT(h)=.
 185
 186 For any rule where LHS is =.h>=, we use adjacent probabilities for the
 187 left child; if LHS is =<h.= we use adjacent probabilities for the right
 188 child. Only =_h_= and =_h>_= (plus =ROOT=) get to introduce the pre-terminal
 189 =h= (where =h=, =ROOT= and =_h_= all rewrite to the terminal
 190 @<code>'h'@</code>), and only =_h_= and =_h>_= (plus =ROOT=) act as STOP
 191 rules (i.e. get to multiply in =p(STOP)=).
 192
 193 :  h   -->  'h'         1
 194 : _h_  -->  'h'         p(STOP|h,L,adj) * p(STOP|h,R,adj)
 195 : ROOT -->  'h'         p(STOP|h,L,adj) * p(STOP|h,R,adj) * p_ROOT(h)
 196 :
 197 : _h_  -->   h    _a_   p(STOP|h,L,adj) * p(STOP|h,R,non) * p(a|h,R)*p(-STOP|h,R,adj)
 198 : _h_  -->   h    .h>   p(STOP|h,L,adj) * p(STOP|h,R,non)
 199 : .h>  -->  _a_   _b_   p(a|h,R)*p(-STOP|h,R,adj) * p(b|h,R)*p(-STOP|h,R,non)
 200 : .h>  -->  _a_    h>   p(a|h,R)*p(-STOP|h,R,adj)
 201 :  h>  -->  _a_   _b_   p(a|h,R)*p(-STOP|h,R,non) * p(b|h,R)*p(-STOP|h,R,non)
 202 :  h>  -->  _a_    h>   p(a|h,R)*p(-STOP|h,R,non)
 203 :
 204 : _h_  -->  _a_    h    p(STOP|h,L,non) * p(STOP|h,R,adj) * p(a|h,L)*p(-STOP|h,L,adj)
 205 : _h_  -->  <h.    h    p(STOP|h,L,non) * p(STOP|h,R,adj)
 206 : <h.  -->  _b_   _a_   p(b|h,L)*p(-STOP|h,L,non) * p(a|h,L)*p(-STOP|h,L,adj)
 207 : <h.  -->  <h    _a_                               p(a|h,L)*p(-STOP|h,L,adj)
 208 : <h   -->  _a_   _b_   p(a|h,L)*p(-STOP|h,L,non) * p(b|h,L)*p(-STOP|h,L,non)
 209 : <h   -->  <h    _a_   p(a|h,L)*p(-STOP|h,L,non)
 210 :
 211 : _h_  -->  <h.    _h>_ p(STOP|h,L,non)
 212 : _h_  -->  _a_    _h>_ p(STOP|h,L,non) * p(a|h,L)*p(-STOP|h,L,adj)
 213 : _h>_ -->   h     .h>  p(STOP|h,R,non)
 214 : _h>_ -->   h     _a_  p(STOP|h,R,non) * p(a|h,R)*p(-STOP|h,R,adj)
 215 :
 216 : ROOT -->   h    _a_   p(STOP|h,L,adj) * p(STOP|h,R,non) * p(a|h,R)*p(-STOP|h,R,adj) * p_ROOT(h)
 217 : ROOT -->   h    .h>   p(STOP|h,L,adj) * p(STOP|h,R,non) * p_ROOT(h)
 218 :
 219 : ROOT -->  _a_    h    p(STOP|h,L,non) * p(STOP|h,R,adj) * p(a|h,L)*p(-STOP|h,L,adj) * p_ROOT(h)
 220 : ROOT -->  <h.    h    p(STOP|h,L,non) * p(STOP|h,R,adj) * p_ROOT(h)
 221 :
 222 : ROOT -->  <h.   _h>_  p(STOP|h,L,non) * p_ROOT(h)
 223 : ROOT -->  _a_   _h>_  p(STOP|h,L,non) * p(a|h,L)*p(-STOP|h,L,adj) * p_ROOT(h)
 224 :
 225
 226 Since we have rules rewriting =h= to =a= and =b=, we have a rule-set
 227 numbering more than n_{tags}^{2}.
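
To make the rule probabilities in the table concrete, here is a sketch
that computes three of them (=_h_ --> 'h'=, =ROOT --> 'h'= and
=_h_ --> h _a_=) from given =p_STOP=, =p_ATTACH= and =p_ROOT= values. The
dict layouts and the function name are assumptions for illustration,
not what cnf_dmv.py does; =p(-STOP|...)= is taken to be
=1 - p(STOP|...)=.

```python
L, R = "L", "R"
ADJ, NON = True, False


def cnf_rule_probs_sketch(h, a, p_STOP, p_ATTACH, p_ROOT):
    """Probabilities of three rules from the table, keyed by (LHS, RHS...)."""

    def stop(direction, adj):
        return p_STOP[(h, direction, adj)]

    def go(direction, adj):          # p(-STOP | h, dir, adj)
        return 1.0 - stop(direction, adj)

    seal_term = stop(L, ADJ) * stop(R, ADJ)
    return {
        ("_h_", "'h'"): seal_term,
        ("ROOT", "'h'"): seal_term * p_ROOT[h],
        ("_h_", "h", "_a_"): (stop(L, ADJ) * stop(R, NON)
                              * p_ATTACH[(a, h, R)] * go(R, ADJ)),
    }


# Toy values for head vbd and right argument nn.
p_STOP = {("vbd", L, ADJ): 0.5, ("vbd", R, ADJ): 0.25, ("vbd", R, NON): 0.5}
p_ATTACH = {("nn", "vbd", R): 0.5}
p_ROOT = {"vbd": 0.5}
rules = cnf_rule_probs_sketch("vbd", "nn", p_STOP, p_ATTACH, p_ROOT)
```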
 228
 229 ** TOGROK [#A] convert L&Y-based reestimation into P_ATTACH and P_STOP values
 230 Sum over the various rules? Or something? Must think of this.
 231 ** TODO [#C] move as much as possible into common_dmv.py
 232 [[file:src/common_dmv.py][common_dmv.py]]
 233 ** DONE L&Y-based reestimation for cnf_dmv
 234    CLOSED: [2008-08-21 Thu 16:35]
 235 ** DONE dmv2cnf re-estimation formulas
 236    CLOSED: [2008-08-21 Thu 16:36]
 237 ** DONE inner and outer for cnf_dmv.py, also cnf_harmonic.py
 238 * [#C] Deferred
 239 http://wiki.python.org/moin/PythonSpeed/PerformanceTips E.g., use
 240 map/reduce/filter/[i for i in [i's]]/(i for i in [i's]) instead of
 241 for-loops; use local variables for globals (global variables or
 242 functions), etc.
 243 ** TODO Clean up reestimation code                                    :PRETTIER:
 244 ** TODO [#A] compare speed of w_left/right(...) and w(LEFT/RIGHT, ...) :OPTIMIZE:
 245 ** TODO when reestimating P_STOP etc, remove rules with p < epsilon   :OPTIMIZE:
 246 ** TODO inner_dmv, short ranges and impossible attachment             :OPTIMIZE:
 247 If t-s <= 2, there can be only one attachment below, so don't recurse
 248 with both Lattach=True and Rattach=True.
 249
 250 If t-s <= 1, there can be no attachment below, so only recurse with
 251 Lattach=False, Rattach=False.
 252
 253 Put this in the loop under rewrite rules (could also do it in the STOP
 254 section, but that would only have an effect on very short sentences).
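
A small sketch of the pruning, assuming s < t so that the span length
is t - s, and that =Lattach= / =Rattach= say whether the left / right
child is allowed to contain an attachment. The function name is
hypothetical.

```python
def attach_options(s, t):
    """(Lattach, Rattach) combinations worth recursing with for the span s..t."""
    span = t - s
    if span <= 1:
        # No room for any attachment below.
        return [(False, False)]
    if span <= 2:
        # Room for at most one attachment below: never both at once.
        return [(False, False), (True, False), (False, True)]
    return [(False, False), (True, False), (False, True), (True, True)]


short = attach_options(3, 4)
medium = attach_options(3, 5)
long_ = attach_options(0, 5)
```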
 255 ** TODO clean up the module files                                     :PRETTIER:
 256 Is there a better way to divide dmv and harmonic? There's a two-way
 257 dependency between the modules. Guess there could be a third file that
 258 imports both the initialization and the actual EM stuff, while a file
 259 containing constants and classes could be imported by all others:
 260 : dmv.py imports dmv_EM.py imports dmv_classes.py
 261 : dmv.py imports dmv_inits.py imports dmv_classes.py
 262
 263 ** TOGROK Some (tagged) sentences are bound to come twice             :OPTIMIZE:
 264 Eg, first sort and count, so that the corpus
 265 : [['nn','vbd','det','nn'],
 266 :  ['vbd','nn','det','nn'],
 267 :  ['nn','vbd','det','nn']]
 268 becomes
 269 : [(['nn','vbd','det','nn'],2),
 270 :  (['vbd','nn','det','nn'],1)]
 271 and then in each loop through sentences, make sure we handle the
 272 frequency correctly.
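
A sketch of the counting step using =collections.Counter=. Sentences are
turned into tuples so they can be used as dict keys; =weighted_total= is
a made-up helper showing how a per-sentence quantity would be
multiplied by the sentence frequency.

```python
from collections import Counter

corpus = [['nn', 'vbd', 'det', 'nn'],
          ['vbd', 'nn', 'det', 'nn'],
          ['nn', 'vbd', 'det', 'nn']]

# Lists are not hashable, so count tuples instead.
counted = Counter(tuple(sent) for sent in corpus)


def weighted_total(counted_sents, per_sentence_value):
    """Sum a per-sentence value over the corpus, once per distinct sentence."""
    return sum(freq * per_sentence_value(sent)
               for sent, freq in counted_sents.items())


total_words = weighted_total(counted, len)
```

The gain depends on how many duplicates the corpus has: each distinct
sentence is processed once instead of once per occurrence.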
 273
 274 Is there much to gain here?
 275
 276 ** TOGROK tags as numbers or tags as strings?                         :OPTIMIZE:
 277 Need to clean up the representation.
 278
 279 Stick with tag-strings in initialization then switch to numbers for
 280 IO-algorithm perhaps? Can probably afford more string-matching in
 281 initialization..
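
One possible shape for the switch, sketched under the assumption that
tags are numbered in order of first appearance; the names are
hypothetical and nothing here is taken from the existing modules.

```python
def make_tag_index(tagged_sents):
    """Build tag-string <-> number mappings in order of first appearance."""
    tag_to_num = {}
    for sent in tagged_sents:
        for tag in sent:
            tag_to_num.setdefault(tag, len(tag_to_num))
    num_to_tag = {num: tag for tag, num in tag_to_num.items()}
    return tag_to_num, num_to_tag


sents = [['det', 'nn', 'vbd'], ['nn', 'vbd', 'det', 'nn']]
tag_to_num, num_to_tag = make_tag_index(sents)

# Convert once after initialization; the IO algorithm then sees only ints.
numbered = [[tag_to_num[tag] for tag in sent] for sent in sents]
```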
 282 * Adjacency and combining it with the inside-outside algorithm
 283 Each DMV_Rule has both a probN and a probA, for adjacencies. inner()
 284 and outer() need the correct one in each case.
 285
 286 In each inner() call, loc_h is the location of the head of this
 287 dependency structure. In each outer() call, it's the head of the /Node/,
 288 the structure we're looking outside of.
 289
 290 We call inner() for each location of a head, and on each terminal,
 291 loc_h must equal =i= (and =loc_h+1= equal =j=). In the recursive attachment
 292 calls, we use the locations (sentence indices) of words to the left or
 293 right of the head in calls to inner(). /loc_h lets us check whether we
 294 need probN or probA/.
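
A sketch of how loc_h could decide between the two probabilities. It
assumes half-open spans (a terminal head covers =i == loc_h= to
=j == loc_h+1=, as above) and a split point =k= between the head's part
and the argument's part. =RuleSketch= is a stand-in for =DMV_Rule=, and
the two helper functions are my reading of the idea, not the actual
code.

```python
class RuleSketch:
    """Stand-in for DMV_Rule: one probability per adjacency value."""

    def __init__(self, probN, probA):
        self.probN = probN  # non-adjacent
        self.probA = probA  # adjacent

    def p(self, adjacent):
        return self.probA if adjacent else self.probN


def is_adjacent_right(loc_h, k):
    """Argument spans k..j: adjacent iff the head's part ends right after loc_h."""
    return k == loc_h + 1


def is_adjacent_left(loc_h, k):
    """Argument spans i..k: adjacent iff the head's part starts at loc_h."""
    return k == loc_h


rule = RuleSketch(probN=0.1, probA=0.4)
p_first_right = rule.p(is_adjacent_right(loc_h=2, k=3))  # head has no right args yet
p_later_left = rule.p(is_adjacent_left(loc_h=2, k=1))    # head already has a left arg
```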
 295 ** Possible alternate type of adjacency
 296 K&M's adjacency is just whether or not an argument has been generated
 297 in the current direction yet. One could also make a stronger type of
 298 adjacency, where h and a are not adjacent if b is in between, eg. with
 299 the sentence "a b h" and the structure ((h->a), (a->b)), h is
 300 K&M-adjacent to a, but not next to a, since b is in between. It's easy
 301 to check this type of adjacency in inner(), but it needs new rules for
 302 P_STOP reestimation.
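
The two notions side by side, as a sketch with made-up function names.
K&M adjacency is modelled here by counting how many arguments the head
has already taken in the current direction; the stronger type only
looks at sentence positions.

```python
def km_adjacent(args_already_in_dir):
    """K&M: adjacent iff no argument has been generated in this direction yet."""
    return args_already_in_dir == 0


def strongly_adjacent(loc_h, loc_a):
    """Stronger type: adjacent iff no word lies between head and argument."""
    return abs(loc_h - loc_a) == 1


# Sentence "a b h" with structure ((h->a), (a->b)): h is at 2, a at 0, b at 1.
# a is the first (and only) left argument of h.
km = km_adjacent(args_already_in_dir=0)
strong = strongly_adjacent(loc_h=2, loc_a=0)
```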
 303 * Python-stuff
 304 # <<python>>
 305 Make those debug statements steal a bit less attention in emacs:
 306 :(font-lock-add-keywords
 307 : 'python-mode                   ; not really regexp, a bit slow
 308 : '(("^\\( *\\)\\(\\if +'.+' +in +io.DEBUG. *\\(
 309 :\\1    .+$\\)+\\)" 2 font-lock-preprocessor-face t)))
 310 :(font-lock-add-keywords
 311 : 'python-mode
 312 : '(("\\<\\(\\(io\\.\\)?debug(.+)\\)" 1 font-lock-preprocessor-face t)))
 313
 314 - [[file:src/pseudo.py][pseudo.py]]
 315 - http://nltk.org/doc/en/structured-programming.html recursive dynamic
 316 - http://nltk.org/doc/en/advanced-parsing.html
 317 - http://jaynes.colorado.edu/PythonIdioms.html
 318
 319
 320
 321 * Git
 322 Repository web page: http://repo.or.cz/w/dmvccm.git
 323
 324 Setting up a new project:
 325 : git init
 326 : git add .
 327 : git commit -m "first release"
 328
 329 Later on (=-a= stages changes to and deletions of tracked files
 330 automatically; new files still need =git add=):
 331 : git commit -a -m "some subsequent release"
 332
 333 Then push stuff up to the remote server:
 334 : git push git+ssh://username@repo.or.cz/srv/git/dmvccm.git master
 335
 336 (=eval `ssh-agent`= and =ssh-add= to avoid having to type in the passphrase all
 337 the time)
 338
 339 Make a copy of the (remote) master branch:
 340 : git clone git://repo.or.cz/dmvccm.git
 341
 342 Make and name a new branch in this folder:
 343 : git checkout -b mybranch
 344
 345 To save changes in =mybranch=:
 346 : git commit -a
 347
 348 Go back to the master branch (uncommitted changes from =mybranch= are
 349 carried over):
 350 : git checkout master
 351
 352 Try out:
 353 : git add --interactive
 354
 355 Good tutorial:
 356 http://www-cs-students.stanford.edu/~blynn//gitmagic/