DMVCCM.org

   1 # -*- coding: mule-utf-8-unix -*-
   2
   3 #+STARTUP: overview
   4 #+TAGS: OPTIMIZE PRETTIER
   5 #+STARTUP: hidestars
   6 #+TITLE: DMV/CCM -- todo-list / progress
   7 #+AUTHOR: Kevin Brubeck Unhammer
   8 #+EMAIL: K.BrubeckUnhammer at student uva nl
   9 #+OPTIONS: ^:{}  skip:t
  10 #+LANGUAGE: en
  11 #+SEQ_TODO: TOGROK TODO DONE
  12
  13 [[file:src/main.py][main.py]]
  14 [[file:src/wsjdep.py][wsjdep.py]]
  15 [[file:src/loc_h_dmv.py][loc_h_dmv.py]]
  16
  17 Meta-todo:
  18 - make test battery
  19 - change the h's to h_l in [[file:tex/formulas.pdf][formulas.pdf]]
  20
  21 [[file:DMVCCM.html][DMVCCM.html]]
  22
  23 * DMV/CCM report and project
  24 - DMV-[[file:tex/formulas.pdf][formulas.pdf]]  -- /clear/ information =D
  25 - [[file:src/main.py][main.py]] -- evaluation
  26 - [[file:src/wsjdep.py][wsjdep.py]] -- corpus
  27 - [[file:src/loc_h_dmv.py][loc_h_dmv.py]] -- DMV-IO and reestimation
  28 - [[file:src/loc_h_harmonic.py][loc_h_harmonic.py]] -- DMV initialization
  29 - [[file:src/common_dmv.py][common_dmv.py]] -- various functions used by loc_h_dmv and others
  30 - [[file:src/io.py][io.py]] -- non-DMV IO
  31 # - [[file:src/cnf_harmonic.py][cnf_harmonic.py]]
  32 # - [[file:src/cnf_dmv.py][cnf_dmv.py]]
  33
  34
  35 [[http://www.student.uib.no/~kun041/dmvccm/DMVCCM_archive.html][Archived entries]] from this file.
  36 * Notation
  37 : old notes:   new notes:   in tex/code (constants):    in Klein thesis:
  38 :--------------------------------------------------------------------------------------
  39 : _h_            _h_            SEAL                    bar over h
  40 :  h_             h><           RGOL                    right-under-left-arrow over h
  41 :  h              h>            GOR                     right-arrow over h
  42 :
  43 :               ><h             LGOR                    left-under-right-arrow over h
  44 :                <h             GOL                     left-arrow over h
  45 These are represented in the code as pairs =(s_h,h)=, where =h= is an
  46 integer (POS-tag) and =s_h= \in ={SEAL,RGOL,GOR,LGOR,GOL}=.
  47
  48 =P_ATTACH= and =P_CHOOSE= are synonymous, I try to use the
  49 former. Also,
  50 : P_GO_AT(a|h,dir,adj) := P_ATTACH(a|h,dir)*(1-P_STOP(STOP|h,dir,adj)
  51
  52 (precalculated after each reestimation with =g.p_GO_AT = make_GO_AT(g.p_STOP,g.p_ATTACH)=)
  53 ** COMMENT qtrees, tex
  54 \usepackage{qtree}
  55 \usepackage{amssymb}
  56
  57 \newcommand{\GOR}[1]{\overrightarrow{#1}}
  58 \newcommand{\RGOL}[1]{\overleftarrow{\overrightarrow{#1}}}
  59
  60 \newcommand{\SEAL}[1]{\overline{#1}}
  61
  62 \newcommand{\LGOR}[1]{\overrightarrow{\overleftarrow{#1}}}
  63 \newcommand{\GOL}[1]{\overleftarrow{#1}}
  64
  65 \Tree [.{$\RGOL{h}$} [.{$_s \SEAL{a} _t$\\
  66   Node} ] {$_{t+1} \RGOL{h} _r$\\
  67   R} ]
  68 \Tree [.{$\GOR{h}$} {$_{s} \GOR{h} _{t}$\\
  69   Node} [.{$_{t+1} \SEAL{a} _r$\\
  70   R} ] ]
  71 \Tree [.{$\RGOL{h}$} [.{$_r \SEAL{a} _{s-1}$\\
  72   L} ] {$_{s} \RGOL{h} _t$\\
  73   Node} ]
  74 \Tree [.{$\GOR{h}$} {$_{r} \GOR{h} _{s-1}$\\
  75   L} [.{$_{s} \SEAL{a} _t$\\
  76   Node} ] ]
  77
  78
  79 \Tree [.{$h\urcorner$} [.{$_s \ulcorner a\urcorner _t$\\
  80   Node} ] {$_{t+1} h\urcorner _r$\\
  81   R} ]
  82 \Tree [.{$h$} {$_{s} h _{t}$\\
  83   Node} [.{$_{t+1} \ulcorner a\urcorner _r$\\
  84   R} ] ]
  85 \Tree [.{$h\urcorner$} [.{$_r \ulcorner a\urcorner _{s-1}$\\
  86   L} ] {$_{s} h\urcorner _t$\\
  87   Node} ]
  88 \Tree [.{$h$} {$_{r} h _{s-1}$\\
  89   L} [.{$_{s} \ulcorner a\urcorner _t$\\
  90   Node} ] ]
  91 * Testing the dependency parsed WSJ
  92 [[file:src/wsjdep.py][wsjdep.py]] uses NLTK (sort of) to get a dependency parsed version of
  93 WSJ10 into the format used in mpp() in loc_h_dmv.py.
  94
  95 As a default, =WSJDepCorpusReader= looks for the file =wsj.combined.10.dep= in
  96 =../corpus/wsjdep=.
  97
  98 Only =sents()=, =tagged_sents()= and =parsed_sents()= (plus a new function
  99 =tagonly_sents()=) are implemented, the other NLTK corpus functions are
 100 ..um.. undefined...
 101 ** TODO [#A] Should =def evaluate= use add_root?
 102 [[file:src/main.py::def%20evaluate%20g%20tagonly_sents%20parsed_sents][main.py]] evaluate
 103 [[file:src/wsjdep.py][wsjdep.py]] add_root
 104
 105 (just has to count how many pairs are in there; Precision and Recall)
 106 * TODO [#C] Alternative CNF for DMV
 107
 108 # <<dmv2cnf>>
 109 - [[file:src/cnf_dmv.py][cnf_dmv.py]]
 110 - [[file:src/cnf_harmonic.py][cnf_harmonic.py]]
 111
 112 See section 5 of [[file:tex/formulas.pdf][formulas.pdf]].
 113
 114 Given a grammar with certain p_ATTACH, p_STOP and p_ROOT, we get:
 115 :>>> print testgrammar_h():
 116 :  h>< -->   h>  STOP   [0.30]
 117 :  h>< -->  >h>  STOP   [0.40]
 118 : _h_  --> STOP    h><  [1.00]
 119 : _h_  --> STOP   <h><  [1.00]
 120 : >h>  -->   h>   _h_   [1.00]
 121 : >h>  -->  >h>   _h_   [1.00]
 122 : <h>< -->  _h_    h><  [0.70]
 123 : <h>< -->  _h_   <h><  [0.60]
 124 :ROOT  --> STOP   _h_   [1.00]
 125
 126
 127 ** TODO program reestimation for cnf_dmv
 128 Not sure how this should be done; we could either do it the way it's
 129 done in Klein and Manning, or take it all the way into Lari &
 130 Young. Possibly do both, to find this stupid bug...
 131 ** TODO dmv2cnf re-estimation formulas
 132 [[file:tex/formulas.tex][tex]]
 133
 134 The reestimation function still has to sum over the various
 135 possibilities of N's and A's; but it seems to be simpler than the
 136 loc_h-method altogether.
 137
 138 Question: Would it be the same thing to reestimate using completely
 139 regular IO reestimation?
 140 ** TODO move as much as possible into common_dmv.py
 141 [[file:src/common_dmv.py][common_dmv.py]]
 142
 143 ...and improve cnf vs loc_h classes (at /least/ give them different names)
 144 ** DONE inner and outer for cnf_dmv.py, also cnf_harmonic.py
 145 * TOGROK Combine CCM with DMV
 146
 147 # <<comboquestions>>
 148
 149 Questions about the =P_COMBO= info in [[http://www.eecs.berkeley.edu/~klein/papers/klein_thesis.pdf][Klein's thesis]]:
 150 - Page 109 (pdf: 125): We have to premultiply "all our probabilities"
 151   by the CCM base product /\Pi_{<i,j>}
 152   P_{SPAN}(\alpha(i,j,s)|false)P_{CONTEXT}(\beta(i,j,s)|false)/; which
 153   probabilities are included under "all"? I'm assuming this includes
 154   =P_ATTACH= since each time =P_ATTACH= is used, /\phi/ is multiplied in
 155   (pp.110-111 ibid.); but /\phi/ is not used for STOPs, so should we not
 156   have our CCM product multiplied in there? How about =P_ROOT=?
 157   (Guessing =P_ORDER= is way out of the question...)
 158 - For the outside probabilities, is it correct to assume we multiply
 159   in /\phi(j,k)/ or /\phi(k,i)/ when calculating =inner(i,j...)=? (Eg., only
 160   for the outside part, not for the whole range.) I don't understand
 161   the notation in =O()= on p.103.
 162 * TOGROK Reestimate P_ORDER ?
 163 * Most Probable Parse
 164 ** TOGROK Find MPP with CCM
 165 ** DONE Find Most Probable Parse of given test sentence, in DMV
 166   CLOSED: [2008-07-23 Wed 10:56]
 167 inner() optionally keeps track of the highest probability children of
 168 any node in =mpptree=. Say we're looking for =inner(i,j,(s_h,h),loc_h)= in
 169 a certain sentence, and we find some possible left and right children,
 170 we add to =mpptree[i,j,(s_h,h),loc_h]= the triple =(p, L, R)= where =L= and
 171 =R= are of the same form as the key (=i,j,(s_h,h),loc_h=) and =p= is the
 172 probability of this node rewriting to =L= and =R=,
 173 eg. =inner(L)*inner(R)*p_GO_AT= or =p_STOP= or whatever. We only add this
 174 entry to =mpptree= if there wasn't a higher-probability entry there
 175 before.
 176
 177 Then, after =inner_sent= makes an =mpptree=, we find the /relevant/
 178 head-argument pairs by searching through the tree using a queue,
 179 adding the =L= and =R= keys of any entry to the queue as we find them
 180 (skipping =STOP= keys), and adding any attachment entries to a set of
 181 triples =(head,argument,dir)=. Thus we have our most probable parse,
 182 eg.
 183 : set([( ROOT, (vbd,2),RIGHT),
 184 :      ((vbd,2),(nn,1),LEFT),
 185 :      ((vbd,2),(nn,3),RIGHT),
 186 :      ((nn,1),(det,0),LEFT)])
 187 * Initialization
 188 [[file:~/Documents/Skole/V08/Probability/dmvccm/src/dmv.py::Initialization%20todo][dmv-inits]]
 189
 190 We go through the corpus, since the probabilities are based on how far
 191 away in the sentence arguments are from their heads.
 192 ** TOGROK CCM Initialization
 193 P_{SPLIT} used here... how, again?
 194 * [#C] Deferred
 195 http://wiki.python.org/moin/PythonSpeed/PerformanceTips Eg., use
 196 map/reduce/filter/[i for i in [i's]]/(i for i in [i's]) instead of
 197 for-loops; use local variables for globals (global variables or or
 198 functions), etc.
 199 ** TODO Clean up reestimation code                                    :PRETTIER:
 200 ** TODO [#A] compare speed of w_left/right(...) and w(LEFT/RIGHT, ...) :OPTIMIZE:
 201 ** TODO when reestimating P_STOP etc, remove rules with p < epsilon   :OPTIMIZE:
 202 ** TODO inner_dmv, short ranges and impossible attachment             :OPTIMIZE:
 203 If s-t <= 2, there can be only one attachment below, so don't recurse
 204 with both Lattach=True and Rattach=True.
 205
 206 If s-t <= 1, there can be no attachment below, so only recurse with
 207 Lattach=False, Rattach=False.
 208
 209 Put this in the loop under rewrite rules (could also do it in the STOP
 210 section, but that would only have an effect on very short sentences).
 211 ** TODO clean up the module files                                     :PRETTIER:
 212 Is there better way to divide dmv and harmonic? There's a two-way
 213 dependency between the modules. Guess there could be a third file that
 214 imports both the initialization and the actual EM stuff, while a file
 215 containing constants and classes could be imported by all others:
 216 : dmv.py imports dmv_EM.py imports dmv_classes.py
 217 : dmv.py imports dmv_inits.py imports dmv_classes.py
 218
 219 ** TOGROK Some (tagged) sentences are bound to come twice             :OPTIMIZE:
 220 Eg, first sort and count, so that the corpus
 221 [['nn','vbd','det','nn'],
 222  ['vbd','nn','det','nn'],
 223  ['nn','vbd','det','nn']]
 224 becomes
 225 [(['nn','vbd','det','nn'],2),
 226  (['vbd','nn','det','nn'],1)]
 227 and then in each loop through sentences, make sure we handle the
 228 frequency correctly.
 229
 230 Is there much to gain here?
 231
 232 ** TOGROK tags as numbers or tags as strings?                         :OPTIMIZE:
 233 Need to clean up the representation.
 234
 235 Stick with tag-strings in initialization then switch to numbers for
 236 IO-algorithm perhaps? Can probably afford more string-matching in
 237 initialization..
 238 * Adjacency and combining it with the inside-outside algorithm
 239 Each DMV_Rule has both a probN and a probA, for adjacencies. inner()
 240 and outer() needs the correct one in each case.
 241
 242 In each inner() call, loc_h is the location of the head of this
 243 dependency structure. In each outer() call, it's the head of the /Node/,
 244 the structure we're looking outside of.
 245
 246 We call inner() for each location of a head, and on each terminal,
 247 loc_h must equal =i= (and =loc_h+1= equal =j=). In the recursive attachment
 248 calls, we use the locations (sentence indices) of words to the left or
 249 right of the head in calls to inner(). /loc_h lets us check whether we
 250 need probN or probA/.
 251 ** Possible alternate type of adjacency
 252 K&M's adjacency is just whether or not an argument has been generated
 253 in the current direction yet. One could also make a stronger type of
 254 adjacency, where h and a are not adjacent if b is in between, eg. with
 255 the sentence "a b h" and the structure ((h->a), (a->b)), h is
 256 K&M-adjacent to a, but not next to a, since b is in between. It's easy
 257 to check this type of adjacency in inner(), but it needs new rules for
 258 P_STOP reestimation.
 259 * Python-stuff
 260 # <<python>>
 261 Make those debug statements steal a bit less attention in emacs:
 262 :(font-lock-add-keywords
 263 : 'python-mode                   ; not really regexp, a bit slow
 264 : '(("^\\( *\\)\\(\\if +'.+' +in +io.DEBUG. *\\(
 265 :\\1    .+$\\)+\\)" 2 font-lock-preprocessor-face t)))
 266 :(font-lock-add-keywords
 267 : 'python-mode
 268 : '(("\\<\\(\\(io\\.\\)?debug(.+)\\)" 1 font-lock-preprocessor-face t)))
 269
 270 - [[file:src/pseudo.py][pseudo.py]]
 271 - http://nltk.org/doc/en/structured-programming.html recursive dynamic
 272 - http://nltk.org/doc/en/advanced-parsing.html
 273 - http://jaynes.colorado.edu/PythonIdioms.html
 274
 275
 276
 277 * Git
 278 Repository web page: http://repo.or.cz/w/dmvccm.git
 279
 280 Setting up a new project:
 281 : git init
 282 : git add .
 283 : git commit -m "first release"
 284
 285 Later on: (=-a= does =git rm= and =git add= automatically)
 286 : git init
 287 : git commit -a -m "some subsequent release"
 288
 289 Then push stuff up to the remote server:
 290 : git push git+ssh://username@repo.or.cz/srv/git/dmvccm.git master
 291
 292 (=eval `ssh-agent`= and =ssh-add= to avoid having to type in keyphrase all
 293 the time)
 294
 295 Make a copy of the (remote) master branch:
 296 : git clone git://repo.or.cz/dmvccm.git
 297
 298 Make and name a new branch in this folder
 299 : git checkout -b mybranch
 300
 301 To save changes in =mybranch=:
 302 : git commit -a
 303
 304 Go back to the master branch (uncommitted changes from =mybranch= are
 305 carried over):
 306 : git checkout master
 307
 308 Try out:
 309 : git add --interactive
 310
 311 Good tutorial:
 312 http://www-cs-students.stanford.edu/~blynn//gitmagic/