1 <!DOCTYPE html PUBLIC
"-//W3C//DTD XHTML 1.0 Strict//EN"
2 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
3 <html xmlns=
"http://www.w3.org/1999/xhtml"
4 lang=
"en" xml:
lang=
"en">
6 <title>DMV/CCM
– todo-list / progress
</title>
7 <meta http-equiv=
"Content-Type" content=
"text/html;charset=utf-8"/>
8 <meta name=
"generator" content=
"Org-mode"/>
9 <meta name=
"generated" content=
"2008/06/10 12:33:43"/>
10 <meta name=
"author" content=
"Kevin Brubeck Unhammer"/>
11 <link rel=
"stylesheet" type=
"text/css" href=
"http://www.student.uib.no/~kun041/org.css">
13 <h1 class=
"title">DMV/CCM
– todo-list / progress
</h1>
16 <div id=
"table-of-contents">
17 <h2>Table of Contents
</h2>
18 <div id=
"text-table-of-contents">
20 <li><a href=
"#sec-1">1 dmvccm report and project
</a></li>
21 <li><a href=
"#sec-2">2 [#A] P_STOP and P_CHOOSE for IO/EM
</a></li>
22 <li><a href=
"#sec-3">3 Find Most Probable Parse of given test sentence
</a></li>
23 <li><a href=
"#sec-4">4 Combine CCM with DMV
</a></li>
24 <li><a href=
"#sec-5">5 Get a dependency parsed version of WSJ to test with
</a></li>
25 <li><a href=
"#sec-6">6 Initialization
</a></li>
26 <li><a href=
"#sec-7">7 [#C] Deferred
</a></li>
27 <li><a href=
"#sec-8">8 Adjacency and combining it with inner()
</a></li>
28 <li><a href=
"#sec-9">9 Expectation Maximation in IO/DMV-terms
</a></li>
29 <li><a href=
"#sec-10">10 Python-stuff
</a></li>
30 <li><a href=
"#sec-11">11 Git
</a></li>
35 <div id=
"outline-container-1" class=
"outline-2">
36 <h2 id=
"sec-1">1 dmvccm report and project
</h2>
39 <p><span class=
"timestamp-kwd">DEADLINE:
</span> <span class=
"timestamp">2008-
06-
30 Mon
</span><br/>
42 <a href=
"src/dmv.py">dmv.py
</a>
45 <a href=
"src/io.py">io.py
</a>
48 <a href=
"src/harmonic.py">harmonic.py
</a>
53 <p>(But absolute, extended, really-quite-dead-now deadline: August
31…)
58 <div id=
"outline-container-2" class=
"outline-2">
59 <h2 id=
"sec-2">2 <span class=
"todo">TODO
</span> [#A] P_STOP and P_CHOOSE for IO/EM
</h2>
62 <p><a href=
"src/dmv.py">dmv-P_STOP
</a>
63 Remember: The P
<sub>STOP
</sub> formula is upside-down (left-to-right also).
64 (In the article..not the thesis)
67 Remember: Initialization makes some
"short-cut" rules, these will also
68 have to be updated along with the other P
<sub>STOP
</sub> updates:
71 b[(NOBAR, n
<sub>h
</sub>), 'h'] =
1.0 # always
74 b[(RBAR, n
<sub>h
</sub>), 'h'] = h_.probA # h_ is RBAR stop rule
77 b[(LRBAR, n
<sub>h
</sub>), 'h'] = h_.probA * _ h_.probA
80 <li id=
"sec-2.1">P_STOP formulas for various dir and adj:
<br/>
84 P
<sub>STOP
</sub>(STOP|h,L,non_adj) =
∑<sub>corpus
</sub> ∑<sub>s
<loc(h)
</sub> ∑<sub>t
</sub>
85 inner(s,t,_h_,
…) /
∑<sub>corpus
</sub> ∑<sub>s
<loc(h)
</sub> ∑<sub>t
</sub>
86 inner(s,t,h_,
…)
89 P
<sub>STOP
</sub>(STOP|h,L,adj) =
∑<sub>corpus
</sub> ∑<sub>s=loc(h)
</sub> ∑<sub>t
</sub>
90 inner(s,t,_h_,
…) /
∑<sub>corpus
</sub> ∑<sub>s=loc(h)
</sub> ∑<sub>t
</sub>
91 inner(s,t,h_,
…)
94 P
<sub>STOP
</sub>(STOP|h,R,non_adj) =
∑<sub>corpus
</sub> ∑<sub>s
</sub> ∑<sub>t
>loc(h)
</sub>
95 inner(s,t,_h_,
…) /
∑<sub>corpus
</sub> ∑<sub>s
</sub> ∑<sub>t
>loc(h)
</sub>
96 inner(s,t,h_,
…)
99 P
<sub>STOP
</sub>(STOP|h,R,adj) =
∑<sub>corpus
</sub> ∑<sub>s
</sub> ∑<sub>t=loc(h)
</sub>
100 inner(s,t,_h_,
…) /
∑<sub>corpus
</sub> ∑<sub>s
</sub> ∑<sub>t=loc(h)
</sub>
101 inner(s,t,h_,
…)
106 <p>(And P
<sub>STOP
</sub>(-STOP|
…) =
1 - P
<sub>STOP
</sub>(STOP|
…) )
108 <li id=
"sec-2.2">P_CHOOSE formula:
<br/>
112 P
<sub>CHOOSE
</sub>(a|h,L) =
∑<sub>corpus
</sub> ∑<sub>s
<loc(h)
</sub> ∑<sub>t
>=loc(h)
</sub> ∑<sub>r
<t
</sub>
113 inner(s,r,_a_,
…) /
∑<sub>corpus
</sub> ∑<sub>s
<loc(h)
</sub> ∑<sub>t
>=loc(h)
</sub>
114 inner(s,t,h_,
…)
117 t
>= loc(h) since there are many possibilites for
118 right-attachments below, and each of them alone gives a lower
119 probability (through multiplication) to the upper tree (so sum
125 P
<sub>CHOOSE
</sub>(a|h,R) =
∑<sub>corpus
</sub> ∑<sub>s=loc(h)
</sub> ∑<sub>r
>s
</sub>
126 inner(s,r,_a_,
…) /
∑<sub>corpus
</sub> ∑<sub>s=loc(h)
</sub> ∑<sub>t
</sub>
127 inner(s,t,h,
…)
138 <div id=
"outline-container-3" class=
"outline-2">
139 <h2 id=
"sec-3">3 <span class=
"todo">TOGROK
</span> Find Most Probable Parse of given test sentence
</h2>
142 <p>We could probably do it pretty easily from the chart:
145 for all loc_h, select the (
0,len(sent),ROOT,loc_h) that has the highest p
148 for stops, just keep going
151 for rewrites, for loc_dep select the one that has the highest p
158 <div id=
"outline-container-4" class=
"outline-2">
159 <h2 id=
"sec-4">4 <span class=
"todo">TOGROK
</span> Combine CCM with DMV
</h2>
166 <div id=
"outline-container-5" class=
"outline-2">
167 <h2 id=
"sec-5">5 <span class=
"todo">TODO
</span> Get a dependency parsed version of WSJ to test with
</h2>
174 <div id=
"outline-container-6" class=
"outline-2">
175 <h2 id=
"sec-6">6 Initialization
</h2>
178 <p><a href=
"/Users/kiwibird/Documents/Skole/V08/Probability/dmvccm/src/dmv.py">dmv-inits
</a>
181 We do have to go through the corpus, since the probabilities are based
182 on how far away in the sentence arguments are from their heads.
184 <li id=
"sec-6.1"><span class=
"todo">TOGROK
</span> CCM Initialization
<br/>
185 P
<sub>SPLIT
</sub> used here
… how, again?
187 <li id=
"sec-6.2"><span class=
"done">DONE
</span> Separate initialization to another file?
<span class=
"tag">PRETTIER
</span><br/>
188 <span class=
"timestamp-kwd">CLOSED:
</span> <span class=
"timestamp">2008-
06-
08 Sun
12:
51</span><br/>
189 <a href=
"src/harmonic.py">harmonic.py
</a>
191 <li id=
"sec-6.3"><span class=
"done">DONE
</span> DMV Initialization probabilities
<br/>
192 (from initialization frequency)
194 <li id=
"sec-6.4"><span class=
"done">DONE
</span> DMV Initialization frequencies
<br/>
195 <span class=
"timestamp-kwd">CLOSED:
</span> <span class=
"timestamp">2008-
05-
27 Tue
20:
04</span><br/>
197 <li id=
"sec-6.4.1">P_STOP
<br/>
198 P
<sub>STOP
</sub> is not well defined by K
&M. One possible interpretation given
199 the sentence [det nn vb nn] is
201 f_{STOP}( STOP|det, L, adj) +
1
202 f_{STOP}(-STOP|det, L, adj) +
0
203 f_{STOP}( STOP|det, L, non_adj) +
1
204 f_{STOP}(-STOP|det, L, non_adj) +
0
205 f_{STOP}( STOP|det, R, adj) +
0
206 f_{STOP}(-STOP|det, R, adj) +
1
208 f_{STOP}( STOP|nn, L, adj) +
0
209 f_{STOP}(-STOP|nn, L, adj) +
1
210 f_{STOP}( STOP|nn, L, non_adj) +
1 # since there's at least one to the left
211 f_{STOP}(-STOP|nn, L, non_adj) +
0
214 <li id=
"sec-6.4.1.1"><span class=
"todo">TODO
</span> tweak
<br/>
216 f[head, 'STOP', 'LN'] += (i_h
<=
1) # first two words
217 f[head, '-STOP', 'LN'] += (not i_h
<=
1)
218 f[head, 'STOP', 'LA'] += (i_h ==
0) # very first word
219 f[head, '-STOP', 'LA'] += (not i_h ==
0)
220 f[head, 'STOP', 'RN'] += (i_h
>= n -
2) # last two words
221 f[head, '-STOP', 'RN'] += (not i_h
>= n -
2)
222 f[head, 'STOP', 'RA'] += (i_h == n -
1) # very last word
223 f[head, '-STOP', 'RA'] += (not i_h == n -
1)
227 # this one requires some additional rewriting since it
228 # introduces divisions by zero
229 f[head, 'STOP', 'LN'] += (i_h ==
1) # second word
230 f[head, '-STOP', 'LN'] += (not i_h
<=
1) # not first two
231 f[head, 'STOP', 'LA'] += (i_h ==
0) # first word
232 f[head, '-STOP', 'LA'] += (not i_h ==
0) # not first
233 f[head, 'STOP', 'RN'] += (i_h == n -
2) # second-to-last
234 f[head, '-STOP', 'RN'] += (not i_h
>= n -
2) # not last two
235 f[head, 'STOP', 'RA'] += (i_h == n -
1) # last word
236 f[head, '-STOP', 'RA'] += (not i_h == n -
1) # not last
240 f[head, 'STOP', 'LN'] += (i_h ==
1) # second word
241 f[head, '-STOP', 'LN'] += (not i_h ==
1) # not second
242 f[head, 'STOP', 'LA'] += (i_h ==
0) # first word
243 f[head, '-STOP', 'LA'] += (not i_h ==
0) # not first
244 f[head, 'STOP', 'RN'] += (i_h == n -
2) # second-to-last
245 f[head, '-STOP', 'RN'] += (not i_h == n -
2) # not second-to-last
246 f[head, 'STOP', 'RA'] += (i_h == n -
1) # last word
247 f[head, '-STOP', 'RA'] += (not i_h == n -
1) # not last
250 "all words take the same number of arguments" interpreted as
253 p_STOP(head, 'STOP', 'LN') =
0.3
254 p_STOP(head, 'STOP', 'LA') =
0.5
255 p_STOP(head, 'STOP', 'RN') =
0.4
256 p_STOP(head, 'STOP', 'RA') =
0.7
258 (which we easily may tweak in init_zeros())
262 <li id=
"sec-6.4.2">P_CHOOSE
<br/>
263 Go through the corpus, counting distances between heads and
264 arguments. In [det nn vb nn], we give
267 f
<sub>CHOOSE
</sub>(nn|det, R) +
1/
1 + C
270 f
<sub>CHOOSE
</sub>(vb|det, R) +
1/
2 + C
273 f
<sub>CHOOSE
</sub>(nn|det, R) +
1/
3 + C
276 If this were the full corpus, P
<sub>CHOOSE
</sub>(nn|det, R) would have
277 (
1+
1/
3+
2C) / sum_a f
<sub>CHOOSE
</sub>(a|det, R)
284 <p>The ROOT gets
"each argument with equal probability", so in a sentence
285 of three words,
1/
3 for each (in [nn vb nn], 'nn' gets
2/
3). Basically
286 just a frequency count of the corpus
…
295 <div id=
"outline-container-7" class=
"outline-2">
296 <h2 id=
"sec-7">7 [#C] Deferred
</h2>
299 <p><a href=
"http://wiki.python.org/moin/PythonSpeed/PerformanceTips">http://wiki.python.org/moin/PythonSpeed/PerformanceTips
</a> Eg., use
300 map/reduce/filter/[i for i in [i's]]/(i for i in [i's]) instead of
301 for-loops; use local variables for globals (global variables or or
305 <li id=
"sec-7.1"><span class=
"todo">TODO
</span> if loc_h == t, no need to try right-attachment rules
<span class=
"tag">OPTIMIZE
</span><br/>
306 ** TODO if loc_h == s, no need to try left-attachment rules :OPTIMIZE:
308 <li id=
"sec-7.3"><span class=
"todo">TODO
</span> when reestimating P_STOP etc, remove rules with p
< epsilon
<span class=
"tag">OPTIMIZE
</span><br/>
310 <li id=
"sec-7.4"><span class=
"todo">TODO
</span> inner_dmv, short ranges and impossible attachment
<span class=
"tag">OPTIMIZE
</span><br/>
311 If s-t
<=
2, there can be only one attachment below, so don't recurse
312 with both Lattach=True and Rattach=True.
315 If s-t
<=
1, there can be no attachment below, so only recurse with
316 Lattach=False, Rattach=False.
319 Put this in the loop under rewrite rules (could also do it in the STOP
320 section, but that would only have an effect on very short sentences).
322 <li id=
"sec-7.5"><span class=
"todo">TODO
</span> clean up the module files
<span class=
"tag">PRETTIER
</span><br/>
323 Is there better way to divide dmv and harmonic? There's a two-way
324 dependency between the modules. Guess there could be a third file that
325 imports both the initialization and the actual EM stuff, while a file
326 containing constants and classes could be imported by all others:
328 dmv.py imports dmv_EM.py imports dmv_classes.py
329 dmv.py imports dmv_inits.py imports dmv_classes.py
333 <li id=
"sec-7.6"><span class=
"todo">TOGROK
</span> Some (tagged) sentences are bound to come twice
<span class=
"tag">OPTIMIZE
</span><br/>
334 Eg, first sort and count, so that the corpus
335 [['nn','vbd','det','nn'],
336 ['vbd','nn','det','nn'],
337 ['nn','vbd','det','nn']]
339 [(['nn','vbd','det','nn'],
2),
340 (['vbd','nn','det','nn'],
1)]
341 and then in each loop through sentences, make sure we handle the
345 Is there much to gain here?
348 <li id=
"sec-7.7"><span class=
"todo">TOGROK
</span> tags as numbers or tags as strings?
<span class=
"tag">OPTIMIZE
</span><br/>
349 Need to clean up the representation.
352 Stick with tag-strings in initialization then switch to numbers for
353 IO-algorithm perhaps? Can probably afford more string-matching in
356 <li id=
"sec-7.8"><span class=
"done">DONE
</span> io.debug parameters should not call functions
<span class=
"tag">OPTIMIZE
</span><br/>
357 <span class=
"timestamp-kwd">CLOSED:
</span> <span class=
"timestamp">2008-
06-
10 Tue
12:
26</span><br/>
358 Exchanged all io.debug(str,'level') calls with statements of the form:
360 if 'level' in io.DEBUG:
365 and got an almost threefold speed increase on inner().
367 <li id=
"sec-7.9"><span class=
"done">DONE
</span> inner_dmv() should disregard rules with heads not in sent
<span class=
"tag">OPTIMIZE
</span><br/>
368 <span class=
"timestamp-kwd">CLOSED:
</span> <span class=
"timestamp">2008-
06-
08 Sun
10:
18</span><br/>
369 If the sentence is
"nn vbd det nn", we should not even look at rules
372 rule.head() not in
"nn vbd det nn".split()
374 This is ruled out by getting rules from g.rules(LHS, sent).
377 Also, we optimize this further by saying we don't even recurse into
378 attachment rules where
380 rule.head() not in sent[ s :r+
1]
381 rule.head() not in sent[r+
1:t+
1]
383 meaning, if we're looking at the span
"vbd det", we only use
384 attachment rules where both daughters are members of ['vbd','det']
385 (although we don't (yet) care about removing rules that rewrite to the
386 same tag if there are no duplicate tags in the span, etc., that would
387 be a lot of trouble for little potential gain).
394 <div id=
"outline-container-8" class=
"outline-2">
395 <h2 id=
"sec-8">8 Adjacency and combining it with inner()
</h2>
398 <p>Each DMV_Rule now has both a probN and a probA, for
399 adjacencies. inner() needs the correct one in each case.
402 Adjacency gives a problem with duplicate words/tags, eg. in the
403 sentence
"a a b". If this has the dependency structure b-
>a
<sub>0</sub>-
>a
<sub>1</sub>,
404 then b is non-adjacent to a
<sub>0</sub> and should use probN (for the LRStop and
405 the attachment of a
<sub>0</sub>), while the other rules should all use
406 probA. But within the e(
0,
2,b) we can't just say
"oh, a has index 0
407 so it's not adjacent to 2", since there's also an a at index
1, and
408 there's also a dependency structure b-
>a
<sub>1</sub>-
>a
<sub>0</sub> for that. We want
409 both. And in possibly much more complex versions.
412 So now we call inner() for each location of a head, and on each
413 terminal, loc_h must equal s (and t). In the recursive attachment
414 calls, we use the locations (sentence indices) of words to the left or
415 right of the head in calls to inner(). loc_h lets us check whether we
419 In each inner() call, loc_h is the location of the root of this
420 dependency structure.
422 <li id=
"sec-8.1">Possible alternate type of adjacency
<br/>
423 K
&M's adjacency is just whether or not an argument has been generated
424 in the current direction yet. One could also make a stronger type of
425 adjacency, where h and a are not adjacent if b is in between, eg. with
426 the sentence
"a b h" and the structure ((h-
>a), (a-
>b)), h is
427 K
&M-adjacent to a, but not next to a, since b is in between. It's easy
428 to check this type of adjacency in inner(), but it needs new rules for
436 <div id=
"outline-container-9" class=
"outline-2">
437 <h2 id=
"sec-9">9 Expectation Maximation in IO/DMV-terms
</h2>
440 <p>inner(s,t,LHS) calculates the expected number of trees headed by LHS
441 from s to t (sentence positions). This uses the P_STOP and P_CHOOSE
442 values, which have been conveniently distributed into CNF rules as
443 probN and probA (non-adjacent and adjacent probabilites).
446 When re-estimating, we use the expected values from inner() to get new
447 values for P_STOP and P_CHOOSE. When we've re-estimated for the entire
448 corpus, we distribute P_STOP and P_CHOOSE into the CNF rules again, so
449 that in the next round we use new probN and probA to find
453 The distribution of P_STOP and P_CHOOSE into CNF rules also happens in
454 init_normalize() (here along with the creation of P_STOP and
455 P_CHOOSE); P_STOP is used to create CNF rules where one branch of the
456 rule is STOP, P_CHOOSE is used to create rules of the form
463 Since
"adjacency" is not captured in regular CNF rules, we need two
464 probabilites for each rule, and inner() has to know when to use which.
467 <li id=
"sec-9.1"><span class=
"todo">TODO
</span> Corpus access
<br/>
469 <li id=
"sec-9.2"><span class=
"todo">TOGROK
</span> sentences or rules as the
"outer loop"?
<span class=
"tag">OPTIMIZE
</span><br/>
470 In regard to the E/M-step, finding P
<sub>STOP
</sub>, P
<sub>CHOOSE
</sub>.
479 <div id=
"outline-container-10" class=
"outline-2">
480 <h2 id=
"sec-10">10 Python-stuff
</h2>
483 <p>Make those debug statements steal a bit less attention in emacs:
485 (font-lock-add-keywords
486 'python-mode ; not really regexp, a bit slow
487 '((
"^\\( *\\)\\(\\if +'.+' +in +io.DEBUG. *\\(
488 \\1 .+$\\)+\\)" 2 font-lock-preprocessor-face t)))
489 (font-lock-add-keywords
491 '((
"\\<\\(\\(io\\.\\)?debug(.+)\\)" 1 font-lock-preprocessor-face t)))
496 <a href=
"src/pseudo.py">pseudo.py
</a>
499 <a href=
"http://nltk.org/doc/en/structured-programming.html">http://nltk.org/doc/en/structured-programming.html
</a> recursive dynamic
502 <a href=
"http://nltk.org/doc/en/advanced-parsing.html">http://nltk.org/doc/en/advanced-parsing.html
</a>
505 <a href=
"http://jaynes.colorado.edu/PythonIdioms.html">http://jaynes.colorado.edu/PythonIdioms.html
</a>
515 <div id=
"outline-container-11" class=
"outline-2">
516 <h2 id=
"sec-11">11 Git
</h2>
519 <p>Repository web page:
<a href=
"http://repo.or.cz/w/dmvccm.git">http://repo.or.cz/w/dmvccm.git
</a>
522 Setting up a new project:
526 git commit -m
"first release"
530 Later on: (
<code>-a
</code> does
<code>git rm
</code> and
<code>git add
</code> automatically)
533 git commit -a -m
"some subsequent release"
537 Then push stuff up to the remote server:
539 git push git+ssh://username@repo.or.cz/srv/git/dmvccm.git master
543 (
<code>eval `ssh-agent`
</code> and
<code>ssh-add
</code> to avoid having to type in keyphrase all
547 Make a copy of the (remote) master branch:
549 git clone git://repo.or.cz/dmvccm.git
553 Make and name a new branch in this folder
555 git checkout -b mybranch
559 To save changes in
<code>mybranch
</code>:
565 Go back to the master branch (uncommitted changes from
<code>mybranch
</code> are
574 git add --interactive
579 <a href=
"http://www-cs-students.stanford.edu/~blynn//gitmagic/">http://www-cs-students.stanford.edu/~blynn//gitmagic/
</a>
582 <div id=
"postamble"><p class=
"author"> Author: Kevin Brubeck Unhammer
583 <a href=
"mailto:K.BrubeckUnhammer at student uva nl "><K.BrubeckUnhammer at student uva nl
></a>
585 <p class=
"date"> Date:
2008/
06/
10 12:
33:
43</p>
586 </div><p class=
"postamble">Skrive vha. emacs +
<a href='http://orgmode.org/'
>org-mode
</a></p></body>