1 <!DOCTYPE html PUBLIC
"-//W3C//DTD XHTML 1.0 Strict//EN"
2 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
3 <html xmlns=
"http://www.w3.org/1999/xhtml"
4 lang=
"en" xml:
lang=
"en">
6 <title>DMV/CCM
– todo-list / progress
</title>
7 <meta http-equiv=
"Content-Type" content=
"text/html;charset=utf-8"/>
8 <meta name=
"generator" content=
"Org-mode"/>
9 <meta name=
"generated" content=
"2008/06/09 13:04:26"/>
10 <meta name=
"author" content=
"Kevin Brubeck Unhammer"/>
11 <link rel=
"stylesheet" type=
"text/css" href=
"http://www.student.uib.no/~kun041/org.css">
13 <h1 class=
"title">DMV/CCM
– todo-list / progress
</h1>
16 <div id=
"table-of-contents">
17 <h2>Table of Contents
</h2>
18 <div id=
"text-table-of-contents">
20 <li><a href=
"#sec-1">1 dmvccm report and project
</a></li>
21 <li><a href=
"#sec-2">2 [#A] P_STOP and P_CHOOSE for IO/EM
</a></li>
22 <li><a href=
"#sec-3">3 Find Most Probable Parse of given test sentence
</a></li>
23 <li><a href=
"#sec-4">4 Combine CCM with DMV
</a></li>
24 <li><a href=
"#sec-5">5 Get a dependency parsed version of WSJ to test with
</a></li>
25 <li><a href=
"#sec-6">6 Initialization
</a></li>
26 <li><a href=
"#sec-7">7 [#C] Deferred
</a></li>
27 <li><a href=
"#sec-8">8 Adjacency and combining it with inner()
</a></li>
28 <li><a href=
"#sec-9">9 Expectation Maximation in IO/DMV-terms
</a></li>
29 <li><a href=
"#sec-10">10 Python-stuff
</a></li>
30 <li><a href=
"#sec-11">11 Git
</a></li>
35 <div id=
"outline-container-1" class=
"outline-2">
36 <h2 id=
"sec-1">1 dmvccm report and project
</h2>
39 <p><span class=
"timestamp-kwd">DEADLINE:
</span> <span class=
"timestamp">2008-
06-
30 Mon
</span><br/>
40 But absolute, extended, really-quite-dead-now deadline: August
44 <a href=
"src/dmv.py">dmv.py
</a>
47 <a href=
"src/io.py">io.py
</a>
50 <a href=
"src/harmonic.py">harmonic.py
</a>
59 <div id=
"outline-container-2" class=
"outline-2">
60 <h2 id=
"sec-2">2 <span class=
"todo">TODO
</span> [#A] P_STOP and P_CHOOSE for IO/EM
</h2>
63 <p><a href=
"src/dmv.py">dmv-P_STOP
</a>
64 Remember: The P
<sub>STOP
</sub> formula is upside-down (left-to-right also).
65 (In the article..not the thesis)
68 Remember: Initialization makes some
"short-cut" rules, these will also
69 have to be updated along with the other P
<sub>STOP
</sub> updates:
72 b[(NOBAR, n
<sub>h
</sub>), 'h'] =
1.0 # always
75 b[(RBAR, n
<sub>h
</sub>), 'h'] = h_.probA # h_ is RBAR stop rule
78 b[(LRBAR, n
<sub>h
</sub>), 'h'] = h_.probA * _ h_.probA
81 <li id=
"sec-2.1">P_STOP formulas for various dir and adj:
<br/>
85 P
<sub>STOP
</sub>(STOP|h,L,non_adj) =
∑<sub>corpus
</sub> ∑<sub>s
<loc(h)
</sub> ∑<sub>t
</sub>
86 inner(s,t,_h_,
…) /
∑<sub>corpus
</sub> ∑<sub>s
<loc(h)
</sub> ∑<sub>t
</sub>
87 inner(s,t,h_,
…)
90 P
<sub>STOP
</sub>(STOP|h,L,adj) =
∑<sub>corpus
</sub> ∑<sub>s=loc(h)
</sub> ∑<sub>t
</sub>
91 inner(s,t,_h_,
…) /
∑<sub>corpus
</sub> ∑<sub>s=loc(h)
</sub> ∑<sub>t
</sub>
92 inner(s,t,h_,
…)
95 P
<sub>STOP
</sub>(STOP|h,R,non_adj) =
∑<sub>corpus
</sub> ∑<sub>s
</sub> ∑<sub>t
>loc(h)
</sub>
96 inner(s,t,_h_,
…) /
∑<sub>corpus
</sub> ∑<sub>s
</sub> ∑<sub>t
>loc(h)
</sub>
97 inner(s,t,h_,
…)
100 P
<sub>STOP
</sub>(STOP|h,R,adj) =
∑<sub>corpus
</sub> ∑<sub>s
</sub> ∑<sub>t=loc(h)
</sub>
101 inner(s,t,_h_,
…) /
∑<sub>corpus
</sub> ∑<sub>s
</sub> ∑<sub>t=loc(h)
</sub>
102 inner(s,t,h_,
…)
107 <p>(And P
<sub>STOP
</sub>(-STOP|
…) =
1 - P
<sub>STOP
</sub>(STOP|
…) )
109 <li id=
"sec-2.2">P_CHOOSE formula:
<br/>
113 P
<sub>CHOOSE
</sub>(a|h,L) =
∑<sub>corpus
</sub> ∑<sub>s
<loc(h)
</sub> ∑<sub>t
>=loc(h)
</sub> ∑<sub>r
<t
</sub>
114 inner(s,r,_a_,
…) /
∑<sub>corpus
</sub> ∑<sub>s
<loc(h)
</sub> ∑<sub>t
>=loc(h)
</sub>
115 inner(s,t,h_,
…)
118 t
>= loc(h) since there are many possibilites for
119 right-attachments below, and each of them alone gives a lower
120 probability (through multiplication) to the upper tree (so sum
126 P
<sub>CHOOSE
</sub>(a|h,R) =
∑<sub>corpus
</sub> ∑<sub>s=loc(h)
</sub> ∑<sub>r
>s
</sub>
127 inner(s,r,_a_,
…) /
∑<sub>corpus
</sub> ∑<sub>s=loc(h)
</sub> ∑<sub>t
</sub>
128 inner(s,t,h,
…)
139 <div id=
"outline-container-3" class=
"outline-2">
140 <h2 id=
"sec-3">3 <span class=
"todo">TOGROK
</span> Find Most Probable Parse of given test sentence
</h2>
143 <p>We could probably do it pretty easily from the chart:
146 for all loc_h, select the (
0,len(sent),ROOT,loc_h) that has the highest p
149 for stops, just keep going
152 for rewrites, for loc_dep select the one that has the highest p
159 <div id=
"outline-container-4" class=
"outline-2">
160 <h2 id=
"sec-4">4 <span class=
"todo">TOGROK
</span> Combine CCM with DMV
</h2>
167 <div id=
"outline-container-5" class=
"outline-2">
168 <h2 id=
"sec-5">5 <span class=
"todo">TODO
</span> Get a dependency parsed version of WSJ to test with
</h2>
175 <div id=
"outline-container-6" class=
"outline-2">
176 <h2 id=
"sec-6">6 Initialization
</h2>
179 <p><a href=
"/Users/kiwibird/Documents/Skole/V08/Probability/dmvccm/src/dmv.py">dmv-inits
</a>
182 We do have to go through the corpus, since the probabilities are based
183 on how far away in the sentence arguments are from their heads.
185 <li id=
"sec-6.1"><span class=
"todo">TOGROK
</span> CCM Initialization
<br/>
186 P
<sub>SPLIT
</sub> used here
… how, again?
188 <li id=
"sec-6.2"><span class=
"done">DONE
</span> Separate initialization to another file?
<span class=
"tag">PRETTIER
</span><br/>
189 <span class=
"timestamp-kwd">CLOSED:
</span> <span class=
"timestamp">2008-
06-
08 Sun
12:
51</span><br/>
190 <a href=
"src/harmonic.py">harmonic.py
</a>
192 <li id=
"sec-6.3"><span class=
"done">DONE
</span> DMV Initialization probabilities
<br/>
193 (from initialization frequency)
195 <li id=
"sec-6.4"><span class=
"done">DONE
</span> DMV Initialization frequencies
<br/>
196 <span class=
"timestamp-kwd">CLOSED:
</span> <span class=
"timestamp">2008-
05-
27 Tue
20:
04</span><br/>
198 <li id=
"sec-6.4.1">P_STOP
<br/>
199 P
<sub>STOP
</sub> is not well defined by K
&M. One possible interpretation given
200 the sentence [det nn vb nn] is
202 f_{STOP}( STOP|det, L, adj) +
1
203 f_{STOP}(-STOP|det, L, adj) +
0
204 f_{STOP}( STOP|det, L, non_adj) +
1
205 f_{STOP}(-STOP|det, L, non_adj) +
0
206 f_{STOP}( STOP|det, R, adj) +
0
207 f_{STOP}(-STOP|det, R, adj) +
1
209 f_{STOP}( STOP|nn, L, adj) +
0
210 f_{STOP}(-STOP|nn, L, adj) +
1
211 f_{STOP}( STOP|nn, L, non_adj) +
1 # since there's at least one to the left
212 f_{STOP}(-STOP|nn, L, non_adj) +
0
215 <li id=
"sec-6.4.1.1"><span class=
"todo">TODO
</span> tweak
<br/>
217 f[head, 'STOP', 'LN'] += (i_h
<=
1) # first two words
218 f[head, '-STOP', 'LN'] += (not i_h
<=
1)
219 f[head, 'STOP', 'LA'] += (i_h ==
0) # very first word
220 f[head, '-STOP', 'LA'] += (not i_h ==
0)
221 f[head, 'STOP', 'RN'] += (i_h
>= n -
2) # last two words
222 f[head, '-STOP', 'RN'] += (not i_h
>= n -
2)
223 f[head, 'STOP', 'RA'] += (i_h == n -
1) # very last word
224 f[head, '-STOP', 'RA'] += (not i_h == n -
1)
228 # this one requires some additional rewriting since it
229 # introduces divisions by zero
230 f[head, 'STOP', 'LN'] += (i_h ==
1) # second word
231 f[head, '-STOP', 'LN'] += (not i_h
<=
1) # not first two
232 f[head, 'STOP', 'LA'] += (i_h ==
0) # first word
233 f[head, '-STOP', 'LA'] += (not i_h ==
0) # not first
234 f[head, 'STOP', 'RN'] += (i_h == n -
2) # second-to-last
235 f[head, '-STOP', 'RN'] += (not i_h
>= n -
2) # not last two
236 f[head, 'STOP', 'RA'] += (i_h == n -
1) # last word
237 f[head, '-STOP', 'RA'] += (not i_h == n -
1) # not last
241 f[head, 'STOP', 'LN'] += (i_h ==
1) # second word
242 f[head, '-STOP', 'LN'] += (not i_h ==
1) # not second
243 f[head, 'STOP', 'LA'] += (i_h ==
0) # first word
244 f[head, '-STOP', 'LA'] += (not i_h ==
0) # not first
245 f[head, 'STOP', 'RN'] += (i_h == n -
2) # second-to-last
246 f[head, '-STOP', 'RN'] += (not i_h == n -
2) # not second-to-last
247 f[head, 'STOP', 'RA'] += (i_h == n -
1) # last word
248 f[head, '-STOP', 'RA'] += (not i_h == n -
1) # not last
251 "all words take the same number of arguments" interpreted as
254 p_STOP(head, 'STOP', 'LN') =
0.3
255 p_STOP(head, 'STOP', 'LA') =
0.5
256 p_STOP(head, 'STOP', 'RN') =
0.4
257 p_STOP(head, 'STOP', 'RA') =
0.7
259 (which we easily may tweak in init_zeros())
263 <li id=
"sec-6.4.2">P_CHOOSE
<br/>
264 Go through the corpus, counting distances between heads and
265 arguments. In [det nn vb nn], we give
268 f
<sub>CHOOSE
</sub>(nn|det, R) +
1/
1 + C
271 f
<sub>CHOOSE
</sub>(vb|det, R) +
1/
2 + C
274 f
<sub>CHOOSE
</sub>(nn|det, R) +
1/
3 + C
277 If this were the full corpus, P
<sub>CHOOSE
</sub>(nn|det, R) would have
278 (
1+
1/
3+
2C) / sum_a f
<sub>CHOOSE
</sub>(a|det, R)
285 <p>The ROOT gets
"each argument with equal probability", so in a sentence
286 of three words,
1/
3 for each (in [nn vb nn], 'nn' gets
2/
3). Basically
287 just a frequency count of the corpus
…
296 <div id=
"outline-container-7" class=
"outline-2">
297 <h2 id=
"sec-7">7 [#C] Deferred
</h2>
301 <li id=
"sec-7.1"><span class=
"todo">TODO
</span> if loc_h == t, no need to try right-attachment rules
<span class=
"tag">OPTIMIZE
</span><br/>
302 ** TODO if loc_h == s, no need to try left-attachment rules :OPTIMIZE:
304 <li id=
"sec-7.3"><span class=
"todo">TODO
</span> when reestimating P_STOP etc, remove rules with p
< epsilon
<span class=
"tag">OPTIMIZE
</span><br/>
306 <li id=
"sec-7.4"><span class=
"todo">TODO
</span> inner_dmv, short ranges and impossible attachment
<span class=
"tag">OPTIMIZE
</span><br/>
307 If s-t
<=
2, there can be only one attachment below, so don't recurse
308 with both Lattach=True and Rattach=True.
311 If s-t
<=
1, there can be no attachment below, so only recurse with
312 Lattach=False, Rattach=False.
315 Put this in the loop under rewrite rules (could also do it in the STOP
316 section, but that would only have an effect on very short sentences).
318 <li id=
"sec-7.5"><span class=
"todo">TODO
</span> clean up the module files
<span class=
"tag">PRETTIER
</span><br/>
319 Is there better way to divide dmv and harmonic? There's a two-way
320 dependency between the modules. Guess there could be a third file that
321 imports both the initialization and the actual EM stuff, while a file
322 containing constants and classes could be imported by all others:
324 dmv.py imports dmv_EM.py imports dmv_classes.py
325 dmv.py imports dmv_inits.py imports dmv_classes.py
329 <li id=
"sec-7.6"><span class=
"todo">TOGROK
</span> Some (tagged) sentences are bound to come twice
<span class=
"tag">OPTIMIZE
</span><br/>
330 Eg, first sort and count, so that the corpus
331 [['nn','vbd','det','nn'],
332 ['vbd','nn','det','nn'],
333 ['nn','vbd','det','nn']]
335 [(['nn','vbd','det','nn'],
2),
336 (['vbd','nn','det','nn'],
1)]
337 and then in each loop through sentences, make sure we handle the
341 Is there much to gain here?
344 <li id=
"sec-7.7"><span class=
"todo">TOGROK
</span> tags as numbers or tags as strings?
<span class=
"tag">OPTIMIZE
</span><br/>
345 Need to clean up the representation.
348 Stick with tag-strings in initialization then switch to numbers for
349 IO-algorithm perhaps? Can probably afford more string-matching in
352 <li id=
"sec-7.8"><span class=
"done">DONE
</span> inner_dmv() should disregard rules with heads not in sent
<span class=
"tag">OPTIMIZE
</span><br/>
353 <span class=
"timestamp-kwd">CLOSED:
</span> <span class=
"timestamp">2008-
06-
08 Sun
10:
18</span><br/>
354 If the sentence is
"nn vbd det nn", we should not even look at rules
357 rule.head() not in
"nn vbd det nn".split()
359 This is ruled out by getting rules from g.rules(LHS, sent).
362 Also, we optimize this further by saying we don't even recurse into
363 attachment rules where
365 rule.head() not in sent[ s :r+
1]
366 rule.head() not in sent[r+
1:t+
1]
368 meaning, if we're looking at the span
"vbd det", we only use
369 attachment rules where both daughters are members of ['vbd','det']
370 (although we don't (yet) care about removing rules that rewrite to the
371 same tag if there are no duplicate tags in the span, etc., that would
372 be a lot of trouble for little potential gain).
379 <div id=
"outline-container-8" class=
"outline-2">
380 <h2 id=
"sec-8">8 Adjacency and combining it with inner()
</h2>
383 <p>Each DMV_Rule now has both a probN and a probA, for
384 adjacencies. inner() needs the correct one in each case.
387 Adjacency gives a problem with duplicate words/tags, eg. in the
388 sentence
"a a b". If this has the dependency structure b-
>a
<sub>0</sub>-
>a
<sub>1</sub>,
389 then b is non-adjacent to a
<sub>0</sub> and should use probN (for the LRStop and
390 the attachment of a
<sub>0</sub>), while the other rules should all use
391 probA. But within the e(
0,
2,b) we can't just say
"oh, a has index 0
392 so it's not adjacent to 2", since there's also an a at index
1, and
393 there's also a dependency structure b-
>a
<sub>1</sub>-
>a
<sub>0</sub> for that. We want
394 both. And in possibly much more complex versions.
397 So now we call inner() for each location of a head, and on each
398 terminal, loc_h must equal s (and t). In the recursive attachment
399 calls, we use the locations (sentence indices) of words to the left or
400 right of the head in calls to inner(). loc_h lets us check whether we
404 In each inner() call, loc_h is the location of the root of this
405 dependency structure.
407 <li id=
"sec-8.1">Possible alternate type of adjacency
<br/>
408 K
&M's adjacency is just whether or not an argument has been generated
409 in the current direction yet. One could also make a stronger type of
410 adjacency, where h and a are not adjacent if b is in between, eg. with
411 the sentence
"a b h" and the structure ((h-
>a), (a-
>b)), h is
412 K
&M-adjacent to a, but not next to a, since b is in between. It's easy
413 to check this type of adjacency in inner(), but it needs new rules for
421 <div id=
"outline-container-9" class=
"outline-2">
422 <h2 id=
"sec-9">9 Expectation Maximation in IO/DMV-terms
</h2>
425 <p>inner(s,t,LHS) calculates the expected number of trees headed by LHS
426 from s to t (sentence positions). This uses the P_STOP and P_CHOOSE
427 values, which have been conveniently distributed into CNF rules as
428 probN and probA (non-adjacent and adjacent probabilites).
431 When re-estimating, we use the expected values from inner() to get new
432 values for P_STOP and P_CHOOSE. When we've re-estimated for the entire
433 corpus, we distribute P_STOP and P_CHOOSE into the CNF rules again, so
434 that in the next round we use new probN and probA to find
438 The distribution of P_STOP and P_CHOOSE into CNF rules also happens in
439 init_normalize() (here along with the creation of P_STOP and
440 P_CHOOSE); P_STOP is used to create CNF rules where one branch of the
441 rule is STOP, P_CHOOSE is used to create rules of the form
448 Since
"adjacency" is not captured in regular CNF rules, we need two
449 probabilites for each rule, and inner() has to know when to use which.
452 <li id=
"sec-9.1"><span class=
"todo">TODO
</span> Corpus access
<br/>
454 <li id=
"sec-9.2"><span class=
"todo">TOGROK
</span> sentences or rules as the
"outer loop"?
<span class=
"tag">OPTIMIZE
</span><br/>
455 In regard to the E/M-step, finding P
<sub>STOP
</sub>, P
<sub>CHOOSE
</sub>.
464 <div id=
"outline-container-10" class=
"outline-2">
465 <h2 id=
"sec-10">10 Python-stuff
</h2>
468 <p>Make those debug statements take a bit less space in emacs:
470 (font-lock-add-keywords
472 '((
"\\<\\(\\(io\\.\\)?debug(.+)\\)" 1 font-lock-preprocessor-face t)))
477 <a href=
"src/pseudo.py">pseudo.py
</a>
480 <a href=
"http://nltk.org/doc/en/structured-programming.html">http://nltk.org/doc/en/structured-programming.html
</a> recursive dynamic
483 <a href=
"http://nltk.org/doc/en/advanced-parsing.html">http://nltk.org/doc/en/advanced-parsing.html
</a>
486 <a href=
"http://jaynes.colorado.edu/PythonIdioms.html">http://jaynes.colorado.edu/PythonIdioms.html
</a>
496 <div id=
"outline-container-11" class=
"outline-2">
497 <h2 id=
"sec-11">11 Git
</h2>
500 <p>Repository web page:
<a href=
"http://repo.or.cz/w/dmvccm.git">http://repo.or.cz/w/dmvccm.git
</a>
503 Setting up a new project:
507 git commit -m
"first release"
511 Later on: (
<code>-a
</code> does
<code>git rm
</code> and
<code>git add
</code> automatically)
514 git commit -a -m
"some subsequent release"
518 Then push stuff up to the remote server:
520 git push git+ssh://username@repo.or.cz/srv/git/dmvccm.git master
524 (
<code>eval `ssh-agent`
</code> and
<code>ssh-add
</code> to avoid having to type in keyphrase all
528 Make a copy of the (remote) master branch:
530 git clone git://repo.or.cz/dmvccm.git
534 Make and name a new branch in this folder
536 git checkout -b mybranch
540 To save changes in
<code>mybranch
</code>:
546 Go back to the master branch (uncommitted changes from
<code>mybranch
</code> are
555 git add --interactive
560 <a href=
"http://www-cs-students.stanford.edu/~blynn//gitmagic/">http://www-cs-students.stanford.edu/~blynn//gitmagic/
</a>
563 <div id=
"postamble"><p class=
"author"> Author: Kevin Brubeck Unhammer
564 <a href=
"mailto:K.BrubeckUnhammer at student uva nl "><K.BrubeckUnhammer at student uva nl
></a>
566 <p class=
"date"> Date:
2008/
06/
09 13:
04:
26</p>
567 </div><p class=
"postamble">Skrive vha. emacs +
<a href='http://orgmode.org/'
>org-mode
</a></p></body>