1 <!DOCTYPE html PUBLIC
"-//W3C//DTD XHTML 1.0 Strict//EN"
2 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
3 <html xmlns=
"http://www.w3.org/1999/xhtml"
4 lang=
"en" xml:
lang=
"en">
6 <title>DMV/CCM
– todo-list / progress
</title>
7 <meta http-equiv=
"Content-Type" content=
"text/html;charset=utf-8"/>
8 <meta name=
"generator" content=
"Org-mode"/>
9 <meta name=
"generated" content=
"2008/06/08 12:43:26"/>
10 <meta name=
"author" content=
"Kevin Brubeck Unhammer"/>
11 <link rel=
"stylesheet" type=
"text/css" href=
"http://www.student.uib.no/~kun041/org.css">
13 <h1 class=
"title">DMV/CCM
– todo-list / progress
</h1>
16 <div id=
"table-of-contents">
17 <h2>Table of Contents
</h2>
18 <div id=
"text-table-of-contents">
20 <li><a href=
"#sec-1">1 dmvccm report and project
</a></li>
21 <li><a href=
"#sec-2">2 [#A] P_STOP and P_CHOOSE for IO/EM
</a></li>
22 <li><a href=
"#sec-3">3 Initialization
</a></li>
23 <li><a href=
"#sec-4">4 [#C] Deferred
</a></li>
24 <li><a href=
"#sec-5">5 Adjacency and combining it with inner()
</a></li>
25 <li><a href=
"#sec-6">6 Expectation Maximation in IO/DMV-terms
</a></li>
26 <li><a href=
"#sec-7">7 Python-stuff
</a></li>
27 <li><a href=
"#sec-8">8 Git
</a></li>
32 <div id=
"outline-container-1" class=
"outline-2">
33 <h2 id=
"sec-1">1 dmvccm report and project
</h2>
36 <p><span class=
"timestamp-kwd">DEADLINE:
</span> <span class=
"timestamp">2008-
06-
30 Mon
</span><br/>
37 But absolute, extended, really-quite-dead-now deadline: August
41 <a href=
"src/dmv.py">dmv.py
</a>
44 <a href=
"src/io.py">io.py
</a>
47 <a href=
"src/harmonic.py">harmonic.py
</a>
56 <div id=
"outline-container-2" class=
"outline-2">
57 <h2 id=
"sec-2">2 <span class=
"todo">TODO
</span> [#A] P_STOP and P_CHOOSE for IO/EM
</h2>
60 <p><a href=
"src/dmv.py">dmv-P_STOP
</a>
61 Remember: The P
<sub>STOP
</sub> formula is upside-down (left-to-right also).
62 (In the article..not the thesis)
65 Remember: Initialization makes some
"short-cut" rules, these will also
66 have to be updated along with the other P
<sub>STOP
</sub> updates:
69 b[(NOBAR, n
<sub>h
</sub>), 'h'] =
1.0 # always
72 b[(RBAR, n
<sub>h
</sub>), 'h'] = h_.probA # h_ is RBAR stop rule
75 b[(LRBAR, n
<sub>h
</sub>), 'h'] = h_.probA * _ h_.probA
78 <li id=
"sec-2.1">P_STOP formulas for various dir and adj:
<br/>
82 P
<sub>STOP
</sub>(STOP|h,L,non_adj) =
∑<sub>corpus
</sub> ∑<sub>s
<loc(h)
</sub> ∑<sub>t
</sub>
83 inner(s,t,_h_,
…) /
∑<sub>corpus
</sub> ∑<sub>s
<loc(h)
</sub> ∑<sub>t
</sub>
84 inner(s,t,h_,
…)
87 P
<sub>STOP
</sub>(STOP|h,L,adj) =
∑<sub>corpus
</sub> ∑<sub>s=loc(h)
</sub> ∑<sub>t
</sub>
88 inner(s,t,_h_,
…) /
∑<sub>corpus
</sub> ∑<sub>s=loc(h)
</sub> ∑<sub>t
</sub>
89 inner(s,t,h_,
…)
92 P
<sub>STOP
</sub>(STOP|h,R,non_adj) =
∑<sub>corpus
</sub> ∑<sub>s
</sub> ∑<sub>t
>loc(h)
</sub>
93 inner(s,t,_h_,
…) /
∑<sub>corpus
</sub> ∑<sub>s
</sub> ∑<sub>t
>loc(h)
</sub>
94 inner(s,t,h_,
…)
97 P
<sub>STOP
</sub>(STOP|h,R,adj) =
∑<sub>corpus
</sub> ∑<sub>s
</sub> ∑<sub>t=loc(h)
</sub>
98 inner(s,t,_h_,
…) /
∑<sub>corpus
</sub> ∑<sub>s
</sub> ∑<sub>t=loc(h)
</sub>
99 inner(s,t,h_,
…)
104 <p>(And P
<sub>STOP
</sub>(-STOP|
…) =
1 - P
<sub>STOP
</sub>(STOP|
…) )
106 <li id=
"sec-2.2">P_CHOOSE formula:
<br/>
110 P
<sub>CHOOSE
</sub>(a|h,L) =
∑<sub>corpus
</sub> ∑<sub>s
<loc(h)
</sub> ∑<sub>t=loc(h)
</sub> ∑<sub>r
<t
</sub>
111 inner(s,r,_a_,
…) /
∑<sub>corpus
</sub> ∑<sub>s
<loc(h)
</sub> ∑<sub>t=loc(h)
</sub>
112 inner(s,t,h_,
…)
120 P
<sub>CHOOSE
</sub>(a|h,R) =
∑<sub>corpus
</sub> ∑<sub>s=loc(h)
</sub> ∑<sub>r
>s
</sub>
121 inner(s,r,_a_,
…) /
∑<sub>corpus
</sub> ∑<sub>s=loc(h)
</sub> ∑<sub>t
</sub>
122 inner(s,t,h,
…)
138 <div id=
"outline-container-3" class=
"outline-2">
139 <h2 id=
"sec-3">3 Initialization
</h2>
142 <p><a href=
"/Users/kiwibird/Documents/Skole/V08/Probability/dmvccm/src/dmv.py">dmv-inits
</a>
145 We do have to go through the corpus, since the probabilities are based
146 on how far away in the sentence arguments are from their heads.
148 <li id=
"sec-3.1"><span class=
"todo">TODO
</span> Separate initialization to another file?
<span class=
"tag">PRETTIER
</span><br/>
151 <li id=
"sec-3.2"><span class=
"todo">TOGROK
</span> CCM Initialization
<br/>
152 P
<sub>SPLIT
</sub> used here
… how, again?
154 <li id=
"sec-3.3"><span class=
"done">DONE
</span> DMV Initialization probabilities
<br/>
155 (from initialization frequency)
157 <li id=
"sec-3.4"><span class=
"done">DONE
</span> DMV Initialization frequencies
<br/>
158 <span class=
"timestamp-kwd">CLOSED:
</span> <span class=
"timestamp">2008-
05-
27 Tue
20:
04</span><br/>
160 <li id=
"sec-3.4.1">P_STOP
<br/>
161 P
<sub>STOP
</sub> is not well defined by K
&M. One possible interpretation given
162 the sentence [det nn vb nn] is
164 f_{STOP}( STOP|det, L, adj) +
1
165 f_{STOP}(-STOP|det, L, adj) +
0
166 f_{STOP}( STOP|det, L, non_adj) +
1
167 f_{STOP}(-STOP|det, L, non_adj) +
0
168 f_{STOP}( STOP|det, R, adj) +
0
169 f_{STOP}(-STOP|det, R, adj) +
1
171 f_{STOP}( STOP|nn, L, adj) +
0
172 f_{STOP}(-STOP|nn, L, adj) +
1
173 f_{STOP}( STOP|nn, L, non_adj) +
1 # since there's at least one to the left
174 f_{STOP}(-STOP|nn, L, non_adj) +
0
177 <li id=
"sec-3.4.1.1"><span class=
"todo">TODO
</span> tweak
<br/>
179 f[head, 'STOP', 'LN'] += (i_h
<=
1) # first two words
180 f[head, '-STOP', 'LN'] += (not i_h
<=
1)
181 f[head, 'STOP', 'LA'] += (i_h ==
0) # very first word
182 f[head, '-STOP', 'LA'] += (not i_h ==
0)
183 f[head, 'STOP', 'RN'] += (i_h
>= n -
2) # last two words
184 f[head, '-STOP', 'RN'] += (not i_h
>= n -
2)
185 f[head, 'STOP', 'RA'] += (i_h == n -
1) # very last word
186 f[head, '-STOP', 'RA'] += (not i_h == n -
1)
190 # this one requires some additional rewriting since it
191 # introduces divisions by zero
192 f[head, 'STOP', 'LN'] += (i_h ==
1) # second word
193 f[head, '-STOP', 'LN'] += (not i_h
<=
1) # not first two
194 f[head, 'STOP', 'LA'] += (i_h ==
0) # first word
195 f[head, '-STOP', 'LA'] += (not i_h ==
0) # not first
196 f[head, 'STOP', 'RN'] += (i_h == n -
2) # second-to-last
197 f[head, '-STOP', 'RN'] += (not i_h
>= n -
2) # not last two
198 f[head, 'STOP', 'RA'] += (i_h == n -
1) # last word
199 f[head, '-STOP', 'RA'] += (not i_h == n -
1) # not last
203 f[head, 'STOP', 'LN'] += (i_h ==
1) # second word
204 f[head, '-STOP', 'LN'] += (not i_h ==
1) # not second
205 f[head, 'STOP', 'LA'] += (i_h ==
0) # first word
206 f[head, '-STOP', 'LA'] += (not i_h ==
0) # not first
207 f[head, 'STOP', 'RN'] += (i_h == n -
2) # second-to-last
208 f[head, '-STOP', 'RN'] += (not i_h == n -
2) # not second-to-last
209 f[head, 'STOP', 'RA'] += (i_h == n -
1) # last word
210 f[head, '-STOP', 'RA'] += (not i_h == n -
1) # not last
213 "all words take the same number of arguments" interpreted as
216 p_STOP(head, 'STOP', 'LN') =
0.3
217 p_STOP(head, 'STOP', 'LA') =
0.5
218 p_STOP(head, 'STOP', 'RN') =
0.4
219 p_STOP(head, 'STOP', 'RA') =
0.7
221 (which we easily may tweak in init_zeros())
225 <li id=
"sec-3.4.2">P_CHOOSE
<br/>
226 Go through the corpus, counting distances between heads and
227 arguments. In [det nn vb nn], we give
230 f
<sub>CHOOSE
</sub>(nn|det, R) +
1/
1 + C
233 f
<sub>CHOOSE
</sub>(vb|det, R) +
1/
2 + C
236 f
<sub>CHOOSE
</sub>(nn|det, R) +
1/
3 + C
239 If this were the full corpus, P
<sub>CHOOSE
</sub>(nn|det, R) would have
240 (
1+
1/
3+
2C) / sum_a f
<sub>CHOOSE
</sub>(a|det, R)
247 <p>The ROOT gets
"each argument with equal probability", so in a sentence
248 of three words,
1/
3 for each (in [nn vb nn], 'nn' gets
2/
3). Basically
249 just a frequency count of the corpus
…
258 <div id=
"outline-container-4" class=
"outline-2">
259 <h2 id=
"sec-4">4 [#C] Deferred
</h2>
263 <li id=
"sec-4.1"><span class=
"todo">TODO
</span> if loc_h == t, no need to try right-attachment rules
<span class=
"tag">OPTIMIZE
</span><br/>
264 ** TODO if loc_h == s, no need to try left-attachment rules :OPTIMIZE:
266 <li id=
"sec-4.3"><span class=
"done">DONE
</span> inner_dmv() should disregard rules with heads not in sent
<span class=
"tag">OPTIMIZE
</span><br/>
267 <span class=
"timestamp-kwd">CLOSED:
</span> <span class=
"timestamp">2008-
06-
08 Sun
10:
18</span><br/>
268 If the sentence is
"nn vbd det nn", we should not even look at rules
271 rule.head() not in
"nn vbd det nn".split()
273 This is ruled out by getting rules from g.rules(LHS, sent).
276 Also, we optimize this further by saying we don't even recurse into
277 attachment rules where
279 rule.head() not in sent[ s :r+
1]
280 rule.head() not in sent[r+
1:t+
1]
282 meaning, if we're looking at the span
"vbd det", we only use
283 attachment rules where both daughters are members of ['vbd','det']
284 (although we don't (yet) care about removing rules that rewrite to the
285 same tag if there are no duplicate tags in the span, etc., that would
286 be a lot of trouble for little potential gain).
288 <li id=
"sec-4.4"><span class=
"todo">TODO
</span> when reestimating P_STOP etc, remove rules with p
< epsilon
<span class=
"tag">OPTIMIZE
</span><br/>
290 <li id=
"sec-4.5"><span class=
"todo">TODO
</span> inner_dmv, short ranges and impossible attachment
<span class=
"tag">OPTIMIZE
</span><br/>
291 If s-t
<=
2, there can be only one attachment below, so don't recurse
292 with both Lattach=True and Rattach=True.
295 If s-t
<=
1, there can be no attachment below, so only recurse with
296 Lattach=False, Rattach=False.
299 Put this in the loop under rewrite rules (could also do it in the STOP
300 section, but that would only have an effect on very short sentences).
302 <li id=
"sec-4.6"><span class=
"todo">TODO
</span> clean up the module files
<span class=
"tag">PRETTIER
</span><br/>
303 Is there better way to divide dmv and harmonic? There's a two-way
304 dependency between the modules. Guess there could be a third file that
305 imports both the initialization and the actual EM stuff, while a file
306 containing constants and classes could be imported by all others:
308 dmv.py imports dmv_EM.py imports dmv_classes.py
309 dmv.py imports dmv_inits.py imports dmv_classes.py
313 <li id=
"sec-4.7"><span class=
"todo">TOGROK
</span> Some (tagged) sentences are bound to come twice
<span class=
"tag">OPTIMIZE
</span><br/>
314 Eg, first sort and count, so that the corpus
315 [['nn','vbd','det','nn'],
316 ['vbd','nn','det','nn'],
317 ['nn','vbd','det','nn']]
319 [(['nn','vbd','det','nn'],
2),
320 (['vbd','nn','det','nn'],
1)]
321 and then in each loop through sentences, make sure we handle the
325 Is there much to gain here?
328 <li id=
"sec-4.8"><span class=
"todo">TOGROK
</span> tags as numbers or tags as strings?
<span class=
"tag">OPTIMIZE
</span><br/>
329 Need to clean up the representation.
332 Stick with tag-strings in initialization then switch to numbers for
333 IO-algorithm perhaps? Can probably afford more string-matching in
341 <div id=
"outline-container-5" class=
"outline-2">
342 <h2 id=
"sec-5">5 Adjacency and combining it with inner()
</h2>
345 <p>Each DMV_Rule now has both a probN and a probA, for
346 adjacencies. inner() needs the correct one in each case.
349 Adjacency gives a problem with duplicate words/tags, eg. in the
350 sentence
"a a b". If this has the dependency structure b-
>a
<sub>0</sub>-
>a
<sub>1</sub>,
351 then b is non-adjacent to a
<sub>0</sub> and should use probN (for the LRStop and
352 the attachment of a
<sub>0</sub>), while the other rules should all use
353 probA. But within the e(
0,
2,b) we can't just say
"oh, a has index 0
354 so it's not adjacent to 2", since there's also an a at index
1, and
355 there's also a dependency structure b-
>a
<sub>1</sub>-
>a
<sub>0</sub> for that. We want
356 both. And in possibly much more complex versions.
359 So now we call inner() for each location of a head, and on each
360 terminal, loc_h must equal s (and t). In the recursive attachment
361 calls, we use the locations (sentence indices) of words to the left or
362 right of the head in calls to inner(). loc_h lets us check whether we
366 In each inner() call, loc_h is the location of the root of this
367 dependency structure.
369 <li id=
"sec-5.1"><span class=
"done">DONE
</span> [#A] test and debug my brilliant idea
<br/>
370 <span class=
"timestamp-kwd">CLOSED:
</span> <span class=
"timestamp">2008-
06-
08 Sun
10:
28</span><br/>
372 <li id=
"sec-5.2"><span class=
"done">DONE
</span> implement my brilliant idea.
<br/>
373 <span class=
"timestamp-kwd">CLOSED:
</span> <span class=
"timestamp">2008-
06-
01 Sun
17:
19</span><br/>
374 <a href=
"src/dmv.py">e(sti) in dmv.py
</a>
377 <li id=
"sec-5.3"><span class=
"done">DONE
</span> [#A] test inner() on sentences with duplicate words
<br/>
378 Works with eg. the sentence
"h h h"
385 <div id=
"outline-container-6" class=
"outline-2">
386 <h2 id=
"sec-6">6 Expectation Maximation in IO/DMV-terms
</h2>
389 <p>inner(s,t,LHS) calculates the expected number of trees headed by LHS
390 from s to t (sentence positions). This uses the P_STOP and P_CHOOSE
391 values, which have been conveniently distributed into CNF rules as
392 probN and probA (non-adjacent and adjacent probabilites).
395 When re-estimating, we use the expected values from inner() to get new
396 values for P_STOP and P_CHOOSE. When we've re-estimated for the entire
397 corpus, we distribute P_STOP and P_CHOOSE into the CNF rules again, so
398 that in the next round we use new probN and probA to find
402 The distribution of P_STOP and P_CHOOSE into CNF rules also happens in
403 init_normalize() (here along with the creation of P_STOP and
404 P_CHOOSE); P_STOP is used to create CNF rules where one branch of the
405 rule is STOP, P_CHOOSE is used to create rules of the form
412 Since
"adjacency" is not captured in regular CNF rules, we need two
413 probabilites for each rule, and inner() has to know when to use which.
416 <li id=
"sec-6.1"><span class=
"todo">TODO
</span> Corpus access
<br/>
418 <li id=
"sec-6.2"><span class=
"todo">TOGROK
</span> sentences or rules as the
"outer loop"?
<span class=
"tag">OPTIMIZE
</span><br/>
419 In regard to the E/M-step, finding P
<sub>STOP
</sub>, P
<sub>CHOOSE
</sub>.
428 <div id=
"outline-container-7" class=
"outline-2">
429 <h2 id=
"sec-7">7 Python-stuff
</h2>
434 <a href=
"src/pseudo.py">pseudo.py
</a>
437 <a href=
"http://nltk.org/doc/en/structured-programming.html">http://nltk.org/doc/en/structured-programming.html
</a> recursive dynamic
440 <a href=
"http://nltk.org/doc/en/advanced-parsing.html">http://nltk.org/doc/en/advanced-parsing.html
</a>
443 <a href=
"http://jaynes.colorado.edu/PythonIdioms.html">http://jaynes.colorado.edu/PythonIdioms.html
</a>
453 <div id=
"outline-container-8" class=
"outline-2">
454 <h2 id=
"sec-8">8 Git
</h2>
457 <p>Setting up a new project:
461 git commit -m
"first release"
465 Later on: (
<code>-a
</code> does
<code>git rm
</code> and
<code>git add
</code> automatically)
468 git commit -a -m
"some subsequent release"
472 Then push stuff up to the remote server:
474 git push git+ssh://username@repo.or.cz/srv/git/dmvccm.git master
478 (
<code>eval `ssh-agent`
</code> and
<code>ssh-add
</code> to avoid having to type in keyphrase all
482 Make a copy of the (remote) master branch:
484 git clone git://repo.or.cz/dmvccm.git
488 Make and name a new branch in this folder
490 git checkout -b mybranch
494 To save changes in
<code>mybranch
</code>:
500 Go back to the master branch (uncommitted changes from
<code>mybranch
</code> are
509 git add --interactive
514 <a href=
"http://www-cs-students.stanford.edu/~blynn//gitmagic/">http://www-cs-students.stanford.edu/~blynn//gitmagic/
</a>
517 <div id=
"postamble"><p class=
"author"> Author: Kevin Brubeck Unhammer
518 <a href=
"mailto:K.BrubeckUnhammer at student uva nl "><K.BrubeckUnhammer at student uva nl
></a>
520 <p class=
"date"> Date:
2008/
06/
08 12:
43:
26</p>
521 </div><p class=
"postamble">Skrive vha. emacs +
<a href='http://orgmode.org/'
>org-mode
</a></p></body>