before removing sum_hat_a from p_ATTACH
1 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
2 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
3 <html xmlns="http://www.w3.org/1999/xhtml"
4 lang="en" xml:lang="en">
5 <head>
6 <title>DMV/CCM &ndash; todo-list / progress</title>
7 <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
8 <meta name="generator" content="Org-mode"/>
9 <meta name="generated" content="2008-08-29 15:41:07 CEST"/>
10 <meta name="author" content="Kevin Brubeck Unhammer"/>
11 <style type="text/css">
12 html { font-family: Times, serif; font-size: 12pt; }
13 .title { text-align: center; }
14 .todo { color: red; }
15 .done { color: green; }
16 .tag { background-color:lightblue; font-weight:normal }
17 .target { }
18 .timestamp { color: grey }
19 .timestamp-kwd { color: CadetBlue }
20 p.verse { margin-left: 3% }
pre {
border: 1pt solid #AEBDCC;
background-color: #F3F5F7;
padding: 5pt;
font-family: courier, monospace;
font-size: 90%;
overflow:auto;
}
29 table { border-collapse: collapse; }
30 td, th { vertical-align: top; }
31 dt { font-weight: bold; }
32 </style><link rel="stylesheet" type="text/css" href="http://www.student.uib.no/~kun041/org.css">
33 <!-- override with local style.css: -->
34 <link rel="stylesheet" type="text/css" href="./style.css">
35 </head><body>
36 <h1 class="title">DMV/CCM &ndash; todo-list / progress</h1>
37 <div id="table-of-contents">
38 <h2>Table of Contents</h2>
39 <div id="text-table-of-contents">
40 <ul>
41 <li><a href="#sec-1">1 DMV/CCM report and project</a></li>
42 <li><a href="#sec-2">2 Notation</a></li>
43 <li><a href="#sec-3">3 Testing the dependency parsed WSJ</a>
44 <ul>
45 <li><a href="#sec-3.1">3.1 [#A] Should <code>def evaluate</code> use add_root?</a></li>
46 </ul>
47 </li>
48 <li><a href="#sec-4">4 [#C] Alternative CNF for DMV</a>
49 <ul>
50 <li><a href="#sec-4.1">4.1 [#A] Make and implement an equivalent grammar that's <i>pure</i> CNF</a></li>
51 <li><a href="#sec-4.2">4.2 [#A] convert L&amp;Y-based reestimation into P_ATTACH and P_STOP values</a></li>
52 <li><a href="#sec-4.3">4.3 [#C] move as much as possible into common_dmv.py</a></li>
53 <li><a href="#sec-4.4">4.4 L&amp;Y-based reestimation for cnf_dmv</a></li>
54 <li><a href="#sec-4.5">4.5 dmv2cnf re-estimation formulas</a></li>
55 <li><a href="#sec-4.6">4.6 inner and outer for cnf_dmv.py, also cnf_harmonic.py </a></li>
56 </ul>
57 </li>
58 <li><a href="#sec-5">5 Combine CCM with DMV</a></li>
59 <li><a href="#sec-6">6 Reestimate P_ORDER ?</a></li>
60 <li><a href="#sec-7">7 Most Probable Parse</a>
61 <ul>
62 <li><a href="#sec-7.1">7.1 Find MPP with CCM</a></li>
63 <li><a href="#sec-7.2">7.2 Find Most Probable Parse of given test sentence, in DMV</a></li>
64 </ul>
65 </li>
66 <li><a href="#sec-8">8 Initialization </a>
67 <ul>
68 <li><a href="#sec-8.1">8.1 CCM Initialization </a></li>
69 </ul>
70 </li>
71 <li><a href="#sec-9">9 [#C] Deferred</a>
72 <ul>
73 <li><a href="#sec-9.1">9.1 Clean up reestimation code</a></li>
74 <li><a href="#sec-9.2">9.2 [#A] compare speed of w_left/right(&hellip;) and w(LEFT/RIGHT, &hellip;)</a></li>
75 <li><a href="#sec-9.3">9.3 when reestimating P_STOP etc, remove rules with p &lt; epsilon</a></li>
76 <li><a href="#sec-9.4">9.4 inner_dmv, short ranges and impossible attachment</a></li>
77 <li><a href="#sec-9.5">9.5 clean up the module files</a></li>
78 <li><a href="#sec-9.6">9.6 Some (tagged) sentences are bound to come twice</a></li>
79 <li><a href="#sec-9.7">9.7 tags as numbers or tags as strings?</a></li>
80 </ul>
81 </li>
82 <li><a href="#sec-10">10 Adjacency and combining it with the inside-outside algorithm</a>
83 <ul>
84 <li><a href="#sec-10.1">10.1 Possible alternate type of adjacency</a></li>
85 </ul>
86 </li>
87 <li><a href="#sec-11">11 Python-stuff</a></li>
88 <li><a href="#sec-12">12 Git</a></li>
89 </ul>
90 </div>
91 </div>
93 <div id="outline-container-1" class="outline-2">
94 <h2 id="sec-1">1 DMV/CCM report and project</h2>
95 <div id="text-1">
97 <ul>
98 <li>
99 DMV-<a href="tex/formulas.pdf">formulas.pdf</a> &ndash; <i>clear</i> information =D
100 </li>
101 <li>
102 <a href="src/main.py">main.py</a> &ndash; evaluation, corpus likelihoods
103 </li>
104 <li>
105 <a href="src/wsjdep.py">wsjdep.py</a> &ndash; corpus
107 </li>
108 <li>
109 <a href="src/loc_h_dmv.py">loc_h_dmv.py</a> &ndash; DMV-IO and reestimation
110 </li>
111 <li>
112 <a href="src/loc_h_harmonic.py">loc_h_harmonic.py</a> &ndash; DMV initialization
114 </li>
115 <li>
116 <a href="src/common_dmv.py">common_dmv.py</a> &ndash; various functions used by loc_h_dmv and others
117 </li>
118 <li>
119 <a href="src/io.py">io.py</a> &ndash; non-DMV IO
121 </li>
122 <li>
123 <a href="src/cnf_dmv.py">cnf_dmv.py</a> &ndash; cnf-like implementation of DMV
124 </li>
125 <li>
126 <a href="src/cnf_harmonic.py">cnf_harmonic.py</a> &ndash; initialization for cnf_dmv
128 </li>
129 </ul>
131 <p><a href="http://www.student.uib.no/~kun041/dmvccm/DMVCCM_archive.html">Archived entries</a> from this file.
132 </p></div>
134 </div>
136 <div id="outline-container-2" class="outline-2">
137 <h2 id="sec-2">2 Notation</h2>
138 <div id="text-2">
140 <p><pre class="example">
old notes:  new notes:  in tex/code (constants):  in Klein thesis:
--------------------------------------------------------------------------------------
_h_         _h_         SEAL                      bar over h
h_          h&gt;&lt;         RGOL                      right-under-left-arrow over h
h           h&gt;          GOR                       right-arrow over h

            &gt;&lt;h         LGOR                      left-under-right-arrow over h
            &lt;h          GOL                       left-arrow over h
149 </pre>
150 These are represented in the code as pairs <code>(s_h,h)</code>, where <code>h</code> is an
151 integer (POS-tag) and <code>s_h</code> &isin; <code>{SEAL,RGOL,GOR,LGOR,GOL}</code>.
152 </p>
<p>
<code>P_ATTACH</code> and <code>P_CHOOSE</code> are synonymous; I try to use the
former. Also,
<pre class="example">
P_GO_AT(a|h,dir,adj) := P_ATTACH(a|h,dir)*(1-P_STOP(STOP|h,dir,adj))
</pre>
</p>
<p>
(precalculated after each reestimation with <code>g.p_GO_AT = make_GO_AT(g.p_STOP,g.p_ATTACH)</code>)
162 </p>
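<p>A minimal sketch of what such a precalculation might look like, assuming (hypothetically) that <code>p_STOP</code> is a dict keyed by <code>(h, dir, adj)</code> and <code>p_ATTACH</code> a dict keyed by <code>(a, h, dir)</code>. The direction and adjacency constants are made up for the example, and the actual <code>make_GO_AT</code> in the source may well be organised differently:</p>

```python
LEFT, RIGHT = "L", "R"   # hypothetical direction constants
ADJ, NON = True, False   # hypothetical adjacency flags


def make_GO_AT(p_STOP, p_ATTACH):
    """P_GO_AT(a|h,dir,adj) = P_ATTACH(a|h,dir) * (1 - P_STOP(STOP|h,dir,adj))."""
    p_GO_AT = {}
    for (a, h, direction), p_attach in p_ATTACH.items():
        for adj in (ADJ, NON):
            # Assumption: a missing STOP entry counts as probability 0.
            p_stop = p_STOP.get((h, direction, adj), 0.0)
            p_GO_AT[a, h, direction, adj] = p_attach * (1.0 - p_stop)
    return p_GO_AT


p_STOP = {("vbd", RIGHT, ADJ): 0.25, ("vbd", RIGHT, NON): 0.5}
p_ATTACH = {("nn", "vbd", RIGHT): 0.5}
p_GO_AT = make_GO_AT(p_STOP, p_ATTACH)
```

<p>Precomputing the table once per reestimation avoids repeating the same multiplication in every <code>inner()</code> call.</p>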
163 </div>
165 </div>
167 <div id="outline-container-3" class="outline-2">
168 <h2 id="sec-3">3 Testing the dependency parsed WSJ</h2>
169 <div id="text-3">
171 <p><a href="src/wsjdep.py">wsjdep.py</a> uses NLTK (sort of) to get a dependency parsed version of
172 WSJ10 into the format used in mpp() in loc_h_dmv.py.
173 </p>
<p>
By default, <code>WSJDepCorpusReader</code> looks for the file <code>wsj.combined.10.dep</code> in
<code>../corpus/wsjdep</code>.
</p>
<p>
Only <code>sents()</code>, <code>tagged_sents()</code> and <code>parsed_sents()</code> (plus a new function
<code>tagonly_sents()</code>) are implemented; the other NLTK corpus functions are
undefined.
</p>
183 </div>
185 <div id="outline-container-3.1" class="outline-3">
186 <h3 id="sec-3.1">3.1 <span class="todo">TODO</span> [#A] Should <code>def evaluate</code> use add_root?</h3>
187 <div id="text-3.1">
189 <p><a href="src/main.py">main.py</a> evaluate
190 <a href="src/wsjdep.py">wsjdep.py</a> add_root
191 </p>
<p>
(<code>evaluate</code> just has to count how many head-argument pairs match, for precision and recall)
194 </p></div>
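<p>To see why the question matters, here is a small sketch of pair-counting precision and recall. The function name and the pair format are assumptions for illustration, not the code in <code>main.py</code>. If a ROOT pair is added to both the predicted and the gold set, a correct root raises both scores:</p>

```python
def precision_recall(predicted, gold):
    """Compare two collections of (head, argument) pairs."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted.intersection(gold))
    precision = float(correct) / len(predicted) if predicted else 0.0
    recall = float(correct) / len(gold) if gold else 0.0
    return precision, recall


# Toy sentence "det nn vbd nn"; pairs are ((head_tag, loc), (arg_tag, loc)).
gold = [(("vbd", 2), ("nn", 1)), (("vbd", 2), ("nn", 3)), (("nn", 1), ("det", 0))]
predicted = [(("vbd", 2), ("nn", 1)), (("vbd", 2), ("nn", 3)), (("nn", 3), ("det", 0))]
without_root = precision_recall(predicted, gold)

# Adding the same (correct) ROOT pair to both sides changes both scores.
root_pair = ("ROOT", ("vbd", 2))
with_root = precision_recall(predicted + [root_pair], gold + [root_pair])
```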
195 </div>
197 </div>
199 <div id="outline-container-4" class="outline-2">
200 <h2 id="sec-4">4 <span class="todo">TODO</span> [#C] Alternative CNF for DMV</h2>
201 <div id="text-4">
<p>
<a name="dmv2cnf">&nbsp;</a>
206 </p><ul>
207 <li>
208 <a href="src/cnf_dmv.py">cnf_dmv.py</a>
209 </li>
210 <li>
211 <a href="src/cnf_harmonic.py">cnf_harmonic.py</a>
213 </li>
214 </ul>
216 <p>See section 5 of <a href="tex/formulas.pdf">formulas.pdf</a>.
217 </p>
<p>
Given a grammar with certain p_ATTACH, p_STOP and p_ROOT, we get:
<pre class="example">
&gt;&gt;&gt; print testgrammar_h()
h&gt;&lt; --&gt; h&gt; STOP [0.30]
h&gt;&lt; --&gt; &gt;h&gt; STOP [0.40]
_h_ --&gt; STOP h&gt;&lt; [1.00]
_h_ --&gt; STOP &lt;h&gt;&lt; [1.00]
&gt;h&gt; --&gt; h&gt; _h_ [1.00]
&gt;h&gt; --&gt; &gt;h&gt; _h_ [1.00]
&lt;h&gt;&lt; --&gt; _h_ h&gt;&lt; [0.70]
&lt;h&gt;&lt; --&gt; _h_ &lt;h&gt;&lt; [0.60]
ROOT --&gt; STOP _h_ [1.00]
</pre>
232 </p>
234 </div>
236 <div id="outline-container-4.1" class="outline-3">
237 <h3 id="sec-4.1">4.1 <span class="todo">TODO</span> [#A] Make and implement an equivalent grammar that's <i>pure</i> CNF</h3>
238 <div id="text-4.1">
240 <p>&hellip;since I'm not sure about my unary reestimation rules (section 5 of
241 <a href="tex/formulas.pdf">formulas</a>).
242 </p>
<p>
For any rule where the LHS is <code>_h_</code> we also have a corresponding one with
LHS <code>ROOT</code>, the only difference being that we multiply in <code>p_ROOT(h)</code>.
</p>
<p>
For any rule where the LHS is <code>.h&gt;</code>, we use adjacent probabilities for the
left child; if the LHS is <code>&lt;h.</code> we use adjacent probabilities for the right
child. Only <code>_h_</code> and <code>_h&gt;_</code> (plus <code>ROOT</code>) get to introduce the pre-terminal
<code>h</code> (where <code>h</code>, <code>ROOT</code> and <code>_h_</code> all rewrite to the terminal
<code>'h'</code>), and only <code>_h_</code> and <code>_h&gt;_</code> (plus <code>ROOT</code>) act as STOP
rules (i.e. get to multiply in <code>p(STOP)</code>).
</p>
<p>
<pre class="example">
h    --&gt; 'h'       1
_h_  --&gt; 'h'       p(STOP|h,L,adj) * p(STOP|h,R,adj)
ROOT --&gt; 'h'       p(STOP|h,L,adj) * p(STOP|h,R,adj) * p_ROOT(h)

_h_  --&gt; h _a_     p(STOP|h,L,adj) * p(STOP|h,R,non) * p(a|h,R)*p(-STOP|h,R,adj)
_h_  --&gt; h .h&gt;     p(STOP|h,L,adj) * p(STOP|h,R,non)
.h&gt;  --&gt; _a_ _b_   p(a|h,R)*p(-STOP|h,R,adj) * p(b|h,R)*p(-STOP|h,R,non)
.h&gt;  --&gt; _a_ h&gt;    p(a|h,R)*p(-STOP|h,R,adj)
h&gt;   --&gt; _a_ _b_   p(a|h,R)*p(-STOP|h,R,non) * p(b|h,R)*p(-STOP|h,R,non)
h&gt;   --&gt; _a_ h&gt;    p(a|h,R)*p(-STOP|h,R,non)

_h_  --&gt; _a_ h     p(STOP|h,L,non) * p(STOP|h,R,adj) * p(a|h,L)*p(-STOP|h,L,adj)
_h_  --&gt; &lt;h. h     p(STOP|h,L,non) * p(STOP|h,R,adj)
&lt;h.  --&gt; _b_ _a_   p(b|h,L)*p(-STOP|h,L,non) * p(a|h,L)*p(-STOP|h,L,adj)
&lt;h.  --&gt; &lt;h _a_    p(a|h,L)*p(-STOP|h,L,adj)
&lt;h   --&gt; _a_ _b_   p(a|h,L)*p(-STOP|h,L,non) * p(b|h,L)*p(-STOP|h,L,non)
&lt;h   --&gt; &lt;h _a_    p(a|h,L)*p(-STOP|h,L,non)

_h_  --&gt; &lt;h. _h&gt;_  p(STOP|h,L,non)
_h_  --&gt; _a_ _h&gt;_  p(STOP|h,L,non) * p(a|h,L)*p(-STOP|h,L,adj)
_h&gt;_ --&gt; h .h&gt;     p(STOP|h,R,non)
_h&gt;_ --&gt; h _a_     p(STOP|h,R,non) * p(a|h,R)*p(-STOP|h,R,adj)

ROOT --&gt; h _a_     p(STOP|h,L,adj) * p(STOP|h,R,non) * p(a|h,R)*p(-STOP|h,R,adj) * p_ROOT(h)
ROOT --&gt; h .h&gt;     p(STOP|h,L,adj) * p(STOP|h,R,non) * p_ROOT(h)

ROOT --&gt; _a_ h     p(STOP|h,L,non) * p(STOP|h,R,adj) * p(a|h,L)*p(-STOP|h,L,adj) * p_ROOT(h)
ROOT --&gt; &lt;h. h     p(STOP|h,L,non) * p(STOP|h,R,adj) * p_ROOT(h)

ROOT --&gt; &lt;h. _h&gt;_  p(STOP|h,L,non) * p_ROOT(h)
ROOT --&gt; _a_ _h&gt;_  p(STOP|h,L,non) * p(a|h,L)*p(-STOP|h,L,adj) * p_ROOT(h)
</pre>
290 </p>
<p>
Since some rules mention three tags (rewriting <code>h</code> to <code>a</code> and <code>b</code>), the rule set
numbers more than n<sub>tags</sub><sup>2</sup>.
294 </p>
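<p>As a sanity check on the table above, a few of the rule probabilities can be written out directly. This is only a sketch: the dict layouts (<code>p_stop[h, dir, adj]</code>, <code>p_attach[a, h, dir]</code>, <code>p_root[h]</code>) and the function names are assumptions for the example, not what <code>cnf_dmv.py</code> does.</p>

```python
L, R = "L", "R"          # directions
ADJ, NON = "adj", "non"  # adjacency values


def p_go(p_stop, h, direction, adjacency):
    """p(-STOP|h,dir,adj), the probability of not stopping."""
    return 1.0 - p_stop[h, direction, adjacency]


def seal_terminal(p_stop, h):
    """_h_ --> 'h': stop while adjacent on both sides."""
    return p_stop[h, L, ADJ] * p_stop[h, R, ADJ]


def seal_one_right_arg(p_stop, p_attach, h, a):
    """_h_ --> h _a_: attach one right argument, then stop on both sides."""
    return (p_stop[h, L, ADJ] * p_stop[h, R, NON]
            * p_attach[a, h, R] * p_go(p_stop, h, R, ADJ))


def as_root(p_root, h, seal_prob):
    """ROOT rule: the corresponding _h_ rule times p_ROOT(h)."""
    return seal_prob * p_root[h]


p_stop = {("h", L, ADJ): 0.5, ("h", R, ADJ): 0.25, ("h", R, NON): 0.5}
p_attach = {("a", "h", R): 1.0}
p_root = {"h": 0.5}

terminal = seal_terminal(p_stop, "h")                     # 0.5 * 0.25
one_arg = seal_one_right_arg(p_stop, p_attach, "h", "a")  # 0.5 * 0.5 * 1.0 * 0.75
root_terminal = as_root(p_root, "h", terminal)
```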
295 </div>
297 </div>
299 <div id="outline-container-4.2" class="outline-3">
300 <h3 id="sec-4.2">4.2 <span class="todo">TOGROK</span> [#A] convert L&amp;Y-based reestimation into P_ATTACH and P_STOP values</h3>
301 <div id="text-4.2">
303 <p>Sum over the various rules? Or something? Must think of this.
304 </p></div>
306 </div>
308 <div id="outline-container-4.3" class="outline-3">
309 <h3 id="sec-4.3">4.3 <span class="todo">TODO</span> [#C] move as much as possible into common_dmv.py</h3>
310 <div id="text-4.3">
312 <p><a href="src/common_dmv.py">common_dmv.py</a>
313 </p></div>
315 </div>
317 <div id="outline-container-4.4" class="outline-3">
318 <h3 id="sec-4.4">4.4 <span class="done">DONE</span> L&amp;Y-based reestimation for cnf_dmv</h3>
319 <div id="text-4.4">
321 <p><span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-08-21 Thu 16:35</span><br/>
322 </p></div>
324 </div>
326 <div id="outline-container-4.5" class="outline-3">
327 <h3 id="sec-4.5">4.5 <span class="done">DONE</span> dmv2cnf re-estimation formulas</h3>
328 <div id="text-4.5">
330 <p><span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-08-21 Thu 16:36</span><br/>
331 </p></div>
333 </div>
335 <div id="outline-container-4.6" class="outline-3">
336 <h3 id="sec-4.6">4.6 <span class="done">DONE</span> inner and outer for cnf_dmv.py, also cnf_harmonic.py </h3>
337 <div id="text-4.6">
339 </div>
340 </div>
342 </div>
344 <div id="outline-container-5" class="outline-2">
345 <h2 id="sec-5">5 <span class="todo">TOGROK</span> Combine CCM with DMV</h2>
346 <div id="text-5">
<p>
<a name="comboquestions">&nbsp;</a>
</p>
<p>
Questions about the <code>P_COMBO</code> info in <a href="http://www.eecs.berkeley.edu/~klein/papers/klein_thesis.pdf">Klein's thesis</a>:
354 </p><ul>
355 <li>
356 Page 109 (pdf: 125): We have to premultiply "all our probabilities"
357 by the CCM base product <i>&Pi;<sub>&lt;i,j&gt;</sub> P<sub>SPAN</sub>(&alpha;(i,j,s)|false)P<sub>CONTEXT</sub>(&beta;(i,j,s)|false)</i>; which
358 probabilities are included under "all"? I'm assuming this includes
359 <code>P_ATTACH</code> since each time <code>P_ATTACH</code> is used, <i>&phi;</i> is multiplied in
360 (pp.110-111 ibid.); but <i>&phi;</i> is not used for STOPs, so should we not
361 have our CCM product multiplied in there? How about <code>P_ROOT</code>?
362 (Guessing <code>P_ORDER</code> is way out of the question&hellip;)
363 </li>
364 <li>
365 For the outside probabilities, is it correct to assume we multiply
366 in <i>&phi;(j,k)</i> or <i>&phi;(k,i)</i> when calculating <code>inner(i,j...)</code>? (Eg., only
367 for the outside part, not for the whole range.) I don't understand
368 the notation in <code>O()</code> on p.103.
369 </li>
370 </ul>
371 </div>
373 </div>
375 <div id="outline-container-6" class="outline-2">
376 <h2 id="sec-6">6 <span class="todo">TOGROK</span> Reestimate P_ORDER ?</h2>
377 <div id="text-6">
379 </div>
381 </div>
383 <div id="outline-container-7" class="outline-2">
384 <h2 id="sec-7">7 Most Probable Parse</h2>
385 <div id="text-7">
388 </div>
390 <div id="outline-container-7.1" class="outline-3">
391 <h3 id="sec-7.1">7.1 <span class="todo">TOGROK</span> Find MPP with CCM</h3>
392 <div id="text-7.1">
394 </div>
396 </div>
398 <div id="outline-container-7.2" class="outline-3">
399 <h3 id="sec-7.2">7.2 <span class="done">DONE</span> Find Most Probable Parse of given test sentence, in DMV</h3>
400 <div id="text-7.2">
402 <p><span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-07-23 Wed 10:56</span><br/>
<code>inner()</code> optionally keeps track of the highest-probability children of
any node in <code>mpptree</code>. Say we're looking for <code>inner(i,j,(s_h,h),loc_h)</code> in
a certain sentence, and we find some possible left and right children.
We then add to <code>mpptree[i,j,(s_h,h),loc_h]</code> the triple <code>(p, L, R)</code>, where <code>L</code> and
<code>R</code> are of the same form as the key (<code>i,j,(s_h,h),loc_h</code>) and <code>p</code> is the
probability of this node rewriting to <code>L</code> and <code>R</code>,
e.g. <code>inner(L)*inner(R)*p_GO_AT</code> for an attachment, or a product involving <code>p_STOP</code> for a stop. We only add this
entry to <code>mpptree</code> if there wasn't a higher-probability entry there
before.
</p>
<p>
Then, after <code>inner_sent</code> makes an <code>mpptree</code>, we find the <i>relevant</i>
head-argument pairs by searching through the tree using a queue,
adding the <code>L</code> and <code>R</code> keys of any entry to the queue as we find them
(skipping <code>STOP</code> keys), and adding any attachment entries to a set of
triples <code>(head,argument,dir)</code>. Thus we have our most probable parse, e.g.:
<pre class="example">
set([( ROOT, (vbd,2),RIGHT),
     ((vbd,2),(nn,1),LEFT),
     ((vbd,2),(nn,3),RIGHT),
     ((nn,1),(det,0),LEFT)])
</pre>
426 </p></div>
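<p>The queue search described above could be sketched roughly as follows. The layout of <code>mpptree</code> follows the text (key <code>(i,j,(s_h,h),loc_h)</code>, value <code>(p, L, R)</code>), but the <code>STOP</code> marker, the function name, the way attachments are recognised (an entry with two non-STOP children) and the toy tree are assumptions for illustration. The ROOT entry is left out.</p>

```python
from collections import deque

STOP = "STOP"  # hypothetical marker for the STOP child of a stop rule


def dependencies(mpptree, start):
    """Walk mpptree breadth-first from start and collect (head, argument, dir)."""
    deps = set()
    queue = deque([start])
    while queue:
        key = queue.popleft()
        if key not in mpptree:
            continue  # terminals have no entry, so nothing to follow
        p, left, right = mpptree[key]
        children = [child for child in (left, right) if child != STOP]
        queue.extend(children)
        if len(children) == 2:
            # Attachment: the child that keeps the parent's loc_h is the head.
            if left[3] == key[3]:
                head, arg, direction = left, right, "RIGHT"
            else:
                head, arg, direction = right, left, "LEFT"
            deps.add(((head[2][1], head[3]), (arg[2][1], arg[3]), direction))
    return deps


# Toy tree for the two-word sentence "nn vbd", with vbd taking nn as left argument.
mpptree = {
    (0, 2, ("SEAL", "vbd"), 1): (0.1, STOP, (0, 2, ("RGOL", "vbd"), 1)),
    (0, 2, ("RGOL", "vbd"), 1): (0.2, (0, 1, ("SEAL", "nn"), 0), (1, 2, ("RGOL", "vbd"), 1)),
    (1, 2, ("RGOL", "vbd"), 1): (0.5, (1, 2, ("GOR", "vbd"), 1), STOP),
    (0, 1, ("SEAL", "nn"), 0): (0.5, STOP, (0, 1, ("RGOL", "nn"), 0)),
    (0, 1, ("RGOL", "nn"), 0): (0.5, (0, 1, ("GOR", "nn"), 0), STOP),
}
deps = dependencies(mpptree, (0, 2, ("SEAL", "vbd"), 1))
```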
427 </div>
429 </div>
431 <div id="outline-container-8" class="outline-2">
432 <h2 id="sec-8">8 Initialization </h2>
433 <div id="text-8">
435 <p><a href="/Users/kiwibird/Documents/Skole/V08/Probability/dmvccm/src/dmv.py">dmv-inits</a>
436 </p>
<p>
Initialization makes a pass through the corpus, since the initial probabilities are based on how far
away in the sentence arguments are from their heads.
440 </p>
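<p>One way to turn distances into initial attachment probabilities is inverse-distance ("harmonic") weighting. The sketch below assumes a weight of <code>1/(distance + C)</code> with a made-up constant <code>C</code>, normalised per head and direction; the function name is hypothetical and the real initialization code may differ in its details.</p>

```python
from collections import defaultdict

C = 1.0  # hypothetical smoothing constant added to every distance


def harmonic_attach(corpus):
    """Initial P_ATTACH(a|h,dir), weighting each candidate by 1/(distance + C)."""
    weights = defaultdict(float)  # (a, h, dir) -> summed weight
    totals = defaultdict(float)   # (h, dir) -> normaliser
    for sent in corpus:
        for loc_h, h in enumerate(sent):
            for loc_a, a in enumerate(sent):
                if loc_a == loc_h:
                    continue
                direction = "L" if loc_a < loc_h else "R"
                weight = 1.0 / (abs(loc_h - loc_a) + C)
                weights[a, h, direction] += weight
                totals[h, direction] += weight
    return dict((key, w / totals[key[1], key[2]]) for key, w in weights.items())


p_attach = harmonic_attach([["det", "nn", "vbd"]])
```

<p>In this toy corpus, <code>nn</code> is closer to <code>vbd</code> than <code>det</code> is, so it gets the larger share of <code>vbd</code>'s left-attachment mass.</p>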
441 </div>
443 <div id="outline-container-8.1" class="outline-3">
444 <h3 id="sec-8.1">8.1 <span class="todo">TOGROK</span> CCM Initialization </h3>
445 <div id="text-8.1">
447 <p>P<sub>SPLIT</sub> used here&hellip; how, again?
448 </p></div>
449 </div>
451 </div>
453 <div id="outline-container-9" class="outline-2">
454 <h2 id="sec-9">9 [#C] Deferred</h2>
455 <div id="text-9">
<p><a href="http://wiki.python.org/moin/PythonSpeed/PerformanceTips">http://wiki.python.org/moin/PythonSpeed/PerformanceTips</a> E.g., use
map/reduce/filter, list comprehensions or generator expressions instead of
for-loops; use local variables for globals (global variables or
functions), etc.
461 </p>
462 </div>
464 <div id="outline-container-9.1" class="outline-3">
465 <h3 id="sec-9.1">9.1 <span class="todo">TODO</span> Clean up reestimation code &nbsp;&nbsp;&nbsp;<span class="tag">PRETTIER</span></h3>
466 <div id="text-9.1">
468 </div>
470 </div>
472 <div id="outline-container-9.2" class="outline-3">
473 <h3 id="sec-9.2">9.2 <span class="todo">TODO</span> [#A] compare speed of w_left/right(&hellip;) and w(LEFT/RIGHT, &hellip;) &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span></h3>
474 <div id="text-9.2">
476 </div>
478 </div>
480 <div id="outline-container-9.3" class="outline-3">
481 <h3 id="sec-9.3">9.3 <span class="todo">TODO</span> when reestimating P_STOP etc, remove rules with p &lt; epsilon &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span></h3>
482 <div id="text-9.3">
484 </div>
486 </div>
488 <div id="outline-container-9.4" class="outline-3">
489 <h3 id="sec-9.4">9.4 <span class="todo">TODO</span> inner_dmv, short ranges and impossible attachment &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span></h3>
490 <div id="text-9.4">
<p>If t-s &lt;= 2, there can be only one attachment below, so don't recurse
with both Lattach=True and Rattach=True.
</p>
<p>
If t-s &lt;= 1, there can be no attachment below, so only recurse with
Lattach=False, Rattach=False.
</p>
<p>
Put this in the loop under rewrite rules (could also do it in the STOP
section, but that would only have an effect on very short sentences).
502 </p></div>
504 </div>
506 <div id="outline-container-9.5" class="outline-3">
507 <h3 id="sec-9.5">9.5 <span class="todo">TODO</span> clean up the module files &nbsp;&nbsp;&nbsp;<span class="tag">PRETTIER</span></h3>
508 <div id="text-9.5">
<p>Is there a better way to divide dmv and harmonic? There's a two-way
dependency between the modules. Guess there could be a third file that
imports both the initialization and the actual EM stuff, while a file
containing constants and classes could be imported by all others:
<pre class="example">
dmv.py imports dmv_EM.py imports dmv_classes.py
dmv.py imports dmv_inits.py imports dmv_classes.py
</pre>
518 </p>
519 </div>
521 </div>
523 <div id="outline-container-9.6" class="outline-3">
524 <h3 id="sec-9.6">9.6 <span class="todo">TOGROK</span> Some (tagged) sentences are bound to come twice &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span></h3>
525 <div id="text-9.6">
<p>E.g., first sort and count, so that the corpus
<pre class="example">
[['nn','vbd','det','nn'],
 ['vbd','nn','det','nn'],
 ['nn','vbd','det','nn']]
</pre>
becomes
<pre class="example">
[(['nn','vbd','det','nn'],2),
 (['vbd','nn','det','nn'],1)]
</pre>
and then in each loop through sentences, make sure we handle the
frequency correctly.
</p>
<p>
Is there much to gain here?
539 </p>
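<p>The sort-and-count step itself is cheap to write; a sketch using the standard library (the function name is made up for the example):</p>

```python
from collections import Counter


def count_sentences(corpus):
    """Collapse duplicate tag sequences into sorted (sentence, frequency) pairs."""
    counts = Counter(tuple(sent) for sent in corpus)
    return [(list(sent), freq) for sent, freq in sorted(counts.items())]


corpus = [['nn', 'vbd', 'det', 'nn'],
          ['vbd', 'nn', 'det', 'nn'],
          ['nn', 'vbd', 'det', 'nn']]
counted = count_sentences(corpus)
n_sentences = sum(freq for _, freq in counted)
```

<p>Each reestimation loop would then multiply a sentence's expected counts by its frequency rather than recomputing them for every duplicate.</p>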
540 </div>
542 </div>
544 <div id="outline-container-9.7" class="outline-3">
545 <h3 id="sec-9.7">9.7 <span class="todo">TOGROK</span> tags as numbers or tags as strings? &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span></h3>
546 <div id="text-9.7">
548 <p>Need to clean up the representation.
549 </p>
<p>
Perhaps stick with tag-strings in initialization, then switch to numbers for
the IO-algorithm? We can probably afford more string-matching in
initialization.
554 </p></div>
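<p>Switching representation after initialization only needs a pair of lookup tables. A sketch, with hypothetical names:</p>

```python
def make_tag_index(tagged_corpus):
    """Number the tags in order of first appearance; return both directions."""
    tag2num = {}
    for sent in tagged_corpus:
        for tag in sent:
            tag2num.setdefault(tag, len(tag2num))
    num2tag = dict((num, tag) for tag, num in tag2num.items())
    return tag2num, num2tag


corpus = [['nn', 'vbd', 'det', 'nn'], ['vbd', 'nn', 'det', 'nn']]
tag2num, num2tag = make_tag_index(corpus)
numbered = [[tag2num[tag] for tag in sent] for sent in corpus]
```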
555 </div>
557 </div>
559 <div id="outline-container-10" class="outline-2">
560 <h2 id="sec-10">10 Adjacency and combining it with the inside-outside algorithm</h2>
561 <div id="text-10">
<p>Each DMV_Rule has both a probN and a probA, one for each adjacency value. inner()
and outer() need the correct one in each case.
</p>
<p>
In each inner() call, loc_h is the location of the head of this
dependency structure. In each outer() call, it's the head of the <i>Node</i>,
the structure we're looking outside of.
</p>
<p>
We call inner() for each location of a head, and on each terminal,
loc_h must equal <code>i</code> (and <code>loc_h+1</code> equal <code>j</code>). In the recursive attachment
calls, we use the locations (sentence indices) of words to the left or
right of the head in calls to inner(). <i>loc_h lets us check whether we need probN or probA</i>.
576 </p>
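<p>One reading of how loc_h decides between the two probabilities, as a sketch only: the helper names, the half-open span convention and the dict standing in for a DMV_Rule are assumptions, as is reading probA as "adjacent" and probN as "non-adjacent".</p>

```python
def is_adjacent(loc_h, head_start, head_end, direction):
    """True if the head at loc_h has taken no argument in `direction` yet.

    The head's own subtree is assumed to cover the half-open span
    [head_start, head_end), so a bare terminal has head_start == loc_h
    and head_end == loc_h + 1.
    """
    if direction == "R":
        return head_end == loc_h + 1
    return head_start == loc_h


def attach_prob(rule, adjacent):
    """Pick probA for an adjacent attachment, probN otherwise."""
    return rule["probA"] if adjacent else rule["probN"]


rule = {"probA": 0.75, "probN": 0.25}  # stand-in for a DMV_Rule
first_right = attach_prob(rule, is_adjacent(2, 2, 3, "R"))   # head still bare on the right
second_right = attach_prob(rule, is_adjacent(2, 2, 4, "R"))  # already has a right argument
```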
577 </div>
579 <div id="outline-container-10.1" class="outline-3">
580 <h3 id="sec-10.1">10.1 Possible alternate type of adjacency</h3>
581 <div id="text-10.1">
<p>K&amp;M's adjacency is just whether or not an argument has been generated
in the current direction yet. One could also make a stronger type of
adjacency, where h and a are not adjacent if b is in between. E.g., with
the sentence "a b h" and the structure ((h-&gt;a), (a-&gt;b)), h is
K&amp;M-adjacent to a, but not next to a, since b is in between. It's easy
to check this type of adjacency in inner(), but it needs new rules for
P_STOP reestimation.
590 </p></div>
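<p>The difference between the two notions on the "a b h" example, as a small sketch (both function names are made up for illustration):</p>

```python
def km_adjacent(n_args_in_direction):
    """K&M adjacency: no argument generated yet in this direction."""
    return n_args_in_direction == 0


def strongly_adjacent(loc_h, loc_a):
    """Stronger adjacency: head and argument are neighbouring words."""
    return abs(loc_h - loc_a) == 1


sent = ["a", "b", "h"]
loc_h, loc_a = sent.index("h"), sent.index("a")
km = km_adjacent(0)                       # a is the first left argument of h
strong = strongly_adjacent(loc_h, loc_a)  # but b sits between them
```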
591 </div>
593 </div>
595 <div id="outline-container-11" class="outline-2">
596 <h2 id="sec-11">11 Python-stuff</h2>
597 <div id="text-11">
599 <p>Make those debug statements steal a bit less attention in emacs:
600 <pre class="example">
(font-lock-add-keywords
 'python-mode                         ; not really regexp, a bit slow
 '(("^\\( *\\)\\(\\if +'.+' +in +io.DEBUG. *\\(
\\1 .+$\\)+\\)" 2 font-lock-preprocessor-face t)))
(font-lock-add-keywords
 'python-mode
 '(("\\&lt;\\(\\(io\\.\\)?debug(.+)\\)" 1 font-lock-preprocessor-face t)))
608 </pre>
609 </p>
610 <ul>
611 <li>
612 <a href="src/pseudo.py">pseudo.py</a>
613 </li>
614 <li>
615 <a href="http://nltk.org/doc/en/structured-programming.html">http://nltk.org/doc/en/structured-programming.html</a> recursive dynamic
616 </li>
617 <li>
618 <a href="http://nltk.org/doc/en/advanced-parsing.html">http://nltk.org/doc/en/advanced-parsing.html</a>
619 </li>
620 <li>
621 <a href="http://jaynes.colorado.edu/PythonIdioms.html">http://jaynes.colorado.edu/PythonIdioms.html</a>
625 </li>
626 </ul>
627 </div>
629 </div>
631 <div id="outline-container-12" class="outline-2">
632 <h2 id="sec-12">12 Git</h2>
633 <div id="text-12">
635 <p>Repository web page: <a href="http://repo.or.cz/w/dmvccm.git">http://repo.or.cz/w/dmvccm.git</a>
636 </p>
<p>
Setting up a new project:
<pre class="example">
git init
git add .
git commit -m "first release"
</pre>
</p>
<p>
Later on (<code>-a</code> automatically stages modified and deleted tracked files; new files still need <code>git add</code>):
<pre class="example">
git commit -a -m "some subsequent release"
</pre>
651 </p>
<p>
Then push stuff up to the remote server:
<pre class="example">
git push git+ssh://username@repo.or.cz/srv/git/dmvccm.git master
</pre>
</p>
<p>
(<code>eval `ssh-agent`</code> and <code>ssh-add</code> to avoid having to type in the passphrase all
the time)
</p>
<p>
Make a copy of the (remote) master branch:
<pre class="example">
git clone git://repo.or.cz/dmvccm.git
</pre>
</p>
<p>
Make and name a new branch in this folder:
<pre class="example">
git checkout -b mybranch
</pre>
</p>
<p>
To save changes in <code>mybranch</code>:
<pre class="example">
git commit -a
</pre>
</p>
<p>
Go back to the master branch (uncommitted changes from <code>mybranch</code> are
carried over):
<pre class="example">
git checkout master
</pre>
</p>
<p>
Try out:
<pre class="example">
git add --interactive
</pre>
</p>
<p>
Good tutorial:
<a href="http://www-cs-students.stanford.edu/~blynn//gitmagic/">http://www-cs-students.stanford.edu/~blynn//gitmagic/</a>
696 </p></div>
697 </div>
698 <div id="postamble"><p class="author"> Author: Kevin Brubeck Unhammer
699 <a href="mailto:K.BrubeckUnhammer at student uva nl ">&lt;K.BrubeckUnhammer at student uva nl &gt;</a>
700 </p>
701 <p class="date"> Date: 2008-08-29 15:41:07 CEST</p>
<p>HTML generated by org-mode 6.06b in emacs 22</p>
703 </div><div class="post-postamble" id="nn-postamble">
Written using emacs + <a href='http://orgmode.org/'>org-mode</a>
705 </div>
706 <script src="./post-script.js" type="text/JavaScript">
707 </script></body>
708 </html>