<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
lang="en" xml:lang="en">
<head>
<title>DMV/CCM &ndash; todo-list / progress</title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<meta name="generator" content="Org-mode"/>
<meta name="generated" content="2008-09-21 19:00:58 CEST"/>
<meta name="author" content="Kevin Brubeck Unhammer"/>
<style type="text/css">
html { font-family: Times, serif; font-size: 12pt; }
.title { text-align: center; }
.todo { color: red; }
.done { color: green; }
.tag { background-color:lightblue; font-weight:normal }
.target { }
.timestamp { color: grey }
.timestamp-kwd { color: CadetBlue }
p.verse { margin-left: 3% }
pre {
  border: 1pt solid #AEBDCC;
  background-color: #F3F5F7;
  padding: 5pt;
  font-family: courier, monospace;
  font-size: 90%;
  overflow:auto;
}
table { border-collapse: collapse; }
td, th { vertical-align: top; }
dt { font-weight: bold; }
</style><link rel="stylesheet" type="text/css" href="http://www.student.uib.no/~kun041/org.css"/>
<!-- override with local style.css: -->
<link rel="stylesheet" type="text/css" href="./style.css"/>
</head><body>
<h1 class="title">DMV/CCM &ndash; todo-list / progress</h1>
<div id="table-of-contents">
<h2>Table of Contents</h2>
<div id="text-table-of-contents">
<ul>
<li><a href="#sec-1">1 DMV/CCM report and project</a></li>
<li><a href="#sec-2">2 Notation</a></li>
<li><a href="#sec-3">3 Testing the dependency parsed WSJ</a>
<ul>
<li><a href="#sec-3.1">3.1 [#A] Should <code>def evaluate</code> use add_root?</a></li>
</ul>
</li>
<li><a href="#sec-4">4 Combine CCM with DMV</a></li>
<li><a href="#sec-5">5 Reestimate P_ORDER ?</a></li>
<li><a href="#sec-6">6 Most Probable Parse</a>
<ul>
<li><a href="#sec-6.1">6.1 Find MPP with CCM</a></li>
<li><a href="#sec-6.2">6.2 Find Most Probable Parse of given test sentence, in DMV</a></li>
</ul>
</li>
<li><a href="#sec-7">7 Initialization </a>
<ul>
<li><a href="#sec-7.1">7.1 CCM Initialization </a></li>
</ul>
</li>
<li><a href="#sec-8">8 [#C] Alternative CNF for DMV</a>
<ul>
<li><a href="#sec-8.1">8.1 [#A] Make and implement an equivalent grammar that's <i>pure</i> CNF</a></li>
<li><a href="#sec-8.2">8.2 [#A] convert L&amp;Y-based reestimation into P_ATTACH and P_STOP values</a></li>
<li><a href="#sec-8.3">8.3 [#C] move as much as possible into common_dmv.py</a></li>
<li><a href="#sec-8.4">8.4 L&amp;Y-based reestimation for cnf_dmv</a></li>
<li><a href="#sec-8.5">8.5 dmv2cnf re-estimation formulas</a></li>
<li><a href="#sec-8.6">8.6 inner and outer for cnf_dmv.py, also cnf_harmonic.py </a></li>
</ul>
</li>
<li><a href="#sec-9">9 [#C] Deferred</a>
<ul>
<li><a href="#sec-9.1">9.1 Clean up reestimation code</a></li>
<li><a href="#sec-9.2">9.2 [#A] compare speed of w_left/right(&hellip;) and w(LEFT/RIGHT, &hellip;)</a></li>
<li><a href="#sec-9.3">9.3 when reestimating P_STOP etc, remove rules with p &lt; epsilon</a></li>
<li><a href="#sec-9.4">9.4 inner_dmv, short ranges and impossible attachment</a></li>
<li><a href="#sec-9.5">9.5 clean up the module files</a></li>
<li><a href="#sec-9.6">9.6 Some (tagged) sentences are bound to come twice</a></li>
<li><a href="#sec-9.7">9.7 tags as numbers or tags as strings?</a></li>
</ul>
</li>
<li><a href="#sec-10">10 Adjacency and combining it with the inside-outside algorithm</a>
<ul>
<li><a href="#sec-10.1">10.1 Possible alternate type of adjacency</a></li>
</ul>
</li>
<li><a href="#sec-11">11 Python-stuff</a></li>
<li><a href="#sec-12">12 Git</a></li>
</ul>
</div>
</div>
<div id="outline-container-1" class="outline-2">
<h2 id="sec-1">1 DMV/CCM report and project</h2>
<div id="text-1">
<p><span class="timestamp-kwd">DEADLINE: </span> <span class="timestamp">2008-09-21 Sun</span><br/>
</p><ul>
<li>
<a href="http://www.student.uib.no/~kun041/dmvccm/report.pdf">report.pdf</a> &ndash; Draft report for the whole project, including formulas
for the full algorithms
</li>
<li>
<a href="src/main.py">main.py</a> &ndash; evaluation, corpus likelihoods
</li>
<li>
<a href="src/wsjdep.py">wsjdep.py</a> &ndash; corpus reader for the dependency parsed WSJ
</li>
<li>
<a href="src/loc_h_dmv.py">loc_h_dmv.py</a> &ndash; DMV-IO and reestimation
</li>
<li>
<a href="src/loc_h_harmonic.py">loc_h_harmonic.py</a> &ndash; DMV initialization
</li>
<li>
<a href="src/common_dmv.py">common_dmv.py</a> &ndash; various functions used by loc_h_dmv and others
</li>
<li>
<a href="src/io.py">io.py</a> &ndash; non-DMV IO
</li>
</ul>
<p>Deprecated:
</p><ul>
<li>
<a href="src/cnf_dmv.py">cnf_dmv.py</a> &ndash; cnf-like implementation of DMV
</li>
<li>
<a href="src/cnf_harmonic.py">cnf_harmonic.py</a> &ndash; initialization for cnf_dmv
</li>
</ul>
<p><a href="http://www.student.uib.no/~kun041/dmvccm/DMVCCM_archive.html">Archived entries</a> from this file.
</p></div>
</div>
<div id="outline-container-2" class="outline-2">
<h2 id="sec-2">2 Notation</h2>
<div id="text-2">
<p><pre class="example">
old notes:  new notes:  in tex/code (constants):  in Klein thesis:
--------------------------------------------------------------------------------------
_h_         _h_         SEAL                      bar over h
h_          h&gt;&lt;         RGOL                      right-under-left-arrow over h
h           h&gt;          GOR                       right-arrow over h

            &gt;&lt;h         LGOR                      left-under-right-arrow over h
            &lt;h          GOL                       left-arrow over h
</pre>
These are represented in the code as pairs <code>(s_h,h)</code>, where <code>h</code> is an
integer (POS-tag) and <code>s_h</code> &isin; <code>{SEAL,RGOL,GOR,LGOR,GOL}</code>.
</p>
<p>
<code>P_ATTACH</code> and <code>P_CHOOSE</code> are synonymous; I try to use the
former. Also,
<pre class="example">
P_GO_AT(a|h,dir,adj) := P_ATTACH(a|h,dir)*(1-P_STOP(STOP|h,dir,adj))
</pre>
</p>
<p>
(precalculated after each reestimation with <code>g.p_GO_AT = make_GO_AT(g.p_STOP,g.p_ATTACH)</code>)
</p>
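<p>
A minimal sketch of that precalculation, to make the definition of <code>P_GO_AT</code> concrete. The function name <code>make_go_at_sketch</code> and the dictionary layouts are hypothetical; the real <code>make_GO_AT</code> may store its tables differently.
</p>

```python
# Hypothetical sketch: the key layouts below are assumptions,
# not necessarily those used by make_GO_AT in the project.
def make_go_at_sketch(p_stop, p_attach):
    """Precompute P_GO_AT(a|h,dir,adj) = P_ATTACH(a|h,dir) * (1 - P_STOP(STOP|h,dir,adj)).

    p_stop   maps (h, dir, adj) -> probability of stopping
    p_attach maps (a, h, dir)   -> probability of attaching a to h
    """
    p_go_at = {}
    for (a, h, direction), attach in p_attach.items():
        for adj in (True, False):
            stop = p_stop[(h, direction, adj)]
            p_go_at[(a, h, direction, adj)] = attach * (1.0 - stop)
    return p_go_at


LEFT = "L"
p_stop = {("vbd", LEFT, True): 0.4, ("vbd", LEFT, False): 0.8}
p_attach = {("nn", "vbd", LEFT): 0.5}
p_go_at = make_go_at_sketch(p_stop, p_attach)
```

<p>
With these made-up numbers the adjacent value is 0.5 * (1 - 0.4) and the non-adjacent one 0.5 * (1 - 0.8); both are looked up rather than recomputed inside inner() and outer().
</p>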
</div>
</div>
<div id="outline-container-3" class="outline-2">
<h2 id="sec-3">3 Testing the dependency parsed WSJ</h2>
<div id="text-3">
<p><a href="src/wsjdep.py">wsjdep.py</a> uses NLTK (sort of) to get a dependency parsed version of
WSJ10 into the format used in mpp() in loc_h_dmv.py.
</p>
<p>
By default, <code>WSJDepCorpusReader</code> looks for the file <code>wsj.combined.10.dep</code> in
<code>../corpus/wsjdep</code>.
</p>
<p>
Only <code>sents()</code>, <code>tagged_sents()</code> and <code>parsed_sents()</code> (plus a new function
<code>tagonly_sents()</code>) are implemented; the other NLTK corpus functions are
undefined.
</p>
</div>
<div id="outline-container-3.1" class="outline-3">
<h3 id="sec-3.1">3.1 <span class="todo">TODO</span> [#A] Should <code>def evaluate</code> use add_root?</h3>
<div id="text-3.1">
<p><a href="src/main.py">main.py</a> evaluate
<a href="src/wsjdep.py">wsjdep.py</a> add_root
</p>
<p>
(just has to count how many pairs are in there; Precision and Recall)
</p></div>
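<p>
To see why the question matters, here is a rough sketch of pair-counting precision and recall, with and without the ROOT pair. The helper names and the toy parses are made up for illustration; <code>evaluate</code> in main.py and <code>add_root</code> in wsjdep.py may work differently.
</p>

```python
# Hypothetical sketch; not the project's evaluate() or add_root().
ROOT = "ROOT"


def precision_recall(predicted, gold):
    """Compare two sets of (head, argument) pairs."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


def without_root(pairs):
    """Drop the pair that attaches the sentence head to ROOT."""
    return {(head, arg) for head, arg in pairs if head != ROOT}


# Toy parses of "det nn vbd nn"; the predicted parse attaches det to the verb.
gold = {(ROOT, ("vbd", 2)), (("vbd", 2), ("nn", 1)),
        (("vbd", 2), ("nn", 3)), (("nn", 1), ("det", 0))}
predicted = {(ROOT, ("vbd", 2)), (("vbd", 2), ("nn", 1)),
             (("vbd", 2), ("nn", 3)), (("vbd", 2), ("det", 0))}

with_root_scores = precision_recall(predicted, gold)
no_root_scores = precision_recall(without_root(predicted), without_root(gold))
```

<p>
In this toy case counting the ROOT pair gives 3 of 4 correct, while leaving it out gives 2 of 3, so the choice shifts the reported numbers.
</p>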
</div>
</div>
<div id="outline-container-4" class="outline-2">
<h2 id="sec-4">4 <span class="todo">TOGROK</span> Combine CCM with DMV</h2>
<div id="text-4">
<p>
<a name="comboquestions">&nbsp;</a>
</p>
<p>
Questions about the <code>P_COMBO</code> info in <a href="http://www.eecs.berkeley.edu/~klein/papers/klein_thesis.pdf">Klein's thesis</a>:
</p><ul>
<li>
Page 109 (pdf: 125): We have to premultiply "all our probabilities"
by the CCM base product <i>&Pi;<sub>&lt;i,j&gt;</sub> P<sub>SPAN</sub>(&alpha;(i,j,s)|false)P<sub>CONTEXT</sub>(&beta;(i,j,s)|false)</i>; which
probabilities are included under "all"? I'm assuming this includes
<code>P_ATTACH</code> since each time <code>P_ATTACH</code> is used, <i>&phi;</i> is multiplied in
(pp.110-111 ibid.); but <i>&phi;</i> is not used for STOPs, so should we not
have our CCM product multiplied in there? How about <code>P_ROOT</code>?
(Guessing <code>P_ORDER</code> is way out of the question&hellip;)
</li>
<li>
For the outside probabilities, is it correct to assume we multiply
in <i>&phi;(j,k)</i> or <i>&phi;(k,i)</i> when calculating <code>inner(i,j...)</code>? (Eg., only
for the outside part, not for the whole range.) I don't understand
the notation in <code>O()</code> on p.103.
</li>
</ul>
</div>
</div>
<div id="outline-container-5" class="outline-2">
<h2 id="sec-5">5 <span class="todo">TOGROK</span> Reestimate P_ORDER ?</h2>
<div id="text-5">
</div>
</div>
<div id="outline-container-6" class="outline-2">
<h2 id="sec-6">6 Most Probable Parse</h2>
<div id="text-6">
</div>
<div id="outline-container-6.1" class="outline-3">
<h3 id="sec-6.1">6.1 <span class="todo">TOGROK</span> Find MPP with CCM</h3>
<div id="text-6.1">
</div>
</div>
<div id="outline-container-6.2" class="outline-3">
<h3 id="sec-6.2">6.2 <span class="done">DONE</span> Find Most Probable Parse of given test sentence, in DMV</h3>
<div id="text-6.2">
<p><span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-07-23 Wed 10:56</span><br/>
inner() optionally keeps track of the highest-probability children of
any node in <code>mpptree</code>. Say we're looking for <code>inner(i,j,(s_h,h),loc_h)</code> in
a certain sentence, and we find some possible left and right children:
we add to <code>mpptree[i,j,(s_h,h),loc_h]</code> the triple <code>(p, L, R)</code>, where <code>L</code> and
<code>R</code> are of the same form as the key (<code>i,j,(s_h,h),loc_h</code>) and <code>p</code> is the
probability of this node rewriting to <code>L</code> and <code>R</code>,
eg. <code>inner(L)*inner(R)*p_GO_AT</code> or <code>p_STOP</code> or whatever. We only add this
entry to <code>mpptree</code> if there wasn't a higher-probability entry there
before.
</p>
<p>
Then, after <code>inner_sent</code> makes an <code>mpptree</code>, we find the <i>relevant</i>
head-argument pairs by searching through the tree using a queue,
adding the <code>L</code> and <code>R</code> keys of any entry to the queue as we find them
(skipping <code>STOP</code> keys), and adding any attachment entries to a set of
triples <code>(head,argument,dir)</code>. Thus we have our most probable parse, for example:
<pre class="example">
set([( ROOT,  (vbd,2),RIGHT),
     ((vbd,2),(nn,1),LEFT),
     ((vbd,2),(nn,3),RIGHT),
     ((nn,1),(det,0),LEFT)])
</pre>
</p></div>
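<p>
The two steps described above (keep only the best entry per key, then walk the tree with a queue) can be sketched roughly as follows. This is not the code in loc_h_dmv.py: the constants, the helper names and the rule used to tell head from argument (the child that shares <code>loc_h</code> with its parent is taken to be the head) are assumptions for illustration, and the toy probabilities are made up.
</p>

```python
from collections import deque

# Hypothetical constants; the real ones live in the project's modules.
STOP = "STOP"
ROOT = "ROOT"
LEFT, RIGHT = "LEFT", "RIGHT"


def record(mpptree, key, p, left, right):
    """Store (p, left, right) unless a higher-probability entry is already there."""
    if key not in mpptree or mpptree[key][0] < p:
        mpptree[key] = (p, left, right)


def head_of(key):
    """(tag, position) of the head of a key (i, j, (s_h, h), loc_h)."""
    _i, _j, (_s_h, h), loc_h = key
    return (h, loc_h)


def attachments(mpptree, top):
    """Walk the tree from `top` with a queue, collecting (head, argument, dir)."""
    found = {(ROOT, head_of(top), RIGHT)}
    queue = deque([top])
    while queue:
        key = queue.popleft()
        if key not in mpptree:          # terminal: nothing recorded below
            continue
        _p, left, right = mpptree[key]
        children = [c for c in (left, right) if c != STOP]
        queue.extend(children)
        if len(children) == 2:          # no STOP child, so this is an attachment
            # Assumption: the child sharing the parent's loc_h is the head.
            if left[3] == key[3]:
                found.add((head_of(left), head_of(right), RIGHT))
            else:
                found.add((head_of(right), head_of(left), LEFT))
    return found


# Toy tree for "det nn vbd nn"; the STOP steps below the arguments
# are left out for brevity.
top = (0, 4, ("SEAL", "vbd"), 2)
mpptree = {}
record(mpptree, top, 0.1, STOP, (0, 4, ("RGOL", "vbd"), 2))
record(mpptree, (0, 4, ("RGOL", "vbd"), 2), 0.1,
       (0, 2, ("SEAL", "nn"), 1), (2, 4, ("RGOL", "vbd"), 2))
record(mpptree, (2, 4, ("RGOL", "vbd"), 2), 0.2, (2, 4, ("GOR", "vbd"), 2), STOP)
record(mpptree, (2, 4, ("GOR", "vbd"), 2), 0.3,
       (2, 3, ("GOR", "vbd"), 2), (3, 4, ("SEAL", "nn"), 3))
record(mpptree, (0, 2, ("SEAL", "nn"), 1), 0.4, STOP, (0, 2, ("RGOL", "nn"), 1))
record(mpptree, (0, 2, ("RGOL", "nn"), 1), 0.5,
       (0, 1, ("SEAL", "det"), 0), (1, 2, ("RGOL", "nn"), 1))
record(mpptree, top, 0.01, STOP, STOP)  # lower probability: ignored

parse = attachments(mpptree, top)
```

<p>
For this toy tree <code>parse</code> holds the same four triples as the example set above, with the string <code>"ROOT"</code> standing in for the ROOT constant.
</p>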
</div>
</div>
<div id="outline-container-7" class="outline-2">
<h2 id="sec-7">7 Initialization </h2>
<div id="text-7">
<p><a href="/Users/kiwibird/Documents/Skole/V08/Probability/dmvccm/src/dmv.py">dmv-inits</a>
</p>
<p>
We go through the corpus, since the probabilities are based on how far
away in the sentence arguments are from their heads.
</p>
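<p>
A rough sketch of such a distance-based pass over the corpus, assuming a plain 1/distance weight normalised per head and direction. The function name and table layout are hypothetical, and any extra constants used by the actual initialization in loc_h_harmonic.py are not reproduced here.
</p>

```python
from collections import defaultdict


def harmonic_attach_sketch(tagged_corpus):
    """Weight each possible (argument, head, dir) by 1/distance, then normalise.

    Assumption: weights are normalised separately for each (head, dir).
    """
    weights = defaultdict(float)
    for sent in tagged_corpus:
        for loc_h, h in enumerate(sent):
            for loc_a, a in enumerate(sent):
                if loc_a == loc_h:
                    continue
                direction = "L" if loc_a < loc_h else "R"
                weights[(a, h, direction)] += 1.0 / abs(loc_h - loc_a)

    totals = defaultdict(float)
    for (a, h, direction), w in weights.items():
        totals[(h, direction)] += w

    return {(a, h, direction): w / totals[(h, direction)]
            for (a, h, direction), w in weights.items()}


p_attach_init = harmonic_attach_sketch([["det", "nn", "vbd"]])
```

<p>
In the one-sentence corpus above, <code>nn</code> is one word away from <code>vbd</code> and <code>det</code> two words away, so <code>nn</code> ends up with twice the initial attachment probability of <code>det</code> to the left of <code>vbd</code>.
</p>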
</div>
<div id="outline-container-7.1" class="outline-3">
<h3 id="sec-7.1">7.1 <span class="todo">TOGROK</span> CCM Initialization </h3>
<div id="text-7.1">
<p>P<sub>SPLIT</sub> used here&hellip; how, again?
</p></div>
</div>
</div>
<div id="outline-container-8" class="outline-2">
<h2 id="sec-8">8 <span class="todo">TODO</span> [#C] Alternative CNF for DMV</h2>
<div id="text-8">
<p>
<a name="dmv2cnf">&nbsp;</a>
</p><ul>
<li>
<a href="src/cnf_dmv.py">cnf_dmv.py</a>
</li>
<li>
<a href="src/cnf_harmonic.py">cnf_harmonic.py</a>
</li>
</ul>
<p>See section 5 of <a href="tex/formulas.pdf">formulas.pdf</a>.
</p>
<p>
Given a grammar with certain p_ATTACH, p_STOP and p_ROOT, we get:
<pre class="example">
&gt;&gt;&gt; print testgrammar_h()
h&gt;&lt;  --&gt; h&gt; STOP   [0.30]
h&gt;&lt;  --&gt; &gt;h&gt; STOP  [0.40]
_h_  --&gt; STOP h&gt;&lt;  [1.00]
_h_  --&gt; STOP &lt;h&gt;&lt; [1.00]
&gt;h&gt;  --&gt; h&gt; _h_    [1.00]
&gt;h&gt;  --&gt; &gt;h&gt; _h_   [1.00]
&lt;h&gt;&lt; --&gt; _h_ h&gt;&lt;   [0.70]
&lt;h&gt;&lt; --&gt; _h_ &lt;h&gt;&lt;  [0.60]
ROOT --&gt; STOP _h_  [1.00]
</pre>
</p>
</div>
<div id="outline-container-8.1" class="outline-3">
<h3 id="sec-8.1">8.1 <span class="todo">TODO</span> [#A] Make and implement an equivalent grammar that's <i>pure</i> CNF</h3>
<div id="text-8.1">
<p>&hellip;since I'm not sure about my unary reestimation rules (section 5 of
<a href="tex/formulas.pdf">formulas</a>).
</p>
<p>
For any rule where LHS is <code>_h_</code> we also have a corresponding one with
LHS <code>ROOT</code>, the only difference being that we multiply in <code>p_ROOT(h)</code>.
</p>
<p>
For any rule where LHS is <code>.h&gt;</code>, we use adjacent probabilities for the
left child; if LHS is <code>&lt;h.</code> we use adjacent probabilities for the right
child. Only <code>_h_</code> and <code>_h&gt;_</code> (plus <code>ROOT</code>) get to introduce the pre-terminal
<code>h</code> (where <code>h</code>, <code>ROOT</code> and <code>_h_</code> all rewrite to the terminal
<code>'h'</code>), and only <code>_h_</code> and <code>_h&gt;_</code> (plus <code>ROOT</code>) act as STOP
rules (ie. get to multiply in <code>p(STOP)</code>).
</p>
<p>
<pre class="example">
h    --&gt; 'h'       1
_h_  --&gt; 'h'       p(STOP|h,L,adj) * p(STOP|h,R,adj)
ROOT --&gt; 'h'       p(STOP|h,L,adj) * p(STOP|h,R,adj) * p_ROOT(h)

_h_  --&gt; h _a_     p(STOP|h,L,adj) * p(STOP|h,R,non) * p(a|h,R)*p(-STOP|h,R,adj)
_h_  --&gt; h .h&gt;     p(STOP|h,L,adj) * p(STOP|h,R,non)
.h&gt;  --&gt; _a_ _b_   p(a|h,R)*p(-STOP|h,R,adj) * p(b|h,R)*p(-STOP|h,R,non)
.h&gt;  --&gt; _a_ h&gt;    p(a|h,R)*p(-STOP|h,R,adj)
h&gt;   --&gt; _a_ _b_   p(a|h,R)*p(-STOP|h,R,non) * p(b|h,R)*p(-STOP|h,R,non)
h&gt;   --&gt; _a_ h&gt;    p(a|h,R)*p(-STOP|h,R,non)

_h_  --&gt; _a_ h     p(STOP|h,L,non) * p(STOP|h,R,adj) * p(a|h,L)*p(-STOP|h,L,adj)
_h_  --&gt; &lt;h. h     p(STOP|h,L,non) * p(STOP|h,R,adj)
&lt;h.  --&gt; _b_ _a_   p(b|h,L)*p(-STOP|h,L,non) * p(a|h,L)*p(-STOP|h,L,adj)
&lt;h.  --&gt; &lt;h _a_    p(a|h,L)*p(-STOP|h,L,adj)
&lt;h   --&gt; _a_ _b_   p(a|h,L)*p(-STOP|h,L,non) * p(b|h,L)*p(-STOP|h,L,non)
&lt;h   --&gt; &lt;h _a_    p(a|h,L)*p(-STOP|h,L,non)

_h_  --&gt; &lt;h. _h&gt;_  p(STOP|h,L,non)
_h_  --&gt; _a_ _h&gt;_  p(STOP|h,L,non) * p(a|h,L)*p(-STOP|h,L,adj)
_h&gt;_ --&gt; h .h&gt;     p(STOP|h,R,non)
_h&gt;_ --&gt; h _a_     p(STOP|h,R,non) * p(a|h,R)*p(-STOP|h,R,adj)

ROOT --&gt; h _a_     p(STOP|h,L,adj) * p(STOP|h,R,non) * p(a|h,R)*p(-STOP|h,R,adj) * p_ROOT(h)
ROOT --&gt; h .h&gt;     p(STOP|h,L,adj) * p(STOP|h,R,non) * p_ROOT(h)

ROOT --&gt; _a_ h     p(STOP|h,L,non) * p(STOP|h,R,adj) * p(a|h,L)*p(-STOP|h,L,adj) * p_ROOT(h)
ROOT --&gt; &lt;h. h     p(STOP|h,L,non) * p(STOP|h,R,adj) * p_ROOT(h)

ROOT --&gt; &lt;h. _h&gt;_  p(STOP|h,L,non) * p_ROOT(h)
ROOT --&gt; _a_ _h&gt;_  p(STOP|h,L,non) * p(a|h,L)*p(-STOP|h,L,adj) * p_ROOT(h)
</pre>
</p>
<p>
Since we have rules rewriting <code>h</code> to <code>a</code> and <code>b</code>, we have a rule-set
numbering more than n<sub>tags</sub><sup>2</sup>.
</p>
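<p>
As a sanity check on the rule table above, a small sketch that computes three of the weights from made-up <code>p_STOP</code>, <code>p_ATTACH</code> and <code>p_ROOT</code> values. The function names and key layouts are hypothetical (cnf_dmv.py may store these differently), and <code>p(-STOP|...)</code> is taken to be <code>1 - p(STOP|...)</code>.
</p>

```python
# Hypothetical table layouts; only the products follow the rule table.
def terminal_rule_weights(h, p_stop, p_root):
    """Weights of  _h_ --> 'h'  and  ROOT --> 'h'."""
    sealed = p_stop[(h, "L", "adj")] * p_stop[(h, "R", "adj")]
    return {("_h_", h): sealed, ("ROOT", h): sealed * p_root[h]}


def right_attach_weight(h, a, p_stop, p_attach):
    """Weight of  _h_ --> h _a_  (one adjacent right argument, then stop)."""
    go_right = p_attach[(a, h, "R")] * (1.0 - p_stop[(h, "R", "adj")])
    return p_stop[(h, "L", "adj")] * p_stop[(h, "R", "non")] * go_right


# Made-up probabilities for a single head tag.
p_stop = {("vbd", "L", "adj"): 0.5,
          ("vbd", "R", "adj"): 0.4,
          ("vbd", "R", "non"): 0.8}
p_root = {"vbd": 0.5}
p_attach = {("nn", "vbd", "R"): 0.25}

terminal = terminal_rule_weights("vbd", p_stop, p_root)
attach = right_attach_weight("vbd", "nn", p_stop, p_attach)
```

<p>
The ROOT weight differs from the <code>_h_</code> weight only by the factor <code>p_ROOT(h)</code>, as stated above.
</p>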
</div>
</div>
<div id="outline-container-8.2" class="outline-3">
<h3 id="sec-8.2">8.2 <span class="todo">TOGROK</span> [#A] convert L&amp;Y-based reestimation into P_ATTACH and P_STOP values</h3>
<div id="text-8.2">
<p>Sum over the various rules? Or something? Must think of this.
</p></div>
</div>
<div id="outline-container-8.3" class="outline-3">
<h3 id="sec-8.3">8.3 <span class="todo">TODO</span> [#C] move as much as possible into common_dmv.py</h3>
<div id="text-8.3">
<p><a href="src/common_dmv.py">common_dmv.py</a>
</p></div>
</div>
<div id="outline-container-8.4" class="outline-3">
<h3 id="sec-8.4">8.4 <span class="done">DONE</span> L&amp;Y-based reestimation for cnf_dmv</h3>
<div id="text-8.4">
<p><span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-08-21 Thu 16:35</span><br/>
</p></div>
</div>
<div id="outline-container-8.5" class="outline-3">
<h3 id="sec-8.5">8.5 <span class="done">DONE</span> dmv2cnf re-estimation formulas</h3>
<div id="text-8.5">
<p><span class="timestamp-kwd">CLOSED: </span> <span class="timestamp">2008-08-21 Thu 16:36</span><br/>
</p></div>
</div>
<div id="outline-container-8.6" class="outline-3">
<h3 id="sec-8.6">8.6 <span class="done">DONE</span> inner and outer for cnf_dmv.py, also cnf_harmonic.py </h3>
<div id="text-8.6">
</div>
</div>
</div>
<div id="outline-container-9" class="outline-2">
<h2 id="sec-9">9 [#C] Deferred</h2>
<div id="text-9">
<p><a href="http://wiki.python.org/moin/PythonSpeed/PerformanceTips">http://wiki.python.org/moin/PythonSpeed/PerformanceTips</a> Eg., use
map/reduce/filter/[i for i in [i's]]/(i for i in [i's]) instead of
for-loops; use local variables for globals (global variables or
functions), etc.
</p>
</div>
<div id="outline-container-9.1" class="outline-3">
<h3 id="sec-9.1">9.1 <span class="todo">TODO</span> Clean up reestimation code &nbsp;&nbsp;&nbsp;<span class="tag">PRETTIER</span></h3>
<div id="text-9.1">
</div>
</div>
<div id="outline-container-9.2" class="outline-3">
<h3 id="sec-9.2">9.2 <span class="todo">TODO</span> [#A] compare speed of w_left/right(&hellip;) and w(LEFT/RIGHT, &hellip;) &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span></h3>
<div id="text-9.2">
</div>
</div>
<div id="outline-container-9.3" class="outline-3">
<h3 id="sec-9.3">9.3 <span class="todo">TODO</span> when reestimating P_STOP etc, remove rules with p &lt; epsilon &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span></h3>
<div id="text-9.3">
</div>
</div>
<div id="outline-container-9.4" class="outline-3">
<h3 id="sec-9.4">9.4 <span class="todo">TODO</span> inner_dmv, short ranges and impossible attachment &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span></h3>
<div id="text-9.4">
<p>If s-t &lt;= 2, there can be only one attachment below, so don't recurse
with both Lattach=True and Rattach=True.
</p>
<p>
If s-t &lt;= 1, there can be no attachment below, so only recurse with
Lattach=False, Rattach=False.
</p>
<p>
Put this in the loop under rewrite rules (could also do it in the STOP
section, but that would only have an effect on very short sentences).
</p></div>
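<p>
One way to express this pruning is a small helper that lists the flag combinations worth recursing on. This is only a sketch: <code>attach_options</code> is a hypothetical name, and <code>width</code> stands for the quantity written <code>s-t</code> above, whose exact definition in inner_dmv is not assumed here.
</p>

```python
def attach_options(width):
    """(Lattach, Rattach) combinations worth recursing on for a given width."""
    if width <= 1:
        # No attachment can happen below.
        return [(False, False)]
    options = [(False, False), (True, False), (False, True)]
    if width > 2:
        # Only wider ranges have room for attachments on both sides.
        options.append((True, True))
    return options


narrow = attach_options(1)
medium = attach_options(2)
wide = attach_options(3)
```

<p>
The rewrite-rule loop would then iterate over <code>attach_options(width)</code> instead of over all four combinations.
</p>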
</div>
<div id="outline-container-9.5" class="outline-3">
<h3 id="sec-9.5">9.5 <span class="todo">TODO</span> clean up the module files &nbsp;&nbsp;&nbsp;<span class="tag">PRETTIER</span></h3>
<div id="text-9.5">
<p>Is there a better way to divide dmv and harmonic? There's a two-way
dependency between the modules. Guess there could be a third file that
imports both the initialization and the actual EM stuff, while a file
containing constants and classes could be imported by all others:
<pre class="example">
dmv.py imports dmv_EM.py    imports dmv_classes.py
dmv.py imports dmv_inits.py imports dmv_classes.py
</pre>
</p>
</div>
</div>
<div id="outline-container-9.6" class="outline-3">
<h3 id="sec-9.6">9.6 <span class="todo">TOGROK</span> Some (tagged) sentences are bound to come twice &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span></h3>
<div id="text-9.6">
<p>Eg, first sort and count, so that the corpus
[['nn','vbd','det','nn'],
['vbd','nn','det','nn'],
['nn','vbd','det','nn']]
becomes
[(['nn','vbd','det','nn'],2),
(['vbd','nn','det','nn'],1)]
and then in each loop through sentences, make sure we handle the
frequency correctly.
</p>
<p>
Is there much to gain here?
</p>
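<p>
A minimal sketch of the sort-and-count idea using <code>collections.Counter</code>. The helper names are hypothetical, and sentences are turned into tuples because lists cannot be dictionary keys.
</p>

```python
from collections import Counter


def count_sentences(corpus):
    """Collapse duplicate tag sequences into sorted (sentence, frequency) pairs."""
    counts = Counter(tuple(sent) for sent in corpus)
    return sorted(counts.items())


def weighted_total(counted, per_sentence):
    """Sum a per-sentence quantity, counting each distinct sentence freq times."""
    return sum(freq * per_sentence(sent) for sent, freq in counted)


corpus = [['nn', 'vbd', 'det', 'nn'],
          ['vbd', 'nn', 'det', 'nn'],
          ['nn', 'vbd', 'det', 'nn']]
counted = count_sentences(corpus)
# Same result as summing len() over the full corpus, in two steps instead of three.
total_tags = weighted_total(counted, len)
```

<p>
Any per-sentence quantity that is summed over the corpus (a log-likelihood, say) could be handled like <code>len</code> here; how much this saves depends on how many duplicates the corpus really has.
</p>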
</div>
</div>
<div id="outline-container-9.7" class="outline-3">
<h3 id="sec-9.7">9.7 <span class="todo">TOGROK</span> tags as numbers or tags as strings? &nbsp;&nbsp;&nbsp;<span class="tag">OPTIMIZE</span></h3>
<div id="text-9.7">
<p>Need to clean up the representation.
</p>
<p>
Stick with tag-strings in initialization then switch to numbers for
IO-algorithm perhaps? Can probably afford more string-matching in
initialization..
</p></div>
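<p>
One possible shape for the switch from strings to numbers, as a sketch only: <code>TagIndex</code> is a hypothetical class, not something in the current modules.
</p>

```python
class TagIndex:
    """Two-way mapping between tag strings and small integers."""

    def __init__(self):
        self.num_of = {}   # tag string -> integer
        self.tag_of = []   # integer -> tag string

    def num(self, tag):
        """Return the integer for a tag, assigning the next free one if new."""
        if tag not in self.num_of:
            self.num_of[tag] = len(self.tag_of)
            self.tag_of.append(tag)
        return self.num_of[tag]

    def encode(self, sent):
        return [self.num(tag) for tag in sent]

    def decode(self, nums):
        return [self.tag_of[n] for n in nums]


tags = TagIndex()
encoded = tags.encode(['nn', 'vbd', 'det', 'nn'])
decoded = tags.decode(encoded)
```

<p>
Initialization could keep working on strings and call <code>encode</code> once per sentence before the IO-algorithm starts; <code>decode</code> is only needed for printing.
</p>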
</div>
</div>
<div id="outline-container-10" class="outline-2">
<h2 id="sec-10">10 Adjacency and combining it with the inside-outside algorithm</h2>
<div id="text-10">
<p>Each DMV probability (for a certain PCFG node) has both an adjacent
and a non-adjacent probability. inner() and outer() need the correct
one in each case.
</p>
<p>
In each inner() call, loc_h is the location of the head of this
dependency structure. In each outer() call, it's the head of the <i>Node</i>,
the structure we're looking outside of.
</p>
<p>
We call inner() for each location of a head, and on each terminal,
loc_h must equal <code>i</code> (and <code>loc_h+1</code> equal <code>j</code>). In the recursive attachment
calls, we use the locations (sentence indices) of words to the left or
right of the head in calls to inner(). <i>loc_h lets us check whether we need probN or probA</i>.
</p>
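<p>
A sketch of how such a check could look, assuming spans are half-open ranges <code>[i, j)</code> (consistent with <code>loc_h == i</code> and <code>loc_h+1 == j</code> on terminals) and that the head is adjacent on a side exactly when the span does not yet extend past the head on that side. The function names are hypothetical and the actual test in inner() may be written differently.
</p>

```python
def is_adjacent(i, j, loc_h, direction):
    """True if the head at loc_h has nothing on the given side within [i, j)."""
    if direction == "L":
        return i == loc_h
    return j == loc_h + 1


def pick_prob(prob_a, prob_n, i, j, loc_h, direction):
    """Choose the adjacent (probA) or non-adjacent (probN) value."""
    return prob_a if is_adjacent(i, j, loc_h, direction) else prob_n


# Head at position 2; the span [1, 3) already covers one word to its left.
left_case = pick_prob(0.7, 0.2, 1, 3, 2, "L")
right_case = pick_prob(0.7, 0.2, 1, 3, 2, "R")
```

<p>
Here the left side gets the non-adjacent value and the right side the adjacent one, since the span stops right after the head.
</p>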
</div>
<div id="outline-container-10.1" class="outline-3">
<h3 id="sec-10.1">10.1 Possible alternate type of adjacency</h3>
<div id="text-10.1">
<p>K&amp;M's adjacency is just whether or not an argument has been generated
in the current direction yet. One could also make a stronger type of
adjacency, where h and a are not adjacent if b is in between, eg. with
the sentence "a b h" and the structure ((h-&gt;a), (a-&gt;b)), h is
K&amp;M-adjacent to a, but not next to a, since b is in between. It's easy
to check this type of adjacency in inner(), but it needs new rules for
P_STOP reestimation.
</p></div>
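<p>
The difference between the two notions, on the "a b h" example, as a small sketch. Both function names are hypothetical.
</p>

```python
def km_adjacent(args_so_far):
    """K&M adjacency: the head has generated no argument in this direction yet."""
    return args_so_far == 0


def next_to(loc_h, loc_a):
    """Stronger adjacency: no word lies between head and argument."""
    return abs(loc_h - loc_a) == 1


# "a b h" with structure ((h->a), (a->b)): positions a=0, b=1, h=2.
loc = {"a": 0, "b": 1, "h": 2}
h_to_a = (km_adjacent(0), next_to(loc["h"], loc["a"]))
a_to_b = (km_adjacent(0), next_to(loc["a"], loc["b"]))
```

<p>
For h->a the two checks disagree (K&amp;M-adjacent, but not next to each other), while for a->b they agree.
</p>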
</div>
</div>
<div id="outline-container-11" class="outline-2">
<h2 id="sec-11">11 Python-stuff</h2>
<div id="text-11">
<p>Make those debug statements steal a bit less attention in emacs:
<pre class="example">
(font-lock-add-keywords
 'python-mode ; not really regexp, a bit slow
 '(("^\\( *\\)\\(\\if +'.+' +in +io.DEBUG. *\\(
\\1 .+$\\)+\\)" 2 font-lock-preprocessor-face t)))
(font-lock-add-keywords
 'python-mode
 '(("\\&lt;\\(\\(io\\.\\)?debug(.+)\\)" 1 font-lock-preprocessor-face t)))
</pre>
</p>
<ul>
<li>
<a href="src/pseudo.py">pseudo.py</a>
</li>
<li>
<a href="http://nltk.org/doc/en/structured-programming.html">http://nltk.org/doc/en/structured-programming.html</a> recursive dynamic
</li>
<li>
<a href="http://nltk.org/doc/en/advanced-parsing.html">http://nltk.org/doc/en/advanced-parsing.html</a>
</li>
<li>
<a href="http://jaynes.colorado.edu/PythonIdioms.html">http://jaynes.colorado.edu/PythonIdioms.html</a>
</li>
</ul>
</div>
</div>
<div id="outline-container-12" class="outline-2">
<h2 id="sec-12">12 Git</h2>
<div id="text-12">
<p>Repository web page: <a href="http://repo.or.cz/w/dmvccm.git">http://repo.or.cz/w/dmvccm.git</a>
</p>
<p>
Setting up a new project:
<pre class="example">
git init
git add .
git commit -m "first release"
</pre>
</p>
<p>
Later on: (<code>-a</code> stages modified and deleted tracked files automatically; new files still need <code>git add</code>)
<pre class="example">
git commit -a -m "some subsequent release"
</pre>
</p>
<p>
Then push stuff up to the remote server:
<pre class="example">
git push git+ssh://username@repo.or.cz/srv/git/dmvccm.git master
</pre>
</p>
<p>
(<code>eval `ssh-agent`</code> and <code>ssh-add</code> to avoid having to type in keyphrase all
the time)
</p>
<p>
Make a copy of the (remote) master branch:
<pre class="example">
git clone git://repo.or.cz/dmvccm.git
</pre>
</p>
<p>
Make and name a new branch in this folder
<pre class="example">
git checkout -b mybranch
</pre>
</p>
<p>
To save changes in <code>mybranch</code>:
<pre class="example">
git commit -a
</pre>
</p>
<p>
Go back to the master branch (uncommitted changes from <code>mybranch</code> are
carried over):
<pre class="example">
git checkout master
</pre>
</p>
<p>
Try out:
<pre class="example">
git add --interactive
</pre>
</p>
<p>
Good tutorial:
<a href="http://www-cs-students.stanford.edu/~blynn//gitmagic/">http://www-cs-students.stanford.edu/~blynn//gitmagic/</a>
</p></div>
</div>
<div id="postamble"><p class="author"> Author: Kevin Brubeck Unhammer
<a href="mailto:K.BrubeckUnhammer at student uva nl ">&lt;K.BrubeckUnhammer at student uva nl &gt;</a>
</p>
<p class="date"> Date: 2008-09-21 19:00:58 CEST</p>
<p>HTML generated by <a href='http://orgmode.org/'>org-mode</a> 6.06b in emacs 22</p>
</div><script src="./post-script.js" type="text/JavaScript">
</script></body>
</html>