DMVCCM.org.~2~

   1 # -*- coding: mule-utf-8-unix -*-
   2 #+STARTUP: overview
   3 #+TAGS: OPTIMIZE
   4 #+STARTUP: hidestars
   5 #+TITLE: DMV/CCM
   6 #+AUTHOR: Kevin Brubeck Unhammer
   7 #+EMAIL: K.BrubeckUnhammer at student uva nl
   8 #+LANGUAGE: en
   9 #+SEQ_TODO: TOGROK TODO DONE
  10 #+OPTIONS: H:4 toc:3
  11
  12 * dmvccm
  13   DEADLINE: <2008-06-30 Mon>
  14 (But absolute, extended, really-quite-dead-now deadline: August 31...)
  15 [[file:src/dmv.py][dmv.py]]
  16 [[file:src/io.py][io.py]]
  17 ** TODO [#A] P_STOP
  18 [[file:src/dmv.py::DMV%20probabilities][dmv-P_STOP]]
  19 *** Remember: The P_STOP formula is upside-down (left-to-right also)
  20 *** How is the P_STOP formula different given other values for dir and adj?
  21 (Presumably, the P_STOP formula where STOP is True is just the
  22 rule-probability of _h_ -> STOP h_ or h_ -> h STOP, but how does
  23 adjacency fit in here?)
  24
  25 (And P_STOP(-STOP|...) = 1 - P_STOP(STOP|...) )
  26 ** TODO P_CHOOSE
  27 Write the formulas! should be easy?
  28 ** TOGROK Some (tagged) sentences are bound to come twice             :OPTIMIZE:
  29 Eg, first sort and count, so that the corpus
  30 [['nn','vbd','det','nn'],
  31  ['vbd','nn','det','nn'],
  32  ['nn','vbd','det','nn']]
  33 becomes
  34 [(['nn','vbd','det','nn'],2),
  35  (['vbd','nn','det','nn'],1)]
  36 and then in each loop through sentences, make sure we handle the
  37 frequency correctly.
  38
  39 Is there much to gain here?
  40 ** Initialization
  41 [[file:~/Documents/Skole/V08/Probability/dmvccm/src/dmv.py::Initialization%20todo][dmv-inits]]
  42
  43 We do have to go through the corpus, since the probabilities are based
  44 on how far away in the sentence arguments are from their heads.
  45 *** DONE DMV Initialization frequencies
  46     CLOSED: [2008-05-27 Tue 20:04]
  47 **** P_STOP
  48 P_STOP is not well defined by K&M. One possible interpretation given
  49 the sentence [det nn vb nn] is
  50 - f_STOP( STOP|det, L, adj) +1
  51 - f_STOP(-STOP|det, L, adj) +0
  52 - f_STOP( STOP|det, L, non_adj) +1
  53 - f_STOP(-STOP|det, L, non_adj) +0
  54 - f_STOP( STOP|det, R, adj) +0
  55 - f_STOP(-STOP|det, R, adj) +1
  56
  57 - f_STOP( STOP|nn, L, adj) +0
  58 - f_STOP(-STOP|nn, L, adj) +1
  59 - f_STOP( STOP|nn, L, non_adj) +1  # since there's at least one to the left
  60 - f_STOP(-STOP|nn, L, non_adj) +0
  61
  62 **** P_CHOOSE
  63 Go through the corpus, counting distances between heads and
  64 arguments. In [det nn vb nn], we give
  65 - f_CHOOSE(nn|det, R) +1/1 + C
  66 - f_CHOOSE(vb|det, R) +1/2 + C
  67 - f_CHOOSE(nn|det, R) +1/3 + C
  68   - If this were the full corpus, P_CHOOSE(nn|det, R) would have
  69     (1+1/3+2C) / sum_a f_CHOOSE(a|det, R)
  70
  71 The ROOT gets "each argument with equal probability", so in a sentence
  72 of three words, 1/3 for each (in [nn vb nn], 'nn' gets 2/3). Basically
  73 just a frequency count of the corpus...
  74 *** TOGROK DMV Initialization probabilities (from initialization frequency)
  75 *** TOGROK CCM Initialization
  76 P_SPLIT used here... how, again?
  77
  78
  79
  80 ** TOGROK Adjacency and combining it with inner()
  81 ** TOGROK Is the E-step in DMV just inner() on the full sentence?
  82 and What exactly is the M-step of DMV?
  83 *** COMMENT
  84 E: «For the initial symbols, I think we're supposed to
  85 implement someone-or-other's POS tagger and use the parts of speech as
  86 our tags. (K&M say they used parts of speech, anyway, and we can't use
  87 the ones from the annotated corpus or else we won't be unsupervised
  88 anymore.) I think you're right about how the tree is formed. The only
  89 thing I'm not sure about is whether those left and right marks are
  90 supposed to be part of the tree. I guess not because in a well-formed
  91 tree, every word needs a stop on both sides, so actually all nodes
  92 would be left and right marked in the end, which is the same as not
  93 using them at all. So I guess they're relevant in the E step of EM,
  94 but not the M step.» [[https://mail.google.com/mail/%3Ffs%3D1&tf%3D1&source%3Datom&view%3Dcv&search%3Dall&th%3D119b3e3d7d8f1d1f&shva%3D1&ui%3D1&disablechatbrowsercheck%3D1][gmail]]
  95
  96 K: About initialization, when they say they start with an M-step and get
  97 a harmonic distribution (where close dependencies have higher
  98 probabilities than far-away ones), I guess we could do this either in
  99 a function of its own, or as a special case of the IO-algorithm -- I'm
 100 not sure how different the harmonizing step will be from the regular
 101 M-step. But I'm pretty sure harmonizing means having to go through the
 102 whole corpus, since that's the only way we can find out distances for
 103 each dependency (without having actual distance-numbers in the rules
 104 in the first place). If sentence s is ABCD, rules from A to B should
 105 have higher probability than rules from A to D. Assuming we find some
 106 average distance between A and B, and A and D, etc. throughout the
 107 corpus, we then have to turn those numbers into probabilities that sum
 108 to 1; do you have any idea how to do that?
 109
 110 E: I think what it means to start with an M step is that we set the
 111 probabilities without regard for the initial corpus. They say that
 112 they start with an initial guess at "posterior distributions". I bet
 113 the word "posterior" is important, but I don't know what it means
 114 here. Do you?
 115 ** TODO [#C] Corpus access
 116 ** TOGROK How do we interpret DMV as an inside/outside process?
 117 c_s(x : i, j) is "the expected fraction of parses of s" with x from
 118 i to j; expectation then uses the probabilities gotten from
 119 initialization and previously gained probabilities, but these are of
 120 the form P_STOP and P_CHOOSE, how do we translate this to inside
 121 outside, which just uses the probabilities of CFG-rules?
 122
 123 ** Technical: sentences or rules as the "outer loop"?
 124 ** CCM-stuff
 125 ** DONE How do we know whether we are 'adjacent' or not?
 126 **** One configuration that I'm fairly certain of: right w/CHOOSE
 127 if we have
 128 \Tree [_b [_b b _c_ ] _d_ ]
 129 then the lower tree [_b b _c_ ] is adjacent since, working your way up
 130 the tree, no argument has been created to the right "yet"; while the
 131 outer tree [_b [_b ... ] _d_ ] is non-adjacent, since there is something in
 132 between... Is it thus always adjacent to the right if the distance
 133 is 2? (That is, in e(s,t,i) for the adjacent rule: t - s == 2; while
 134 in the non_adj rule: t - s == 4)
 135 ***** Implementing this:
 136 Two different DMVRules? Or just two different prob-values per rule?
 137 **** left w/CHOOSE
 138 Same deal here?
 139 **** R/L without CHOOSE, the "sealing operations"
 140 _h_ -> STOP h_ and h_ -> h STOP
 141
 142 What is "adjacency" here? That t - s == 1?
 143
 144
 145
 146
 147 * Python-stuff
 148 - [[file:src/pseudo.py][pseudo.py]]
 149 - http://nltk.org/doc/en/structured-programming.html recursive dynamic
 150 - http://nltk.org/doc/en/advanced-parsing.html
 151
 152