From 65e9744d5b8cda576efaa04405366c28030d2901 Mon Sep 17 00:00:00 2001
From: Kevin Brubeck Unhammer
Date: Mon, 16 Jun 2008 00:06:59 +0200
Subject: [PATCH] PSTOP done, PCHOOSE close to done--but still gives certain p's > 1.0, needs debugging

---
 DMVCCM.html        | 423 +++++++++++++++++++++++++++++++----------------------
 DMVCCM.org         | 122 ++++++---------
 DMVCCM.org_archive |  81 +++++++++-
 src/dmv.py         |  59 +++++---
 src/dmv.pyc        | Bin 24992 -> 24352 bytes
 5 files changed, 412 insertions(+), 273 deletions(-)

diff --git a/DMVCCM.html b/DMVCCM.html
index e72cb06..a202c0e 100755
--- a/DMVCCM.html
+++ b/DMVCCM.html
@@ -6,26 +6,59 @@ lang="en" xml:lang="en">
 DMV/CCM – todo-list / progress
-
+

DMV/CCM – todo-list / progress

+

Table of Contents

-

2 DONE outer probabilities

+

2 P_STOP and P_CHOOSE for IO/EM (reestimation)

+

dmv-P_STOP
+Remember: The PSTOP formula is upside-down (left-to-right also).
+(In the article, not the thesis.)
+

+ +
+ +
+

2.1 TODO [#A] P_CHOOSE formula:

+
+ +
    +
  • +Note taken on 2008-06-15 Sun 23:41
    +Mostly done, although there is something odd with either the formulas
    +or the way I calculate them; I get probabilities greater than 1.0 after
    +doing r.probN = (1-p_stopN)*p_choose and r.probA = (1-p_stopA)*p_choose.
    +
    +
  • +
+ +
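One way to narrow down where the probabilities above 1.0 come from is to check P_CHOOSE on its own before it is multiplied with (1 - P_STOP). The sketch below is a hypothetical debugging helper, not code from src/dmv.py: the function name and the dict layout keyed by (arg, head, direction) are assumptions made for illustration.

```python
from collections import defaultdict


def check_p_choose(p_choose, tol=1e-9):
    """Return a list of problems found in a P_CHOOSE table.

    p_choose: dict mapping (arg, head, direction) -> probability.
    This layout is hypothetical, not the one used in src/dmv.py.
    """
    problems = []
    totals = defaultdict(float)
    for (arg, head, direction), p in p_choose.items():
        # Each single entry must be a valid probability.
        if not 0.0 <= p <= 1.0 + tol:
            problems.append(("out of range", (arg, head, direction), p))
        totals[(head, direction)] += p
    # For a fixed head and direction, P_CHOOSE is a distribution over
    # arguments, so the entries cannot sum to more than 1.
    for key, total in sorted(totals.items()):
        if total > 1.0 + tol:
            problems.append(("sum exceeds 1", key, total))
    return problems
```

If this already reports problems, the reestimated P_CHOOSE is at fault and the P_STOP factor can be ruled out as the source.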

Assuming this: +

+ ++ + + + + + + +
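\[
P_{CHOOSE}(a : h, R) =
\frac{\sum_{corpus} \sum_{s = loc(h)} \sum_{t > loc(h)} \sum_{loc(h) < r \le t} c(s, r-1, h\_, \ldots) \cdot c(r, t, \_a\_, \ldots)}
     {\sum_{corpus} \sum_{s = loc(h)} \sum_{t > loc(h)} c(s, t, h, \ldots)}
\]
\[
P_{CHOOSE}(a : h, L) =
\frac{\sum_{corpus} \sum_{s < loc(h)} \sum_{t \ge loc(h)} \sum_{r < loc(h)} c(s, r, \_a\_, \ldots) \cdot c(r+1, t, h\_, \ldots)}
     {\sum_{corpus} \sum_{s < loc(h)} \sum_{t \ge loc(h)} c(s, t, h\_, \ldots)}
\]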
+
+t >= loc(h) since there are many possibilities for right-attachments
+below, and each of them alone gives a lower probability (through
+multiplication) to the upper tree (so add them all).
+
+

+The reason we have to check both children of the attachments is that we
+have to make sure they are contiguous (otherwise we would have no way
+of ruling out e.g. h_->_b_,_b_->b_->_a_, where h_ covers s and t, _b_ is
+from s to x<r and _a_ is from s to r).
+

+ +
+ +
+

2.2 DONE P_STOP formulas for various dir and adj:

+
+ +

CLOSED: 2008-06-15 Sun 23:40
+Assuming this: +

+ ++ + + + + + + + + + + + + +
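\[
P_{STOP}(STOP : h, L, \mathit{non\_adj}) =
\frac{\sum_{corpus} \sum_{s < loc(h)} \sum_{t \ge loc(h)} c(s, t, \_h\_, \ldots)}
     {\sum_{corpus} \sum_{s < loc(h)} \sum_{t \ge loc(h)} c(s, t, h\_, \ldots)}
\]
\[
P_{STOP}(STOP : h, L, \mathit{adj}) =
\frac{\sum_{corpus} \sum_{s = loc(h)} \sum_{t \ge loc(h)} c(s, t, \_h\_, \ldots)}
     {\sum_{corpus} \sum_{s = loc(h)} \sum_{t \ge loc(h)} c(s, t, h\_, \ldots)}
\]
\[
P_{STOP}(STOP : h, R, \mathit{non\_adj}) =
\frac{\sum_{corpus} \sum_{s = loc(h)} \sum_{t > loc(h)} c(s, t, h\_, \ldots)}
     {\sum_{corpus} \sum_{s = loc(h)} \sum_{t > loc(h)} c(s, t, h, \ldots)}
\]
\[
P_{STOP}(STOP : h, R, \mathit{adj}) =
\frac{\sum_{corpus} \sum_{s = loc(h)} \sum_{t = loc(h)} c(s, t, h\_, \ldots)}
     {\sum_{corpus} \sum_{s = loc(h)} \sum_{t = loc(h)} c(s, t, h, \ldots)}
\]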
+ + +

+(And PSTOP(-STOP|…) = 1 - PSTOP(STOP|…) ) +
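The four P_STOP ratios above differ only in which spans are summed over and which seal states go in the numerator and denominator, so they can be sketched as a single function. This is a minimal illustration, not the implementation in src/dmv.py: the function name, the flat record layout (s, t, seal, head, loc, count), and the seal labels "h", "h_", "_h_" as strings are all assumptions made here.

```python
def p_stop(records, head, direction, adj):
    """Estimate P_STOP(STOP | head, direction, adj) from expected counts.

    records: iterable of (s, t, seal, head, loc, count) tuples collected
    over the whole corpus (hypothetical layout). seal is one of
    "h", "h_" or "_h_"; loc is loc(h) in the sentence the count is from.
    """
    def in_range(s, t, loc):
        if direction == "L":
            # adj: s == loc(h); non_adj: s < loc(h); both need t >= loc(h)
            return (s == loc if adj else s < loc) and t >= loc
        # right: s == loc(h); adj: t == loc(h); non_adj: t > loc(h)
        return s == loc and (t == loc if adj else t > loc)

    # Left stops take h_ to _h_; right stops take h to h_.
    num_seal, den_seal = ("_h_", "h_") if direction == "L" else ("h_", "h")
    num = den = 0.0
    for s, t, seal, h, loc, count in records:
        if h != head or not in_range(s, t, loc):
            continue
        if seal == num_seal:
            num += count
        elif seal == den_seal:
            den += count
    # Returning 0.0 for an unseen head is an arbitrary choice of this sketch.
    return num / den if den > 0.0 else 0.0
```

The complement then follows directly, e.g. `1.0 - p_stop(records, "vb", "R", True)` for P_STOP(-STOP | vb, R, adj).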

+
+ +
+ +
+

3 DONE outer probabilities

+
+

CLOSED: 2008-06-12 Thu 11:11
  There are 6 different configurations here, based on what the above
@@ -197,81 +316,7 @@ Also, unlike in
-

3 TODO [#B] P_STOP and P_CHOOSE for IO/EM (reestimation)

-
- -

dmv-P_STOP
-Remember: The PSTOP formula is upside-down (left-to-right also).
-(In the article, not the thesis.)
-

-
    -
  • P_STOP formulas for various dir and adj:
    -Assuming this: - - -- - - - - - - - - - - - - -
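    \[
    P_{STOP}(STOP : h, L, \mathit{non\_adj}) =
    \frac{\sum_{corpus} \sum_{s < loc(h)} \sum_{t \ge loc(h)} c(s, t, \_h\_, \ldots)}
         {\sum_{corpus} \sum_{s < loc(h)} \sum_{t \ge loc(h)} c(s, t, h\_, \ldots)}
    \]
    \[
    P_{STOP}(STOP : h, L, \mathit{adj}) =
    \frac{\sum_{corpus} \sum_{s = loc(h)} \sum_{t \ge loc(h)} c(s, t, \_h\_, \ldots)}
         {\sum_{corpus} \sum_{s = loc(h)} \sum_{t \ge loc(h)} c(s, t, h\_, \ldots)}
    \]
    \[
    P_{STOP}(STOP : h, R, \mathit{non\_adj}) =
    \frac{\sum_{corpus} \sum_{s = loc(h)} \sum_{t > loc(h)} c(s, t, h\_, \ldots)}
         {\sum_{corpus} \sum_{s = loc(h)} \sum_{t > loc(h)} c(s, t, h, \ldots)}
    \]
    \[
    P_{STOP}(STOP : h, R, \mathit{adj}) =
    \frac{\sum_{corpus} \sum_{s = loc(h)} \sum_{t = loc(h)} c(s, t, h\_, \ldots)}
         {\sum_{corpus} \sum_{s = loc(h)} \sum_{t = loc(h)} c(s, t, h, \ldots)}
    \]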
    - - -

    -(And PSTOP(-STOP|…) = 1 - PSTOP(STOP|…) ) -

  • -
  • P_CHOOSE formula:
    -Assuming this: - - -- - - - - - - -
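    \[
    P_{CHOOSE}(a : h, R) =
    \frac{\sum_{corpus} \sum_{s = loc(h)} \sum_{t > loc(h)} \sum_{loc(h) < r \le t} c(s, r-1, h\_, \ldots) \cdot c(r, t, \_a\_, \ldots)}
         {\sum_{corpus} \sum_{s = loc(h)} \sum_{t > loc(h)} c(s, t, h, \ldots)}
    \]
    \[
    P_{CHOOSE}(a : h, L) =
    \frac{\sum_{corpus} \sum_{s < loc(h)} \sum_{t \ge loc(h)} \sum_{r < loc(h)} c(s, r, \_a\_, \ldots) \cdot c(r+1, t, h\_, \ldots)}
         {\sum_{corpus} \sum_{s < loc(h)} \sum_{t \ge loc(h)} c(s, t, h\_, \ldots)}
    \]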
    -
    -t >= loc(h) since there are many possibilities for right-attachments
    -below, and each of them alone gives a lower probability (through
    -multiplication) to the upper tree (so add them all).
    -
    -

    -The reason we have to check both children of the attachments is that we
    -have to make sure they are contiguous (otherwise we would have no way
    -of ruling out e.g. h_->_b_,_b_->b_->_a_, where h_ covers s and t, _b_ is
    -from s to x<r and _a_ is from s to r).
    -

  • -
  • DONE [#A] How do we only count from completed trees?
    -CLOSED: 2008-06-13 Fri 11:40
    -Use c(s,t,Node); inner * outer / P_sent - -
  • -
  • DONE [#A] c(s,t,Node)
    -CLOSED: 2008-06-13 Fri 11:38
    -= inner * outer / P_sent -

    -implemented as inner * outer / inner_sent -

    -
  • -
@@ -301,11 +346,15 @@ for rewrites, for loc_dep select the one that has the highest p

Page 108 (pdf: 124) of Klein's thesis gives info about this. -

    -
  • TODO Make inner() and outer() also allow left-first attachment
    -Using PORDER(left-first | w) etc. -
  • -
+

+
+ +
+

5.1 TODO Make inner() and outer() also allow left-first attachment

+
+ +

Using PORDER(left-first | w) etc. +

@@ -327,22 +376,50 @@ Using PORDER(left-first | w) etc.

We go through the corpus, since the probabilities are based on how far away in the sentence arguments are from their heads. -