% merged Kevin's & Emily's reports 2008-09-21 Sun 10:39 with Emerge

\documentclass[11pt,a4paper]{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{hyperref}
\usepackage{natbib}

\usepackage{pslatex}
\usepackage{pdfsync}
\pdfoutput=1

\usepackage{qtree}
\usepackage{amsmath}
\usepackage{amssymb}

\usepackage{avm}
\avmfont{\sc}
\avmoptions{sorted,active}
\avmvalfont{\rm}
\avmsortfont{\scriptsize\it}

\usepackage{array} % for a tabular with math in it

\title{The DMV and CCM models}
\author{Emily Morgan \& Kevin Brubeck Unhammer}
\date{15 September 2008}

\begin{document}

\maketitle

\tableofcontents

\section{Introduction}
\citet{klein-thesis} describes two models for the unsupervised
learning of syntactic structure. One is the Constituent Context Model
(CCM), which uses the context as well as the content of a substring to determine whether it
is a constituent. The context of a substring is defined to be the words
immediately preceding and following the substring. The intuition is that context can be a reliable cue for constituentness: when a long phrase is a constituent, there is often a similar case where a single word serves as a constituent in the same context. Thus, to determine the
likelihood that a substring is a constituent, this model considers both
the likelihood of the substring itself as a constituent and the
likelihood of its context as the context of a constituent.

The other is a Dependency Model with Valence (DMV), which uses a form
of Dependency Grammar. Here sentence structure is represented as pairs
of heads and dependents (arguments), although it is also possible to
represent this as a CFG (see \citet[p.~106]{klein-thesis}).

Apart from the identity of the argument, probabilities are conditional on
the direction of attachment. A valence condition is introduced by using
stopping probabilities: a head may take zero, one or more dependents
before ``stopping'', and there are different probabilities for
stopping without taking arguments (\emph{adjacently}) versus after
taking one or more (\emph{non-adjacently}).

Versions of the inside-outside algorithm for PCFGs \citep{lari-csl90}
are used for running expectation maximization on both these probabilistic
models. We tried implementing both the CCM and the DMV. Below we give
the details of how this was done, along with the results.

\section{The Constituent Context Model}
\subsection{Basics}
The CCM model \citep{klein-thesis, km-ccm} is a generative model over a set of Part Of Speech (POS) tags $\mathbb{T}$ and a beginning/end of sentence symbol $\diamond$.
The CCM model has two parameters:
\begin{itemize}
\item $P_{\text{SPAN}}(\alpha | c)$ is the probability of generating a span $\alpha$ given that it is/is not a constituent; $\alpha \in \mathbb{T}^*, c \in \{\text{true}, \text{false}\}$.
\item $P_{\text{CONTEXT}}(\beta | c)$ is the probability of generating a context $\beta$ given that the span it surrounds is/is not a constituent; $\beta \in (\mathbb{T}\cup\diamond)^2, c \in \{\text{true}, \text{false}\}$.
\end{itemize}
The probability of a sentence $s$ with bracketing $B$ is the probability of that bracketing times the probability of the span and the context for each substring of $s$. The probability of a bracketing is given by the fixed distribution $P_\text{bin}$, which assigns equal weight to all bracketings that correspond to binary trees. For a sentence $s$ of length $n$, we index the spaces between words in the sentence, including the leading space (0) and the trailing space ($n$). Let $\langle i,j\rangle_s$ be the substring from $i$ to $j$, and $\rangle i,j\langle_s$ be the context of that substring. Then we have: $$P(s,B)=P_{\text{bin}}(B)\displaystyle\prod_{\{\langle i,j\rangle_s\}}P_{\text{SPAN}}(\langle i,j\rangle_s|B_{ij})P_{\text{CONTEXT}}(\rangle i,j\langle_s|B_{ij})$$

\subsection{Reestimation}
We use the EM algorithm to induce our model parameters. Given the parameters $P_{\text{SPAN}}$ and $P_{\text{CONTEXT}}$, we wish to first find the likelihoods of completions for all sentences in a corpus given the current parameter values, and then find new parameters that maximize the corpus likelihood given the previously predicted completions. Because enumerating all bracketings for all sentences is computationally infeasible, we use a dynamic programming algorithm similar to the inside-outside algorithm.

We begin by rewriting the probability $P(s,B)$ given above:
$$P(s,B)=P_{\text{bin}}(B)\displaystyle\prod_{\{\langle i,j\rangle_s\}}P_{\text{SPAN}}(\langle i,j\rangle_s|\text{false})P_{\text{CONTEXT}}(\rangle i,j\langle_s|\text{false}) \times$$$$ \prod_{\{\langle i,j\rangle_s\in B\}}\frac{P_{\text{SPAN}}(\langle i,j\rangle_s|\text{true})P_{\text{CONTEXT}}(\rangle i,j\langle_s|\text{true})}{P_{\text{SPAN}}(\langle i,j\rangle_s|\text{false})P_{\text{CONTEXT}}(\rangle i,j\langle_s|\text{false})}$$
We can then condense this by letting $$\phi(i,j,s)=\frac{P_{\text{SPAN}}(\langle i,j\rangle_s|\text{true})P_{\text{CONTEXT}}(\rangle i,j\langle_s|\text{true})}{P_{\text{SPAN}}(\langle i,j\rangle_s|\text{false})P_{\text{CONTEXT}}(\rangle i,j\langle_s|\text{false})}$$
Further, note that $P_{\text{bin}}$ is constant for all bracketings of a sentence. We can thus also write $$K(s)=P_{\text{bin}}(B)\displaystyle\prod_{\{\langle i,j\rangle_s\}}P_{\text{SPAN}}(\langle i,j\rangle_s|\text{false})P_{\text{CONTEXT}}(\rangle i,j\langle_s|\text{false})$$ and $K(s)$ is constant for a given sentence.
We thus have a condensed formulation of $P(s,B)$:
$$P(s,B)=K(s) \prod_{\{\langle i,j\rangle_s\in B\}}\phi(i,j,s)$$

In the spirit of the inside-outside algorithm, we can use a dynamic programming algorithm to calculate the product of $\phi(i,j,s)$ for a sentence. We wish to calculate the products of the $\phi(i,j,s)$ both inside and outside a given interval $\langle i,j\rangle$. Let $\mathcal{T}(m)$ be the set of binary tree structures with $m$ leaves. We then define $$I(i,j,s)=\displaystyle\sum_{T\in \mathcal{T}(j-i)}\prod_{\langle a,b\rangle:\langle a-i,b-i\rangle\in T}\phi(a,b,s)$$
To calculate $I$, we want to find a recursive decomposition. In the case where $j=i$, there can be no constituents between $i$ and $j$, so we have $I(i,i,s)=0$. In the case where $j=i+1$, there is necessarily exactly one constituent between $i$ and $j$, because all one-word substrings are constituents, so $I(i,i+1,s)=\phi(i,i+1,s)$. In the case where $j>i+1$, any constituent spanning from $i$ to $j$ must also contain subconstituents from $i$ to $k$ and from $k$ to $j$. Thus
$$I(i,j,s)=\displaystyle\sum_{i<k<j}\sum_{T\in\mathcal{T}(k-i)}\sum_{T'\in\mathcal{T}(j-k)}\phi(i,j,s)\prod_{\langle a,b\rangle:\langle a-i,b-i\rangle\in T}\phi(a,b,s)\prod_{\langle a',b'\rangle:\langle a'-k,b'-k\rangle\in T'}\phi(a',b',s)$$
$$=\phi(i,j,s)\displaystyle\sum_{i<k<j}\sum_{T\in\mathcal{T}(k-i)}\sum_{T'\in\mathcal{T}(j-k)}\prod_{\langle a,b\rangle:\langle a-i,b-i\rangle\in T}\phi(a,b,s)\prod_{\langle a',b'\rangle:\langle a'-k,b'-k\rangle\in T'}\phi(a',b',s)$$
$$=\phi(i,j,s)\displaystyle\sum_{i<k<j}I(i,k,s)I(k,j,s)$$

In summary, we have a recursive decomposition for $I$:
$$I(i,j,s) = \begin{cases}
\phi(i,j,s)\displaystyle\sum_{i<k<j}I(i,k,s)I(k,j,s) & \text{if } j-i>1 \\
\phi(i,j,s) & \text{if } j-i=1 \\
0 & \text{if } j-i = 0
\end{cases}$$
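
As an illustration, this recursion maps directly onto a memoized function. The following is a minimal Python sketch, assuming a hypothetical function \texttt{phi(i, j)} that returns $\phi(i,j,s)$ for the current sentence (the names and interface are ours, not those of the implementation):
\begin{verbatim}
from functools import lru_cache

def make_inside(phi):
    """I(i, j): sum over binary trees on the span (i, j) of the
    product of phi over their spans, following the recursive
    decomposition above."""
    @lru_cache(maxsize=None)
    def I(i, j):
        if j == i:
            return 0.0
        if j == i + 1:
            return phi(i, j)
        return phi(i, j) * sum(I(i, k) * I(k, j)
                               for k in range(i + 1, j))
    return I
\end{verbatim}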

Similarly, we define the outside probability
$$O(i,j,s) = \begin{cases}
\displaystyle\sum_{0\leq k<i}I(k,i,s)\phi(k,j,s)O(k,j,s) + \sum_{j<k\leq n} I(j,k,s)\phi(i,k,s)O(i,k,s) &
\text{if } j-i < n \\
1 & \text{if } j-i = n
\end{cases}$$

To reestimate the parameters $P_{\text{SPAN}}$ and $P_{\text{CONTEXT}}$, we wish to know $P_{\text{BRACKET}}(i,j,s)$, the fraction of bracketings of $s$ that contain $\langle i,j\rangle$ as a constituent:
$$P_{\text{BRACKET}}(i,j,s)=\frac{\displaystyle\sum_{\{B:B_{ij}=\text{true}\}}P(s,B)}{\displaystyle\sum_{B}P(s,B)}$$
We can rewrite this in terms of our inside and outside probabilities:
$$\displaystyle\sum_{\{B:B_{ij}=\text{true}\}}P(s,B)=K(s)I(i,j,s)O(i,j,s)$$

$$\displaystyle\sum_{B}P(s,B)=K(s)I(0,n,s)O(0,n,s)$$
But the $K(s)$ terms cancel, and $O(0,n,s)=1$, so we find that $$P_{\text{BRACKET}}(i,j,s)=\frac{I(i,j,s)O(i,j,s)}{I(0,n,s)}$$
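
A corresponding sketch of the outside recursion and of $P_{\text{BRACKET}}$, again assuming the hypothetical \texttt{phi(i, j)} function and the \texttt{make\_inside} sketch above:
\begin{verbatim}
from functools import lru_cache

def make_outside(phi, I, n):
    """O(i, j) for a sentence of n words, following the cases above;
    I is the (memoized) inside function."""
    @lru_cache(maxsize=None)
    def O(i, j):
        if j - i == n:
            return 1.0
        left = sum(I(k, i) * phi(k, j) * O(k, j) for k in range(0, i))
        right = sum(I(j, k) * phi(i, k) * O(i, k)
                    for k in range(j + 1, n + 1))
        return left + right
    return O

def p_bracket(I, O, n, i, j):
    """Fraction of bracketings of the sentence containing <i,j>."""
    return I(i, j) * O(i, j) / I(0, n)
\end{verbatim}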

From $P_{\text{BRACKET}}$ we can calculate the next iteration of the model parameters $P_{\text{SPAN}}$ and $P_{\text{CONTEXT}}$. First, for a sentence $s$ of length $n_s$, note that the number of spans in the sentence is $N_s=(n_s+1)(n_s+2)/2$, which includes the spans of length 0. Furthermore, let $\#_s(\gamma)$ be the number of times that $\gamma$ appears in $s$, where $\gamma$ can be either a span or a context.

We begin with
$$P_{\text{SPAN}}(\alpha|c)=\displaystyle\sum_{s\in C}P(\alpha|c,s)P(s)=\frac{1}{|C|}\sum_{s\in C}P(\alpha|c,s)$$
This reduction works for $P_{\text{CONTEXT}}$ as well, so we have reduced the problem of calculating $P_{\text{SPAN}}$ and $P_{\text{CONTEXT}}$ on the corpus as a whole to calculating it for each sentence.

Let us begin by calculating $P_{\text{SPAN}}(\alpha|\text{true},s)$. By Bayes' Theorem, $$P_{\text{SPAN}}(\alpha|\text{true},s)=\frac{P_{\text{SPAN}}(\text{true}|\alpha,s)P_{\text{SPAN}}(\alpha|s)}{P_{\text{SPAN}}(\text{true}|s)}$$
We can compute each of the quantities on the right hand side:
$$P_{\text{SPAN}}(\alpha|s)=\frac{\#_s(\alpha)}{N_s}$$
$$P_{\text{SPAN}}(\text{true}|s)=\frac{2n_s-1}{N_s}$$
$$P_{\text{SPAN}}(\text{true}|\alpha,s)=\frac{1}{\#_s(\alpha)}\displaystyle\sum_{\langle i,j\rangle=\alpha}P_{\text{BRACKET}}(i,j,s)$$

Thus we find that
$$P_{\text{SPAN}}(\alpha|\text{true},s)=\frac{1}{\#_s(\alpha)}\displaystyle\sum_{\langle i,j\rangle=\alpha}P_{\text{BRACKET}}(i,j,s)\frac{\#_s(\alpha)}{N_s}\frac{N_s}{2n_s-1}=\frac{1}{2n_s-1}\displaystyle\sum_{\langle i,j\rangle=\alpha}P_{\text{BRACKET}}(i,j,s)$$

A similar calculation shows that $$P_{\text{CONTEXT}}(\beta|\text{true},s)=\frac{1}{2n_s-1}\displaystyle\sum_{\rangle i,j\langle=\beta}P_{\text{BRACKET}}(i,j,s)$$

Now let us consider $P_{\text{SPAN}}(\alpha|\text{false},s)$. Again by Bayes' Theorem we have $$P_{\text{SPAN}}(\alpha|\text{false},s)=\frac{P_{\text{SPAN}}(\text{false}|\alpha,s)P_{\text{SPAN}}(\alpha|s)}{P_{\text{SPAN}}(\text{false}|s)}$$

$P_{\text{SPAN}}(\alpha|s)$ is as before. $P_{\text{SPAN}}(\text{false}|s)=1-P_{\text{SPAN}}(\text{true}|s)$. And $P_{\text{SPAN}}(\text{false}|\alpha,s)=1-P_{\text{SPAN}}(\text{true}|\alpha,s)$. Thus
$$P_{\text{SPAN}}(\alpha|\text{false},s)=\frac{(1-\frac{1}{\#_s(\alpha)}\displaystyle\sum_{\langle i,j\rangle=\alpha}P_{\text{BRACKET}}(i,j,s))\frac{\#_s(\alpha)}{N_s}}{1-\frac{2n_s-1}{N_s}}$$$$=\frac{\#_s(\alpha)-\displaystyle\sum_{\langle i,j\rangle=\alpha}P_{\text{BRACKET}}(i,j,s)}{N_s-2n_s+1}$$

Similarly, we have $$P_{\text{CONTEXT}}(\beta|\text{false},s) = \frac{\#_s(\beta)-\displaystyle\sum_{\rangle i,j\langle=\beta}P_{\text{BRACKET}}(i,j,s)}{N_s-2n_s+1}$$

From these, we can calculate the new parameters $P_{\text{SPAN}}$ and $P_{\text{CONTEXT}}$ as above.
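
These per-sentence quantities amount to relative frequencies weighted by $P_{\text{BRACKET}}$. A minimal sketch, assuming a tag list \texttt{s} and a per-sentence function \texttt{p\_bracket(i, j)} (hypothetical names); the corpus-level parameters are then obtained by averaging the per-sentence distributions over $C$ as above:
\begin{verbatim}
from collections import defaultdict

def span_posteriors(s, p_bracket):
    """Per-sentence P_SPAN(alpha|true,s) and P_SPAN(alpha|false,s),
    following the formulas above; contexts are handled analogously."""
    n = len(s)
    N = (n + 1) * (n + 2) // 2        # number of spans, incl. empty ones
    bracket_sum = defaultdict(float)  # sum of P_BRACKET over <i,j>=alpha
    count = defaultdict(int)          # #_s(alpha)
    for i in range(n + 1):
        for j in range(i, n + 1):
            alpha = tuple(s[i:j])
            count[alpha] += 1
            bracket_sum[alpha] += p_bracket(i, j)
    p_true = {a: bracket_sum[a] / (2 * n - 1) for a in count}
    p_false = {a: (count[a] - bracket_sum[a]) / (N - 2 * n + 1)
               for a in count}
    return p_true, p_false
\end{verbatim}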

\subsection{Initialization}
We initialize the model with a distribution $P_{\text{SPLIT}}$ over trees. $P_{\text{SPLIT}}$ is calculated as follows: for a string of $n$ words, choose a split point at random, then recursively build trees on either side of that split by further random splitting. We initialize the model using $P_{\text{BRACKET}}(i,j,s)=P_{\text{SPLIT}}(i,j,n_s)$. However, we also apply smoothing, adding small constant counts $m_\text{true}$ and $m_\text{false}$ of each span and context as a constituent and distituent, respectively. $m_\text{true}$ should be smaller than $m_\text{false}$, as there are more distituent spans than constituent spans. However, \citet{klein-thesis} and \citet{km-ccm} give different values for these constants, and in any case they must vary in relation to the size of the corpus used. Let $\#_\alpha$ and $\#_\beta$ be the number of unique spans and contexts in the corpus. We initialize the model as follows (a sketch of how $P_{\text{SPLIT}}$ can be computed is given after the list):
\begin{itemize}
\item $P_{\text{SPAN}}(\alpha|\text{true})= \frac {(\displaystyle\sum_{s\in C}\displaystyle\sum_{\langle i,j\rangle=\alpha}P_{\text{SPLIT}}(i,j,n_s))+m_\text{true}}
{(\displaystyle\sum_{s\in C}2n_s-1)+m_\text{true} \#_\alpha}$
\item $P_{\text{SPAN}}(\alpha|\text{false})= \frac {(\displaystyle\sum_{s\in C}\displaystyle\sum_{\langle i,j\rangle=\alpha}(1-P_{\text{SPLIT}}(i,j,n_s)))+m_\text{false}}
{(\displaystyle\sum_{s\in C}N_s-(2n_s-1))+m_\text{false} \#_\alpha}$
\item $P_{\text{CONTEXT}}(\beta|\text{true})= \frac {(\displaystyle\sum_{s\in C}\displaystyle\sum_{\rangle i,j\langle=\beta}P_{\text{SPLIT}}(i,j,n_s))+m_\text{true}}
{(\displaystyle\sum_{s\in C}2n_s-1)+m_\text{true} \#_\beta}$
\item $P_{\text{CONTEXT}}(\beta|\text{false})= \frac {(\displaystyle\sum_{s\in C}\displaystyle\sum_{\rangle i,j\langle=\beta}(1-P_{\text{SPLIT}}(i,j,n_s)))+m_\text{false}}
{(\displaystyle\sum_{s\in C}N_s-(2n_s-1))+m_\text{false} \#_\beta}$
\end{itemize}
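
The split distribution itself can be computed recursively. Below is a minimal sketch of the process described above (choose a split point uniformly at random, then recurse on both sides); \texttt{p\_split(i, j, n)} then plays the role of $P_{\text{BRACKET}}(i,j,s)$ in the initial counts. The function names are ours:
\begin{verbatim}
from functools import lru_cache

@lru_cache(maxsize=None)
def _split_prob(i, j, a, b):
    """Probability that <i,j> occurs as a node in a random-split tree
    over the span <a,b>."""
    if (i, j) == (a, b):
        return 1.0
    if b - a <= 1 or i < a or j > b:
        return 0.0
    splits = range(a + 1, b)
    total = 0.0
    for k in splits:
        if j <= k:                       # <i,j> falls in the left part
            total += _split_prob(i, j, a, k)
        elif i >= k:                     # <i,j> falls in the right part
            total += _split_prob(i, j, k, b)
    return total / len(splits)

def p_split(i, j, n):
    """P_SPLIT(i, j, n): probability that the span from i to j is a
    constituent of a randomly split tree over n words."""
    return _split_prob(i, j, 0, n)
\end{verbatim}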

\subsection{Results}
We have not yet tested the CCM model on the Wall Street Journal corpus.

\section{A Dependency Model with Valence}
The DMV is a \emph{head-outward process}. Given a certain head (e.g.\ a
verb), we first attach possible arguments in one direction (e.g.\ nouns
to the right), then stop in that direction, then possibly attach
arguments in the other direction, then stop. The probabilities of a
certain head attaching any arguments in a direction sum to one
\citep[see~$P(D(h))$~in][p.~87]{klein-thesis} -- when modelled as a PCFG,
however \citep[p.~83,~figure~6.4]{klein-thesis}, the probabilities for
a given left-hand side symbol may sum to less than one.

In the following sections we try to flesh out the details of the
inside-outside algorithm \citep{lari-csl90} applied to the DMV model
of \citet{klein-thesis}. We have three parameters which are
reestimated using the inside-outside algorithm:

\begin{itemize}
\item $P_{ATTACH}(a|h,dir)$ is the probability of attaching an
  argument with tag $a$, given that we are attaching in direction
  $dir$ from head $h$. Here $a,h \in \mathbb{T},
  dir\in\{left,right\}$.
\item $P_{STOP}(stop|h,dir,adj)$ is the probability of stopping (which
  happens after argument attachment), given that we are attaching from
  head $h$ in direction $dir$. Adjacency $adj$ is whether or not we
  have attached anything yet in that direction.
\item $P_{ROOT}(h)$ is the probability of the special ROOT symbol
  heading $h$; in the original model this is modelled using a right
  stop, a left stop, and an attachment
  \citep[p.~105,~figure~A.2]{klein-thesis}; we conflated these
  into one additional parameter.
\end{itemize}

\newcommand{\LOC}[1]{\textbf{#1}}
\newcommand{\GOR}[1]{\overrightarrow{#1}}
\newcommand{\RGOL}[1]{\overleftarrow{\overrightarrow{#1}}}
\newcommand{\SEAL}[1]{\overline{#1}}
\newcommand{\LGOR}[1]{\overrightarrow{\overleftarrow{#1}}}
\newcommand{\GOL}[1]{\overleftarrow{#1}}
\newcommand{\LN}[1]{\underleftarrow{#1}}
\newcommand{\RN}[1]{\underrightarrow{#1}}
\newcommand{\XI}{\lessdot}
\newcommand{\XJ}{\gtrdot}
\newcommand{\SMTR}[1]{\dddot{#1}}
\newcommand{\SDTR}[1]{\ddot{#1}}

\subsection{Note on notation}
$i, j, k$ are sentence positions (between words), where $i$ and $j$
are always the start and end, respectively, for what we're calculating
($k$ is between $i$ and $j$ for $P_{INSIDE}$, to their right or left
for $P_{OUTSIDE}$). $s \in S$ are sentences in the corpus. $\LOC{w}$
is a word token (actually POS-token) of type $w$ at a certain sentence
location. If $\LOC{w}$ is between $i$ and $i+1$, then $loc(\LOC{w})=i$,
following \citet{klein-thesis}, meaning $i$ is adjacent to $\LOC{w}$
on the left, while $j=loc(\LOC{w})+1$ means that $j$ is adjacent to
$\LOC{w}$ on the right. To simplify, $loc_l(\LOC{w}):=loc(\LOC{w})$ and
$loc_r(\LOC{w}):=loc(\LOC{w})+1$. We write $\LOC{h}$ if this is a head
in the rule being used, $\LOC{a}$ if it is an attached argument.

There are some notational differences between the thesis
\citep{klein-thesis} and the ACL paper \citep{km-dmv}:

\begin{tabular}{cc}
Paper: & Thesis: \\
$w$ & $\GOR{w}$ \\
$w\urcorner$ & $\RGOL{w}$ \\
$\ulcorner{w}\urcorner$ & $\SEAL{w}$ \\
\end{tabular}

We use $\SMTR{w}$ (or $\SDTR{w}$) to signify any one of $w, \GOR{w},
\RGOL{w}, \LGOR{w}, \GOL{w}$ or $\SEAL{w}$\footnote{This means that
  $\SMTR{\LOC{w}}$ is the triplet of the actual POS-tag, its sentence
  location as a token, and the ``level of seals''.}.

\subsection{Inside probabilities}
$P_{INSIDE}$ is defined in \citet[pp.~106-108]{klein-thesis}; the only
thing we need to add is that for right attachments,
$i \leq loc_l(\LOC{h})<k \leq loc_l(\LOC{a})<j$, while for left attachments,
$i \leq loc_l(\LOC{a})<k \leq loc_l(\LOC{h})<j$.

(For now, let
\[ \forall{w}[P_{ORDER}(right\text{-}first|w)=1.0] \] since the DMV implementation
is not yet generalized to both directions.)

\subsubsection{Sentence probability}

$P_s$ is the sentence probability, based on
\citet[p.~38]{lari-csl90}. Since the ROOT rules are different from the
rest, we sum them explicitly in this definition:
\begin{align*}
  P_s = \sum_{\LOC{w} \in s} P_{ROOT}(\LOC{w}) P_{INSIDE}(\SEAL{\LOC{w}}, 0, len(s))
\end{align*}
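
As a sketch, with an assumed chart interface \texttt{inside\_seal(sentence, loc, i, j)} returning $P_{INSIDE}(\SEAL{\LOC{w}}, i, j)$ for the token at position \texttt{loc}:
\begin{verbatim}
def sentence_prob(sentence, p_root, inside_seal):
    """P_s: sum over word tokens of P_ROOT(w) times the inside
    probability of the sealed token spanning the whole sentence."""
    n = len(sentence)
    return sum(p_root[w] * inside_seal(sentence, loc, 0, n)
               for loc, w in enumerate(sentence))
\end{verbatim}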

\subsection{Outside probabilities}

\begin{align*}
  P_{OUTSIDE_s}(ROOT, i, j) = \begin{cases}
    1.0 & \text{ if $i = 0$ and $j = len(s)$,}\\
    0.0 & \text{ otherwise}
  \end{cases}
\end{align*}

For $P_{OUTSIDE}(\SEAL{w}, i, j)$, $w$ is attached under something
else ($\SEAL{w}$ is what we elsewhere call $\SEAL{a}$). Adjacency is
thus calculated on the basis of $h$, the head of the rule. If $\LOC{w}$ is
attached to from the left we have $i \leq loc_l(\LOC{w}) < j \leq loc_l(\LOC{h}) < k$, while
from the right we have $k \leq loc_l(\LOC{h}) < i \leq loc_l(\LOC{w}) < j$:
\begin{align*}
  P_{OUTSIDE}&(\SEAL{\LOC{w}}, i, j) = \\
  & P_{ROOT}(w) P_{OUTSIDE}(ROOT, i, j) + \\
  & [ \sum_{k > j} ~ \sum_{\LOC{h}:j\leq loc_l(\LOC{h})<k} \sum_{\SMTR{\LOC{h}} \in \{\RGOL{\LOC{h}},\GOL{\LOC{h}}\}} P_{STOP}(\neg stop|h, left, adj(j, \LOC{h})) P_{ATTACH}(w|h, left) \\
  & \qquad \qquad \qquad \qquad \qquad P_{OUTSIDE}(\SMTR{\LOC{h}}, i, k) P_{INSIDE}(\SMTR{\LOC{h}}, j, k) ] ~ + \\
  & [ \sum_{k < i} ~ \sum_{\LOC{h}:k\leq loc_l(\LOC{h})<i} \sum_{\SMTR{\LOC{h}} \in \{\LGOR{\LOC{h}},\GOR{\LOC{h}}\}} P_{STOP}(\neg stop|h, right, adj(i, \LOC{h})) P_{ATTACH}(w|h, right) \\
  & \qquad \qquad \qquad \qquad \qquad P_{INSIDE}(\SMTR{\LOC{h}}, k, i) P_{OUTSIDE}(\SMTR{\LOC{h}}, k, j) ]
\end{align*}

For $\RGOL{w}$ we know it is either under a left stop rule or it is
the right daughter of a left attachment rule ($k \leq loc_l(\LOC{a}) <
i \leq loc_l(\LOC{w}) < j$), and these are adjacent if the start point
($i$) equals $loc_l(\LOC{w})$:
\begin{align*}
  P_{OUTSIDE}(\RGOL{\LOC{w}}, i, j) = & P_{STOP}(stop|w, left, adj(i,
  \LOC{w}))P_{OUTSIDE}(\SEAL{\LOC{w}}, i, j) ~ + \\
  & [ \sum_{k < i} ~ \sum_{\LOC{a}:k\leq loc_l(\LOC{a})<i} P_{STOP}(\neg stop|w, left, adj(i, \LOC{w})) P_{ATTACH}(a|w, left) \\
  & ~~~~~~~~~~~~~~~~~~~~~~~~~ P_{INSIDE}(\SEAL{\LOC{a}}, k, i) P_{OUTSIDE}(\RGOL{\LOC{w}}, k, j) ]
\end{align*}

For $\GOR{w}$ we are either under a right stop or the left daughter of
a right attachment rule ($i \leq loc_l(\LOC{w}) < j \leq
loc_l(\LOC{a}) < k$), adjacent iff the end point ($j$) equals
$loc_r(\LOC{w})$:
\begin{align*}
  P_{OUTSIDE}(\GOR{\LOC{w}}, i, j) = & P_{STOP}(stop|w, right, adj(j,
  \LOC{w}))P_{OUTSIDE}(\RGOL{\LOC{w}}, i, j) ~ + \\
  & [ \sum_{k > j} ~ \sum_{\LOC{a}:j\leq loc_l(\LOC{a})<k} P_{STOP}(\neg stop|w, right, adj(j, \LOC{w})) P_{ATTACH}(a|w, right) \\
  & ~~~~~~~~~~~~~~~~~~~~~~~~~ P_{OUTSIDE}(\GOR{\LOC{w}}, i, k) P_{INSIDE}(\SEAL{\LOC{a}}, j, k) ]
\end{align*}

$\GOL{w}$ is just like $\RGOL{w}$, except for the outside probability
of having a stop above, where we use $\LGOR{w}$:
\begin{align*}
  P_{OUTSIDE}(\GOL{\LOC{w}}, i, j) = & P_{STOP}(stop|w, left, adj(i,
  \LOC{w}))P_{OUTSIDE}(\LGOR{\LOC{w}}, i, j) ~ + \\
  & [ \sum_{k < i} ~ \sum_{\LOC{a}:k\leq loc_l(\LOC{a})<i} P_{STOP}(\neg stop|w, left, adj(i, \LOC{w})) P_{ATTACH}(a|w, left) \\
  & ~~~~~~~~~~~~~~~~~~~~~~~~~ P_{INSIDE}(\SEAL{\LOC{a}}, k, i) P_{OUTSIDE}(\GOL{\LOC{w}}, k, j) ]
\end{align*}

$\LGOR{w}$ is just like $\GOR{w}$, except for the outside probability
of having a stop above, where we use $\SEAL{w}$:
\begin{align*}
  P_{OUTSIDE}(\LGOR{\LOC{w}}, i, j) = & P_{STOP}(stop|w, right, adj(j,
  \LOC{w}))P_{OUTSIDE}(\SEAL{\LOC{w}}, i, j) ~ + \\
  & [ \sum_{k > j} ~ \sum_{\LOC{a}:j\leq loc_l(\LOC{a})<k} P_{STOP}(\neg stop|w, right, adj(j, \LOC{w})) P_{ATTACH}(a|w, right) \\
  & ~~~~~~~~~~~~~~~~~~~~~~~~~ P_{OUTSIDE}(\LGOR{\LOC{w}}, i, k) P_{INSIDE}(\SEAL{\LOC{a}}, j, k) ]
\end{align*}

\subsection{Reestimating the rules}

\subsubsection{$c$ and $w$ (helper formulas used below)}
$c_s(\SMTR{\LOC{w}} : i, j)$ is ``the expected fraction of parses of
$s$ with a node labeled $\SMTR{w}$ extending from position $i$ to
position $j$'' \citep[p.~88]{klein-thesis}, here defined to equal
$v_{q}$ of \citet[p.~41]{lari-csl90}\footnote{In terms of regular EM,
  this is the count of trees ($f_{T_q}(x)$ in
  \citet[p.~46]{prescher-em}) in which the node extended from $i$ to
  $j$.}:
\begin{align*}
  c_s(\SMTR{\LOC{w}} : i, j) = P_{INSIDE_s}(\SMTR{\LOC{w}}, i, j) P_{OUTSIDE_s}(\SMTR{\LOC{w}}, i, j) / P_s
\end{align*}

$w_s$ is $w_{q}$ from \citet[p.~41]{lari-csl90}, generalized to $\SMTR{h}$ and $dir$:
\begin{align*}
  w_s(\SEAL{a} & : \SMTR{\LOC{h}}, left, i, j) = \\
  & 1/P_s \sum_{k:i<k<j} ~ \sum_{\LOC{a}:i\leq loc_l(\LOC{a})<k}
  & P_{STOP}(\neg stop|h, left, adj(k, \LOC{h})) P_{ATTACH}(a|h, left) \\
  & & P_{INSIDE_s}(\SEAL{\LOC{a}}, i, k) P_{INSIDE_s}(\SMTR{\LOC{h}}, k, j) P_{OUTSIDE_s}(\SMTR{\LOC{h}}, i, j)
\end{align*}
\begin{align*}
  w_s(\SEAL{a} & : \SMTR{\LOC{h}}, right, i, j) = \\
  & 1/P_s \sum_{k:i<k<j} ~ \sum_{\LOC{a}:k\leq loc_l(\LOC{a})<j}
  & P_{STOP}(\neg stop|h, right, adj(k, \LOC{h})) P_{ATTACH}(a|h, right) \\
  & & P_{INSIDE_s}(\SMTR{\LOC{h}}, i, k) P_{INSIDE_s}(\SEAL{\LOC{a}}, k, j) P_{OUTSIDE_s}(\SMTR{\LOC{h}}, i, j)
\end{align*}
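
In code, $c_s$ is a single chart lookup and $w_s$ a double sum. The following sketch assumes \texttt{inside}, \texttt{outside}, \texttt{p\_s}, \texttt{p\_stop}, \texttt{p\_attach} and \texttt{adj} interfaces of our own invention; only the left case of $w_s$ is shown, the right case being symmetric:
\begin{verbatim}
def c(inside, outside, p_s, node, i, j):
    """c_s(node : i, j): expected fraction of parses with the given
    node spanning i..j."""
    return inside(node, i, j) * outside(node, i, j) / p_s

def w_left(s, a, h_node, i, j, inside, outside, p_s,
           p_stop, p_attach, adj):
    """w_s(seal(a) : h_node, left, i, j), following the formula above.
    h_node is a (tag, location, seal-level) triple for the head."""
    h = h_node[0]                       # POS tag of the head
    total = 0.0
    for k in range(i + 1, j):
        for loc_a, tag in enumerate(s):
            if tag != a or not (i <= loc_a < k):
                continue
            a_node = (a, loc_a, 'sealed')
            total += ((1 - p_stop[(h, 'left', adj(k, h_node))])
                      * p_attach[(a, h, 'left')]   # P(-stop)*P(attach)
                      * inside(a_node, i, k) * inside(h_node, k, j)
                      * outside(h_node, i, j))
    return total / p_s
\end{verbatim}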

Let $\hat{P}$ be the new STOP/ATTACH-probabilities (since the old $P$
are used in $P_{INSIDE}$ and $P_{OUTSIDE}$).

\subsubsection{Attachment reestimation}

This is based on $\hat{a}$, given in \citet[p.~41]{lari-csl90}. Here
$i<loc_l(\LOC{h})$ since we want trees with at least one attachment:
\begin{align*}
  \hat{P}_{ATTACH} (a | h, left) = \frac
  { \sum_{s \in S} \sum_{\LOC{h} \in s} \sum_{i<loc_l(\LOC{h})} \sum_{j\geq loc_r(\LOC{h})} [w_s(\SEAL{a} : \GOL{\LOC{h}}, left, i, j) + w_s(\SEAL{a} : \RGOL{\LOC{h}}, left, i, j) ]}
  { \sum_{s \in S} \sum_{\LOC{h} \in s} \sum_{i<loc_l(\LOC{h})} \sum_{j\geq loc_r(\LOC{h})} [c_s(\GOL{\LOC{h}} : i, j) + c_s(\RGOL{\LOC{h}} : i, j) ]}
\end{align*}

Here $j>loc_r(\LOC{h})$ since we want at least one attachment:
\begin{align*}
  \hat{P}_{ATTACH} (a | h, right) = \frac
  { \sum_{s \in S} \sum_{\LOC{h} \in s} \sum_{i\leq loc_l(\LOC{h})} \sum_{j>loc_r(\LOC{h})} [w_s(\SEAL{a} : \LGOR{\LOC{h}}, right, i, j) + w_s(\SEAL{a} : \GOR{\LOC{h}}, right, i, j) ]}
  { \sum_{s \in S} \sum_{\LOC{h} \in s} \sum_{i\leq loc_l(\LOC{h})} \sum_{j>loc_r(\LOC{h})} [c_s(\LGOR{\LOC{h}} : i, j) + c_s(\GOR{\LOC{h}} : i, j) ]}
\end{align*}
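
A sketch of the left-attachment update, assuming hypothetical \texttt{w(s, a, h\_node, i, j)} and \texttt{c(s, h\_node, i, j)} wrappers around the helpers above; the right-attachment update differs only in the seal levels and span bounds:
\begin{verbatim}
from collections import defaultdict

def reestimate_attach_left(corpus, tags, w, c):
    """New P_ATTACH(a | h, left) as the ratio above.  A head node is
    written (tag, location, seal), with seal in {'<-', '<->'} standing
    for the two half-sealed forms that can still take left arguments."""
    num = defaultdict(float)
    den = defaultdict(float)
    for s in corpus:
        n = len(s)
        for loc_h, h in enumerate(s):
            for seal in ('<-', '<->'):
                h_node = (h, loc_h, seal)
                for i in range(0, loc_h):              # i < loc_l(h)
                    for j in range(loc_h + 1, n + 1):  # j >= loc_r(h)
                        den[h] += c(s, h_node, i, j)
                        for a in tags:
                            num[(a, h)] += w(s, a, h_node, i, j)
    return {(a, h): num[(a, h)] / den[h]
            for (a, h) in num if den[h] > 0}
\end{verbatim}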

For the first/lowest attachments, $w_s$ and $c_s$ have zero probability
where $i<loc_l(\LOC{h})$ (for $\GOR{h}$) or $j>loc_r(\LOC{h})$ (for $\GOL{h}$);
this is implicit in $P_{INSIDE}$.

\subsubsection{Stop reestimation}
The following is based on \citet[p.~88]{klein-thesis}. For the
non-adjacent rules, $i<loc_l(\LOC{h})$ on the left and
$j>loc_r(\LOC{h})$ on the right, while for the adjacent rules these
are equal (respectively)\footnote{For left-stopping with right-first
  attachments, $j \geq loc_r(\LOC{h})$ since we may have
  right-attachments below, for left-stopping with left-first
  attachments we only need $j=loc_r(\LOC{h})$.}.

To avoid some redundancy below, define a helper function $\hat{d}$ as follows:
\begin{align*}
  \hat{d}(\SMTR{h},\XI,\XJ) =
  { \sum_{s \in S} \sum_{\SMTR{\LOC{h}}:\LOC{h} \in s} \sum_{i:i \XI loc_l(\LOC{h})} \sum_{j:j \XJ loc_r(\LOC{h})} c_s(\SMTR{\LOC{h}} : i, j) }
\end{align*}

Then these are our reestimated stop probabilities:
\begin{align*}
  \hat{P}_{STOP} (STOP|h, left, non\text{-}adj) = \frac
  { \hat{d}(\SEAL{h},<,\geq) + \hat{d}(\LGOR{h},<,=) }
  { \hat{d}(\RGOL{h},<,\geq) + \hat{d}(\GOL{h},<,=) }
\end{align*}

\begin{align*}
  \hat{P}_{STOP} (STOP|h, left, adj) = \frac
  { \hat{d}(\SEAL{h},=,\geq) + \hat{d}(\LGOR{h},=,=) }
  { \hat{d}(\RGOL{h},=,\geq) + \hat{d}(\GOL{h},=,=) }
\end{align*}

\begin{align*}
  \hat{P}_{STOP} (STOP|h, right, non\text{-}adj) = \frac
  { \hat{d}(\RGOL{h},=,>) + \hat{d}(\SEAL{h},\leq,>) }
  { \hat{d}(\GOR{h},=,>) + \hat{d}(\LGOR{h},\leq,>) }
\end{align*}

\begin{align*}
  \hat{P}_{STOP} (STOP|h, right, adj) = \frac
  { \hat{d}(\RGOL{h},=,=) + \hat{d}(\SEAL{h},\leq,=) }
  { \hat{d}(\GOR{h},=,=) + \hat{d}(\LGOR{h},\leq,=) }
\end{align*}
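
The helper $\hat{d}$ and the first of these updates, as a sketch with assumed seal-level names and a $c_s$ wrapper as above; the comparison operators are passed in, mirroring $\XI$ and $\XJ$:
\begin{verbatim}
import operator as op

def d_hat(corpus, h, seal, cmp_i, cmp_j, c):
    """Sum of c_s(node : i, j) over all tokens of h and all spans with
    i cmp_i loc_l(h) and j cmp_j loc_r(h)."""
    total = 0.0
    for s in corpus:
        n = len(s)
        for loc_h, tag in enumerate(s):
            if tag != h:
                continue
            node = (h, loc_h, seal)
            for i in range(0, n + 1):
                if not cmp_i(i, loc_h):
                    continue
                for j in range(i, n + 1):
                    if cmp_j(j, loc_h + 1):
                        total += c(s, node, i, j)
    return total

def p_stop_left_nonadj(corpus, h, c):
    """First stop update above (left, non-adjacent)."""
    num = (d_hat(corpus, h, 'sealed', op.lt, op.ge, c) +
           d_hat(corpus, h, 'lgor',   op.lt, op.eq, c))
    den = (d_hat(corpus, h, 'rgol',   op.lt, op.ge, c) +
           d_hat(corpus, h, 'gol',    op.lt, op.eq, c))
    return num / den if den else 0.0
\end{verbatim}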

\subsubsection{Root reestimation}
Following \citet[p.~46]{prescher-em}, to find the reestimated
probability of a PCFG rule, we first find the new treebank frequencies
$f_{T_P}(tree)=P(tree)/P_s$, then for a rule $X' \rightarrow X$ we
divide the new frequencies of the trees which use this rule by
those of the trees containing the node $X'$. $ROOT$ appears once per
tree, meaning we divide by $1$ per sentence\footnote{Assuming each
  tree has frequency $1$.}, so $\hat{P}_{ROOT}(h)=\sum_{tree:ROOT
  \rightarrow \SEAL{h} \text{ used in } tree} f_{T_P}(tree)=\sum_{tree:ROOT
  \rightarrow \SEAL{h} \text{ used in } tree} P(tree)/P_s$, which turns into:

\begin{align*}
  \hat{P}_{ROOT} (h) = \frac
  {\sum_{s\in S} 1 / P_s \cdot \sum_{\LOC{h}\in s} P_{ROOT}(\LOC{h}) P_{INSIDE_s}(\SEAL{h}, 0, len(s))}
  {\sum_{s\in S} 1}
\end{align*}
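
In code this is a single pass over the corpus; the sketch below reuses the assumed \texttt{sentence\_prob} and \texttt{inside\_seal} interfaces from earlier:
\begin{verbatim}
from collections import defaultdict

def reestimate_root(corpus, p_root, inside_seal, sentence_prob):
    """New P_ROOT(h): per sentence, accumulate
    P_ROOT(w) * P_INSIDE(sealed w, 0, n) / P_s for each token, then
    divide by the number of sentences."""
    counts = defaultdict(float)
    for s in corpus:
        n = len(s)
        p_s = sentence_prob(s, p_root, inside_seal)
        for loc, w in enumerate(s):
            counts[w] += p_root[w] * inside_seal(s, loc, 0, n) / p_s
    return {h: v / len(corpus) for h, v in counts.items()}
\end{verbatim}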

\subsection{Alternate CNF-like rules}
Since the IO algorithm as described in \citet{lari-csl90} is made for
rules in Chomsky Normal Form (CNF), we have an alternate grammar
(Table~\ref{tab:cnf-like}) for running tests, where we don't have to sum
over the different $loc(h)$ in IO. This is not yet generalized to
include left-first attachment. It is also not quite CNF, since it
includes some unary rewrite rules.

\begin{table*}[ht]
\centering
\begin{tabular} % four left-aligned math tabs, one vertical line
  { >{$}l<{$} >{$}l<{$} >{$}l<{$} | >{$}l<{$} }
  \multicolumn{3}{c}{Rule} & \multicolumn{1}{c}{$P_{RULE}$ ($a[i,j,k]$ in \citet{lari-csl90})}\\
\hline{}

  \RN{\GOR{h}} \rightarrow& \GOR{h} &\SEAL{a} &P_{STOP}(\neg stop|h, right, adj) \cdot P_{ATTACH}(a|h, right) \\
  &&&\\
  \RN{\GOR{h}} \rightarrow& \RN{\GOR{h}} &\SEAL{a} &P_{STOP}(\neg stop|h, right, non\text{-}adj) \cdot P_{ATTACH}(a|h, right) \\
  &&&\\
  \RGOL{h} \rightarrow& \GOR{h} &STOP &P_{STOP}(stop|h, right, adj) \\
  &&&\\
  \RGOL{h} \rightarrow& \RN{\GOR{h}} &STOP &P_{STOP}(stop|h, right, non\text{-}adj) \\
  &&&\\
  \LN{\RGOL{h}} \rightarrow& \SEAL{a} &\RGOL{h} &P_{STOP}(\neg stop|h, left, adj) \cdot P_{ATTACH}(a|h, left) \\
  &&&\\
  \LN{\RGOL{h}} \rightarrow& \SEAL{a} &\LN{\RGOL{h}} &P_{STOP}(\neg stop|h, left, non\text{-}adj) \cdot P_{ATTACH}(a|h, left) \\
  &&&\\
  \SEAL{h} \rightarrow& STOP &\RGOL{h} &P_{STOP}(stop|h, left, adj) \\
  &&&\\
  \SEAL{h} \rightarrow& STOP &\LN{\RGOL{h}} &P_{STOP}(stop|h, left, non\text{-}adj) \\
\end{tabular}
\caption{Alternate CFG rules (where a child node has an arrow below,
  we use non-adjacent probabilities), defined for all words/POS-tags
  $h$.}\label{tab:cnf-like}
\end{table*}

The inside probabilities are the same as those given in
\citet{lari-csl90}, with the following exceptions:

When calculating $P_{INSIDE}(\SMTR{h}, i, j)$ and summing through
possible rules which rewrite $\SMTR{h}$, if a rule is of the form
$\SMTR{h} \rightarrow STOP ~ \SDTR{h}$ or $\SMTR{h} \rightarrow
\SDTR{h} ~ STOP$, we add $P_{RULE}\cdot P_{INSIDE}(\SDTR{h}, i, j)$
(that is, rewrite for the same sentence range); and, as a consequence
of these unary rules: for ``terminal rules'' ($P_{ORDER}$) to be
applicable, not only must $i = j-1$, but also the left-hand side
symbol of the rule must be of the form $\GOR{h}$.

Similarly, the outside probabilities are the same as those for pure
CNF rules, with the exception that we add the unary rewrite
probabilities
\begin{align*}
  \sum_{\SMTR{h}} [&P_{OUTSIDE}(\SMTR{h},i,j)\cdot P_{RULE}(\SMTR{h} \rightarrow \SDTR{h} ~ STOP) \\
  + &P_{OUTSIDE}(\SMTR{h},i,j)\cdot P_{RULE}(\SMTR{h} \rightarrow STOP ~ \SDTR{h})]
\end{align*}
to $P_{OUTSIDE}(\SDTR{h},i,j)$ (e.g.\ $f(s,t,i)$).

This grammar gave the same inside and outside probabilities as the
original DMV rules when run over our corpus.

\subsection{Initialization}
\citet{klein-thesis} describes DMV initialization using a ``harmonic
distribution'' for the initial probabilities, where the probability of
one word heading another is higher if they appear closer to one
another.

There are several ways this could be implemented. We initialized attachment
probabilities with the following formula:

\begin{align*}
  P_{ATTACH}(a|h,right) = \frac
  {\sum_{s \in S}\sum_{\LOC{h} \in s} \sum_{\LOC{a} \in s:loc(\LOC{a})>loc(\LOC{h})} 1/(loc(\LOC{a})-loc(\LOC{h})) + C_A}
  {\sum_{s \in S}\sum_{\LOC{h} \in s} \sum_{\LOC{w} \in s:loc(\LOC{w})>loc(\LOC{h})} 1/(loc(\LOC{w})-loc(\LOC{h})) + C_A}
\end{align*}

The probability of stopping adjacently (left or right) was increased
whenever a word occurred at a (left or right) sentence
border\footnote{For non-adjacent stopping we checked for occurrence at
  the second(-to-last) position.}:

\begin{align*}
  f(stop:\LOC{h},left,adj)=\begin{cases}
    C_S \text{, if } loc(\LOC{h}) = 0,\\
    0 \text{, otherwise}
  \end{cases}
\end{align*}

\begin{align*}
  P_{STOP}(stop|h,left,adj) = \frac
  {C_{M} + \sum_{s \in S}\sum_{\LOC{h} \in s} f(stop:\LOC{h},left,adj)}
  {C_{M} + \sum_{s \in S}\sum_{\LOC{h} \in s} C_S+C_N}
\end{align*}
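
The whole initialization fits in one pass over the corpus. Below is a sketch of the right-attachment and adjacent-left-stop cases (the remaining cases are analogous); the dictionary layout is ours:
\begin{verbatim}
from collections import defaultdict

def harmonic_init(corpus, C_A, C_M, C_S, C_N):
    """Initial P_ATTACH(.|., right) and P_STOP(stop|., left, adj),
    following the formulas above: attachments weighted by 1/distance
    (plus C_A), adjacent left stops boosted for sentence-initial
    tokens."""
    attach_num = defaultdict(float)  # (a, h) -> harmonic count
    attach_den = defaultdict(float)  # h -> harmonic count
    stop_num = defaultdict(float)    # h -> summed f(stop : h, left, adj)
    stop_den = defaultdict(float)
    for s in corpus:
        for loc_h, h in enumerate(s):
            for loc_a in range(loc_h + 1, len(s)):
                weight = 1.0 / (loc_a - loc_h)
                attach_num[(s[loc_a], h)] += weight
                attach_den[h] += weight
            stop_num[h] += C_S if loc_h == 0 else 0.0
            stop_den[h] += C_S + C_N
    p_attach_right = {(a, h): (attach_num[(a, h)] + C_A)
                              / (attach_den[h] + C_A)
                      for (a, h) in attach_num}
    p_stop_left_adj = {h: (C_M + stop_num[h]) / (C_M + stop_den[h])
                       for h in stop_den}
    return p_attach_right, p_stop_left_adj
\end{verbatim}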

Tweaking the initialization constants $C_A, C_M, C_S$ and $C_N$
allowed us to easily try different initial distributions (e.g.\ to modify
the initial grammar in the direction of a uniform distribution). The
next section discusses the effect of these constants.

\subsection{TODO: Results}
We compared the results of the implementation with a dependency parsed
version of the WSJ-10 corpus (converted from the manually annotated
version). Since single-word sentences were not POS-tagged there, these
were skipped. Also, the dependency parsed WSJ-10 did not have ROOT
nodes, so we calculated precision and recall both without counting our
ROOT links, and with counting ROOT links, by adding these to the gold
parses where possible\footnote{221 parses in the dependency parsed
  WSJ-10 had several tokens appearing as heads without appearing as
  dependents in the same parse; here we skipped those parses when
  calculating with ROOT links.}.

\begin{table*}[hb]
\centering
\begin{tabular}{l|ccc}
  Model & P & R & F1 \\
\hline
  LBRANCH/RHEAD & 25.6 & 32.6 & 28.7 \\
  RANDOM & 31.0 & 39.4 & 34.7 \\
  RBRANCH/LHEAD & 55.1 & 70.0 & 61.7 \\
  K\&M's DMV & 46.6 & 59.2 & 52.1 \\
  Our DMV: \\
  $C_A=0; C_S=1;C_N=0.1;C_M=1$ & 23.7 (23.7) & 24.8 (24.5) & 24.2 (24.1) \\
  $C_A=0; C_S=1;C_N=0.1;C_M=10$ & 26.7 (31.6) & 25.5 (30.5) & 26.1 (31.0) \\
  $C_A=10;C_S=1;C_N=0.1;C_M=10$ & 24.5 (29.6) & 25.6 (30.6) & 25.0 (30.1) \\
  $C_A=10;C_S=1;C_N=3 ;C_M=10$ & 24.9 (30.0) & 26.0 (31.0) & 25.5 (30.5) \\
  $C_A=15;C_S=3;C_N=1 ;C_M=20$ & 25.6 (30.6) & 26.7 (31.7) & 26.2 (31.2) \\
\end{tabular}
\caption{DMV results on the WSJ-10 for various initialization values
  (numbers in parentheses are when counting added ROOT links)}
\label{tab:dmv-wsj}
\end{table*}

We tried various values for the initialization constants, but it was
hard to find any clear pattern for what worked best.
%$C_N$ values of 10 or less worked well,
% todo!!

Table \ref{tab:dmv-wsj} shows the results of running our
implementation on the WSJ-10 corpus, compared with the dependency
parsed version, for given values of the initialization constants. The
EM hilltop seemed to be reached after about 30 iterations for our best
results (sooner for the worse results). As the table indicates
(especially the numbers for ``rooted'' evaluation), different initial
settings can lead to very different results\footnote{One alternative
  idea we still haven't tried is basing initial DMV stopping
  probabilities on \emph{distance} from sentence start/end, instead of
  the ``binary'' test given above.}.

The numbers are not completely comparable though, since our corpus
contained 6268 sentences, while \citet{km-dmv} report 7422 sentences;
also it is not clear whether their DMV experiment was run using
automatically induced word classes \citep[Schütze, 1995, in][p.~8]{km-dmv}
or on the tagset used to manually annotate the WSJ-10.

In the stochastic grammars given by the model after EM, the POS class
$VBD$ (past tense verb) had the highest probability of being attached
to by ROOT, followed by $VBZ$ (3sg present) and $VBP$ (non-3sg
present), which looks promising. Interestingly, in fourth place we see
the tag $:$, and in sixth, $``$. Results might have been different
with such tokens stripped from the corpus and the evaluation standard;
we haven't tried this yet\footnote{A ROOT attachment to $:$ does make
  sense though, in the same way that a comma might be the head of a
  conjunction. Our evaluation standard almost always
  % 97 per cent of the time..
  puts colons as dependents of the phrase head (often an NP
  fragment).}.

Just like \citet[p.~89]{klein-thesis}, our algorithm chose determiners
to be the heads of nouns, rather than the opposite.
% Also, in general
% non-adjacent stopping was a lot more likely than adjacent stopping, a
% situation given by the DMV model.

% \section{The combined model (?)}
% \subsection{Results (?)}

\section{Conclusion}
From our implementations it seems that these processes capture some of
the statistical regularities in language, e.g.\ the strong likelihood of
verbal sentence heads; however, our evaluations show that we did not
manage to replicate Klein \& Manning's parsing results, let alone
improve on them.

\nocite{lari-csl90}
\nocite{klein-thesis}
\nocite{km-dmv}
\bibliography{./statistical.bib}
\bibliographystyle{plainnat}

\end{document}