% merged Kevin's & Emily's reports 2008-09-21 Sun 19:04 with Emerge

\documentclass[11pt,a4paper]{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{hyperref}
\usepackage{natbib}

\usepackage{pslatex}
\usepackage{pdfsync}
\pdfoutput=1

\usepackage{qtree}
\usepackage{amsmath}
\usepackage{amssymb}

\usepackage{avm}
\avmfont{\sc}
\avmoptions{sorted,active}
\avmvalfont{\rm}
\avmsortfont{\scriptsize\it}

\usepackage{array} % for a tabular with math in it

\title{The DMV and CCM models}
\author{Emily Morgan \& Kevin Brubeck Unhammer}
\date{15 September 2008}

\begin{document}

\maketitle

\tableofcontents
\section{Introduction}
\citet{klein-thesis} describes two models for the unsupervised
learning of syntactic structure. One is the Constituent Context Model
(CCM), which uses the context as well as the content of a substring to determine whether it
is a constituent. The context of a substring is defined to be the words
immediately preceding and following the substring. The intuition is that context can be a reliable cue for constituentness because when a long phrase is a constituent, there is often a similar case where a single word will serve as a constituent in the same context. Thus, to determine the
likelihood that a substring is a constituent, this model considers both
the likelihood of the substring itself as a constituent and the
likelihood of its context as the context of a constituent.

The other is a Dependency Model with Valence (DMV), which uses a form
of Dependency Grammar. Here sentence structure is represented as pairs
of heads and dependents (arguments), although it is also possible to
represent this as a CFG (see \citet[p.~106]{klein-thesis}).

Apart from the identity of the argument, probabilities are conditioned on
the direction of attachment. A valence condition is introduced by using
stopping probabilities: a head may take zero, one or more dependents
before ``stopping'', and there are different probabilities for
stopping without taking arguments (\emph{adjacently}) versus after
taking one or more (\emph{non-adjacently}).

Versions of the inside-outside algorithm for PCFGs \citep{lari-csl90}
are used for running expectation maximization on both these probabilistic
models. We tried implementing both the CCM and the DMV. Below we give
the details of how this was done, along with the results.
\section{The Constituent Context Model}
\subsection{Basics}
The CCM model \citep{klein-thesis, km-ccm} is a generative model over a set of Part Of Speech (POS) tags $\mathbb{T}$ and a beginning/end of sentence symbol $\diamond$.
The CCM model has two parameters:
\begin{itemize}
\item $P_{SPAN}(\alpha | c)$ is the probability of generating a span $\alpha$ given that it is/is not a constituent; $\alpha \in \mathbb{T}^*, c \in \{\text{true}, \text{false}\}$.
\item $P_{CONTEXT}(\beta | c)$ is the probability of generating a context $\beta$ given that the span it surrounds is/is not a constituent; $\beta \in (\mathbb{T}\cup\diamond)^2,c \in \{\text{true}, \text{false}\}$.
\end{itemize}
The probability of a sentence $s$ with bracketing $B$ is the probability of that bracketing times the probability of the span and the context for each substring of $s$. The probability of a bracketing is given by the fixed distribution $P_\text{bin}$, which assigns equal weight to all bracketings that correspond to binary trees. For a sentence $s$ of length $n$, we index the spaces between words, including the leading space ($0$) and the trailing space ($n$). Let $\langle i,j\rangle_s$ be the substring from $i$ to $j$, and $\rangle i,j\langle_s$ be the context of that substring. Then we have: $$P(s,B)=P_{\text{bin}}(B)\displaystyle\prod_{\{\langle i,j\rangle_s\}}P_{\text{SPAN}}(\langle i,j\rangle_s|B_{ij})P_{\text{CONTEXT}}(\rangle i,j\langle_s|B_{ij})$$
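
As an illustration of the span and context bookkeeping (not of the model itself), the following Python sketch enumerates every span of a POS-tagged sentence together with its context. The boundary constant stands in for $\diamond$, and the helper name is our own:

\begin{verbatim}
BOUNDARY = '<>'   # stands in for the diamond boundary symbol

def spans_and_contexts(tags):
    """Enumerate all spans <i,j> of a POS sequence with their contexts >i,j<."""
    n = len(tags)
    result = []
    for i in range(n + 1):
        for j in range(i, n + 1):
            span = tuple(tags[i:j])
            left = tags[i - 1] if i > 0 else BOUNDARY   # tag just before the span
            right = tags[j] if j < n else BOUNDARY      # tag just after the span
            result.append(((i, j), span, (left, right)))
    return result

# For the tag sequence DT NN VBD, the span <0,2> is ('DT', 'NN')
# and its context >0,2< is (BOUNDARY, 'VBD').
\end{verbatim}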
\subsection{Reestimation}
We use the EM algorithm to induce our model parameters. Given the parameters $P_{\text{SPAN}}$ and $P_{\text{CONTEXT}}$, we wish to first find the likelihoods of completions for all sentences in a corpus given the current parameter values, and then find new parameters that maximize the corpus likelihood given the previously predicted completions. Because enumerating all bracketings for all sentences is computationally infeasible, we use a dynamic programming algorithm similar to the inside-outside algorithm.

We begin by rewriting the probability $P(s,B)$ given above:
$$P(s,B)=P_{\text{bin}}(B)\displaystyle\prod_{\{\langle i,j\rangle_s\}}P_{\text{SPAN}}(\langle i,j\rangle_s|\text{false})P_{\text{CONTEXT}}(\rangle i,j\langle_s|\text{false}) \times$$$$ \prod_{\{\langle i,j\rangle_s\in B\}}\frac{P_{\text{SPAN}}(\langle i,j\rangle_s|\text{true})P_{\text{CONTEXT}}(\rangle i,j\langle_s|\text{true})}{P_{\text{SPAN}}(\langle i,j\rangle_s|\text{false})P_{\text{CONTEXT}}(\rangle i,j\langle_s|\text{false})}$$
We can then condense this by letting $$\phi(i,j,s)=\frac{P_{\text{SPAN}}(\langle i,j\rangle_s|\text{true})P_{\text{CONTEXT}}(\rangle i,j\langle_s|\text{true})}{P_{\text{SPAN}}(\langle i,j\rangle_s|\text{false})P_{\text{CONTEXT}}(\rangle i,j\langle_s|\text{false})}$$
Further, note that $P_{\text{bin}}$ is constant for all bracketings of a sentence. We can thus also write $$K(s)=P_{\text{bin}}(B)\displaystyle\prod_{\{\langle i,j\rangle_s\}}P_{\text{SPAN}}(\langle i,j\rangle_s|\text{false})P_{\text{CONTEXT}}(\rangle i,j\langle_s|\text{false})$$ and $K(s)$ is constant for a given sentence.
We thus have a condensed formulation of $P(s,B)$:
$$P(s,B)=K(s) \prod_{\{\langle i,j\rangle_s\in B\}}\phi(i,j,s)$$

In the spirit of the inside-outside algorithm, we can use a dynamic programming algorithm to calculate the product of $\phi(i,j,s)$ for a sentence. We wish to calculate the products of the $\phi(i,j,s)$ both inside and outside a given interval $\langle i,j\rangle$. Let $\mathcal{T}(m)$ be the set of binary tree structures with $m$ leaves. We then define $$I(i,j,s)=\displaystyle\sum_{T\in \mathcal{T}(j-i)}\prod_{\langle a,b\rangle:\langle a-i,b-i\rangle\in T}\phi(a,b,s)$$
To calculate $I$, we want to find a recursive decomposition. In the case where $j=i$, there can be no constituents between $i$ and $j$, so we have $I(i,i,s)=0$. In the case where $j=i+1$, there is necessarily exactly one constituent between $i$ and $j$, because all one-word substrings are constituents, so $I(i,i+1,s)=\phi(i,i+1,s)$. In the case where $j>i+1$, any constituent spanning from $i$ to $j$ must also contain subconstituents from $i$ to $k$ and from $k$ to $j$. Thus
$$I(i,j,s)=\displaystyle\sum_{i<k<j}\sum_{T\in\mathcal{T}(k-i)}\sum_{T'\in\mathcal{T}(j-k)}\phi(i,j,s)\prod_{\langle a,b\rangle:\langle a-i,b-i\rangle\in T}\phi(a,b,s)\prod_{\langle a',b'\rangle:\langle a'-k,b'-k\rangle\in T'}\phi(a',b',s)$$
$$=\phi(i,j,s)\displaystyle\sum_{i<k<j}\sum_{T\in\mathcal{T}(k-i)}\sum_{T'\in\mathcal{T}(j-k)}\prod_{\langle a,b\rangle:\langle a-i,b-i\rangle\in T}\phi(a,b,s)\prod_{\langle a',b'\rangle:\langle a'-k,b'-k\rangle\in T'}\phi(a',b',s)$$
$$=\phi(i,j,s)\displaystyle\sum_{i<k<j}I(i,k,s)I(k,j,s)$$

In summary, we have a recursive decomposition for $I$:
$$I(i,j,s) = \begin{cases}
\phi(i,j,s)\displaystyle\sum_{i<k<j}I(i,k,s)I(k,j,s) & \text{if } j-i>1 \\
\phi(i,j,s) & \text{if } j-i=1 \\
0 & \text{if } j-i = 0
\end{cases}$$
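
The recursion for $I$ translates directly into a bottom-up dynamic program. A minimal Python sketch, assuming a hypothetical helper \texttt{phi(i, j)} that returns $\phi(i,j,s)$ for the sentence at hand, might look as follows:

\begin{verbatim}
def inside_chart(n, phi):
    """Compute I(i,j) for all 0 <= i <= j <= n from the recursion above,
    shortest spans first.  phi(i, j) is a hypothetical helper returning
    phi(i,j,s) for the sentence at hand."""
    I = {}
    for length in range(0, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            if length == 0:
                I[i, j] = 0.0
            elif length == 1:
                I[i, j] = phi(i, j)
            else:
                I[i, j] = phi(i, j) * sum(I[i, k] * I[k, j]
                                          for k in range(i + 1, j))
    return I
\end{verbatim}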

Similarly, we define the outside probability
$$O(i,j,s) = \begin{cases}
\displaystyle\sum_{0\leq k<i}I(k,i,s)\phi(k,j,s)O(k,j,s) + \sum_{j<k\leq n} I(j,k,s)\phi(i,k,s)O(i,k,s) &
\text{if } j-i < n \\
1 & \text{if } j-i = n
\end{cases}$$

To reestimate the parameters $P_{\text{SPAN}}$ and $P_{\text{CONTEXT}}$, we wish to know $P_{\text{BRACKET}}(i,j,s)$, the fraction of bracketings of $s$ that contain $\langle i,j\rangle$ as a constituent:
$$P_{\text{BRACKET}}(i,j,s)=\frac{\displaystyle\sum_{\{B:B_{ij}=\text{true}\}}P(s,B)}{\displaystyle\sum_{B}P(s,B)}$$
We can rewrite this in terms of our inside and outside probabilities:
$$\displaystyle\sum_{\{B:B_{ij}=\text{true}\}}P(s,B)=K(s)I(i,j,s)O(i,j,s)$$

$$\displaystyle\sum_{B}P(s,B)=K(s)I(0,n,s)O(0,n,s)$$
But the $K(s)$ terms cancel, and $O(0,n,s)=1$, so we find that $$P_{\text{BRACKET}}(i,j,s)=\frac{I(i,j,s)O(i,j,s)}{I(0,n,s)}$$
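
The outside chart and $P_{\text{BRACKET}}$ can be computed in the same way; the sketch below reuses the \texttt{inside\_chart} and the hypothetical \texttt{phi} from above:

\begin{verbatim}
def outside_chart(n, phi, I):
    """Compute O(i,j) from the recursion above, longest spans first.
    I is the chart returned by inside_chart, phi as before."""
    O = {}
    for length in range(n, -1, -1):
        for i in range(0, n - length + 1):
            j = i + length
            if length == n:
                O[i, j] = 1.0
            else:
                O[i, j] = (sum(I[k, i] * phi(k, j) * O[k, j]
                               for k in range(0, i)) +
                           sum(I[j, k] * phi(i, k) * O[i, k]
                               for k in range(j + 1, n + 1)))
    return O

def p_bracket(i, j, I, O, n):
    """P_BRACKET(i,j,s) = I(i,j,s) O(i,j,s) / I(0,n,s)."""
    return I[i, j] * O[i, j] / I[0, n]
\end{verbatim}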

From $P_{\text{BRACKET}}$ we can calculate the next iteration of the model parameters $P_{\text{SPAN}}$ and $P_{\text{CONTEXT}}$. First, for a sentence $s$ of length $n_s$, note that the number of spans in the sentence is $N_s=(n_s+1)(n_s+2)/2$, which includes the spans of length 0. Furthermore, let $\#_s(\gamma)$ be the number of times that $\gamma$ appears in $s$, where $\gamma$ can be either a span or a context.

We begin with
$$P_{\text{SPAN}}(\alpha|c)=\displaystyle\sum_{s\in S}P(\alpha|c,s)P(s)=\frac{1}{|S|}\sum_{s\in S}P(\alpha|c,s)$$
This reduction works for $P_{\text{CONTEXT}}$ as well, so we have reduced the problem of calculating $P_{\text{SPAN}}$ and $P_{\text{CONTEXT}}$ on the corpus as a whole to calculating them for each sentence.

Let us begin by calculating $P_{\text{SPAN}}(\alpha|\text{true},s)$. By Bayes' Theorem, $$P_{\text{SPAN}}(\alpha|\text{true},s)=\frac{P_{\text{SPAN}}(\text{true}|\alpha,s)P_{\text{SPAN}}(\alpha|s)}{P_{\text{SPAN}}(\text{true}|s)}$$
We can compute each of the quantities on the right hand side:
$$P_{\text{SPAN}}(\alpha|s)=\frac{\#_s(\alpha)}{N_s}$$
$$P_{\text{SPAN}}(\text{true}|s)=\frac{2n_s-1}{N_s}$$
$$P_{\text{SPAN}}(\text{true}|\alpha,s)=\frac{1}{\#_s(\alpha)}\displaystyle\sum_{\langle i,j\rangle=\alpha}P_{\text{BRACKET}}(i,j,s)$$

Thus we find that
$$P_{\text{SPAN}}(\alpha|\text{true},s)=\frac{1}{\#_s(\alpha)}\displaystyle\sum_{\langle i,j\rangle=\alpha}P_{\text{BRACKET}}(i,j,s)\frac{\#_s(\alpha)}{N_s}\frac{N_s}{2n_s-1}=\frac{1}{2n_s-1}\displaystyle\sum_{\langle i,j\rangle=\alpha}P_{\text{BRACKET}}(i,j,s)$$

A similar calculation shows that $$P_{\text{CONTEXT}}(\beta|\text{true},s)=\frac{1}{2n_s-1}\displaystyle\sum_{\rangle i,j\langle=\beta}P_{\text{BRACKET}}(i,j,s)$$

Now let us consider $P_{\text{SPAN}}(\alpha|\text{false},s)$. Again by Bayes' Theorem we have $$P_{\text{SPAN}}(\alpha|\text{false},s)=\frac{P_{\text{SPAN}}(\text{false}|\alpha,s)P_{\text{SPAN}}(\alpha|s)}{P_{\text{SPAN}}(\text{false}|s)}$$

$P_{\text{SPAN}}(\alpha|s)$ is as before. $P_{\text{SPAN}}(\text{false}|s)=1-P_{\text{SPAN}}(\text{true}|s)$. And $P_{\text{SPAN}}(\text{false}|\alpha,s)=1-P_{\text{SPAN}}(\text{true}|\alpha,s)$. Thus
$$P_{\text{SPAN}}(\alpha|\text{false},s)=\frac
{\left(1-\frac{1}{\#_s(\alpha)}
\displaystyle\sum_{\langle i,j\rangle=\alpha}
P_{\text{BRACKET}}(i,j,s)\right)\frac
{\#_s(\alpha)}
{N_s}}
{1-\frac{2n_s-1}{N_s}}$$
$$=\frac
{\#_s(\alpha)-\displaystyle\sum_{\langle i,j\rangle=\alpha}P_{\text{BRACKET}}(i,j,s)}
{N_s-2n_s+1}$$

Similarly, we have $$P_{\text{CONTEXT}}(\beta|\text{false},s) = \frac{\#_s(\beta)-\displaystyle\sum_{\rangle i,j\langle=\beta}P_{\text{BRACKET}}(i,j,s)}{N_s-2n_s+1}$$

From these, we can calculate the new parameters $P_{\text{SPAN}}$ and $P_{\text{CONTEXT}}$ as above.
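
In the implementation, this reestimation is a loop over the corpus that accumulates the sums above. A per-sentence sketch for the constituent case, with a hypothetical helper \texttt{p\_bracket\_s} returning $P_{\text{BRACKET}}(i,j,s)$, is:

\begin{verbatim}
from collections import defaultdict

def span_counts_true(tags, p_bracket_s):
    """Per-sentence P_SPAN(alpha | true, s): sum P_BRACKET over the
    occurrences of each span type alpha, divided by 2*n_s - 1.
    p_bracket_s(i, j) is a hypothetical helper returning P_BRACKET(i,j,s);
    averaging the returned distributions over all sentences (plus the
    smoothing counts) gives the new P_SPAN(alpha | true)."""
    n = len(tags)
    dist = defaultdict(float)
    for i in range(n + 1):
        for j in range(i, n + 1):
            alpha = tuple(tags[i:j])
            dist[alpha] += p_bracket_s(i, j) / (2 * n - 1)
    return dist
\end{verbatim}

The context case is analogous, summing over contexts $\rangle i,j\langle$ instead of spans.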
\subsection{Initialization}
We initialize the model with a distribution $P_{\text{SPLIT}}$ over trees. $P_{\text{SPLIT}}$ is calculated as follows: for a string of $n$ words, choose a split point at random, then recursively build trees on either side of that split by further random splitting. We initialize the model using $P_{\text{BRACKET}}(i,j,s)=P_{\text{SPLIT}}(i,j,n_s)$. However, we also apply smoothing, adding small constant counts $m_\text{true}$ and $m_\text{false}$ of each span and context as a constituent and distituent, respectively. $m_\text{true}$ should be smaller than $m_\text{false}$, as there are more distituent spans than constituent spans. However, \citet{klein-thesis} and \citet{km-ccm} give different values for these constants, and in any case they must vary in relation to the size of the corpus used. Let $\#_\alpha$ and $\#_\beta$ be the number of unique spans and contexts in the corpus. We initialize the model as follows (a sketch of the $P_{\text{SPLIT}}$ computation is given after the list):
\begin{itemize}
\item $P_{\text{SPAN}}(\alpha|\text{true})= \frac {(\displaystyle\sum_{s\in S}\displaystyle\sum_{\langle i,j\rangle=\alpha}P_{\text{SPLIT}}(i,j,n_s))+m_\text{true}}
{(\displaystyle\sum_{s\in S}2n_s-1)+m_\text{true} \#_\alpha}$
\item $P_{\text{SPAN}}(\alpha|\text{false})= \frac {(\displaystyle\sum_{s\in S}\displaystyle\sum_{\langle i,j\rangle=\alpha}(1-P_{\text{SPLIT}}(i,j,n_s)))+m_\text{false}}
{(\displaystyle\sum_{s\in S}N_s-(2n_s-1))+m_\text{false} \#_\alpha}$
\item $P_{\text{CONTEXT}}(\beta|\text{true})= \frac {(\displaystyle\sum_{s\in S}\displaystyle\sum_{\rangle i,j\langle=\beta}P_{\text{SPLIT}}(i,j,n_s))+m_\text{true}}
{(\displaystyle\sum_{s\in S}2n_s-1)+m_\text{true} \#_\beta}$
\item $P_{\text{CONTEXT}}(\beta|\text{false})= \frac {(\displaystyle\sum_{s\in S}\displaystyle\sum_{\rangle i,j\langle=\beta}(1-P_{\text{SPLIT}}(i,j,n_s)))+m_\text{false}}
{(\displaystyle\sum_{s\in S}N_s-(2n_s-1))+m_\text{false} \#_\beta}$
\end{itemize}
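
As a sanity check on the initializer, $P_{\text{SPLIT}}(i,j,n_s)$ can be approximated by sampling from the random-splitting process just described; the following Monte Carlo sketch (our own, not part of the model) does exactly that:

\begin{verbatim}
import random
from collections import defaultdict

def sample_p_split(n, samples=10000):
    """Monte Carlo estimate of P_SPLIT(i,j,n): the fraction of randomly
    split binary trees over n words in which (i,j) occurs as a node."""
    counts = defaultdict(int)

    def split(i, j):
        counts[i, j] += 1
        if j - i <= 1:
            return
        k = random.randint(i + 1, j - 1)   # uniform split point
        split(i, k)
        split(k, j)

    for _ in range(samples):
        split(0, n)
    return {span: count / samples for span, count in counts.items()}
\end{verbatim}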
\subsection{Smoothing}
In addition to the smoothing used during initialization, some further smoothing is necessary during reestimation. In our initial calculations, we made the critical step of factoring out $\displaystyle\prod_{\{\langle i,j\rangle_s\}}P_{\text{SPAN}}(\langle i,j\rangle_s|\text{false})P_{\text{CONTEXT}}(\rangle i,j\langle_s|\text{false})$. This factorization assumes that $P_{\text{SPAN}}(\langle i,j\rangle_s|\text{false})$ and $P_{\text{CONTEXT}}(\rangle i,j\langle_s|\text{false})$ are nonzero for all spans and contexts. Particularly in the case of small corpora, this assumption is not always valid. For example, if a span $\alpha$ always appears as a full sentence--never as a proper substring--then its $P_{\text{SPAN}}(\alpha|\text{false})$ will be zero because a full sentence is never a distituent. Or if a word $a$ only appears as the second word of a sentence, then the context $(\diamond,a)$ will have $P_{\text{CONTEXT}}((\diamond,a)|\text{false})=0$ because the one-word span that it contains will always be a constituent.

To avoid this problem, we can use further smoothing during reestimation. During each reestimation of $P_{\text{SPAN}}$ and $P_{\text{CONTEXT}}$, we add to each span and context small constant counts of $m_\text{true}$ as a constituent and $m_\text{false}$ as a distituent.

\subsection{Results}
Our implementation of the CCM model has not yet been tested on the Wall Street Journal corpus. On small toy corpora, it behaves unpredictably. For instance, on the corpus consisting of the sentences $a b c$ and $b c a$, the model gives higher probabilities to the spans $a b$ and $c a$ as constituents than it does to $b c$. However, this is the opposite of what we would expect. We have not yet been able to identify the source of this problem.
\section{A Dependency Model with Valence}
\newcommand{\LOC}[1]{\textbf{#1}}
\newcommand{\GOR}[1]{\overrightarrow{#1}}
\newcommand{\RGOL}[1]{\overleftarrow{\overrightarrow{#1}}}
\newcommand{\SEAL}[1]{\overline{#1}}
\newcommand{\LGOR}[1]{\overrightarrow{\overleftarrow{#1}}}
\newcommand{\GOL}[1]{\overleftarrow{#1}}
\newcommand{\LN}[1]{\underleftarrow{#1}}
\newcommand{\RN}[1]{\underrightarrow{#1}}
\newcommand{\XI}{\lessdot}
\newcommand{\XJ}{\gtrdot}
\newcommand{\SMTR}[1]{\dddot{#1}}
\newcommand{\SDTR}[1]{\ddot{#1}}

The DMV is a \emph{head-outward process}. Given a certain head (e.g.\ a
verb), we first attach possible arguments in one direction (e.g.\ nouns
to the right), then stop in that direction, then possibly attach
arguments in the other direction, then stop. The probabilities of a
certain head attaching any arguments in a direction sum to one
\citep[see~$P(D(h))$~in][p.~87]{klein-thesis}; when modelled as a PCFG,
however \citep[p.~83,~figure~6.4]{klein-thesis}, the probabilities for
a given left-hand side symbol may sum to less than one.

In the following sections we try to flesh out the details of the
inside-outside algorithm \citep{lari-csl90} applied to the DMV model
of \citet{klein-thesis}. We have three parameters which are
reestimated using the inside-outside algorithm:

\begin{itemize}
\item $P_{ATTACH}(a|h,dir)$ is the probability of attaching an
argument with tag $a$, given that we are attaching in direction
$dir$ from head $h$. Here $a,h \in \mathbb{T},
dir\in\{left,right\}$.
\item $P_{STOP}(stop|h,dir,adj)$ is the probability of stopping (which
happens after argument attachment), given that we are attaching from
head $h$ in direction $dir$. Adjacency $adj$ is whether or not we
have attached anything yet in that direction.
\item $P_{ROOT}(h)$ is the probability of the special ROOT symbol
heading $h$; in the original model this is modelled using a right
stop, a left stop, and an attachment
\citep[p.~105,~figure~A.2]{klein-thesis}; we conflated these
into one additional parameter.
\end{itemize}

There is also a parameter $P_{ORDER}$, signifying the probability of a
head $w$ attaching to the right first or to the left first. In the
results given below,
\[ \forall{w}[P_{ORDER}(right\text{-}first|w)=1.0] \]
(this parameter is not reestimated).
\subsection{Note on notation}
$i, j, k$ are sentence positions (between words), where $i$ and $j$
are always the start and end, respectively, of what we are calculating
($k$ is between $i$ and $j$ for $P_{INSIDE}$, to their right or left
for $P_{OUTSIDE}$). $s \in S$ are sentences in the corpus. $\LOC{w}$
is a word token (actually a POS token) of type $w$ at a certain sentence
location. If $\LOC{w}$ is between $i$ and $i+1$, then $loc(\LOC{w})=i$,
following \citet{klein-thesis}, meaning $i$ is adjacent to $\LOC{w}$
on the left, while $j=loc(\LOC{w})+1$ means that $j$ is adjacent to
$\LOC{w}$ on the right. To simplify, $loc_l(\LOC{w}):=loc(\LOC{w})$ and
$loc_r(\LOC{w}):=loc(\LOC{w})+1$. We write $\LOC{h}$ if this is a head
in the rule being used, and $\LOC{a}$ if it is an attached argument.

There are some notational differences between the thesis
\citep{klein-thesis} and the ACL paper \citep{km-dmv}:

\begin{tabular}{cc}
Paper: & Thesis: \\
$w$ & $\GOR{w}$ \\
$w\urcorner$ & $\RGOL{w}$ \\
$\ulcorner{w}\urcorner$ & $\SEAL{w}$ \\
\end{tabular}

We use $\SMTR{w}$ (or $\SDTR{w}$) to signify one of either $w, \GOR{w},
\RGOL{w}, \LGOR{w}, \GOL{w}$ or $\SEAL{w}$\footnote{This means that
$\SMTR{\LOC{w}}$ is the triplet of the actual POS tag, its sentence
location as a token, and the ``level of seals''.}.

\subsection{Inside probabilities}
$P_{INSIDE}$ is defined in \citet[pp.~106--108]{klein-thesis}; the only
thing we need to add is that for right attachments,
$i \leq loc_l(\LOC{w})<k \leq loc_l(\LOC{a})<j$, while for left attachments,
$i \leq loc_l(\LOC{a})<k \leq loc_l(\LOC{w})<j$.

\subsubsection{Sentence probability}
$P_s$ is the sentence probability, based on
\citet[p.~38]{lari-csl90}. Since the ROOT rules are different from the
rest, we sum over them explicitly in this definition:
\begin{align*}
P_s = \sum_{\LOC{w} \in s} P_{ROOT}(\LOC{w}) P_{INSIDE}(\SEAL{\LOC{w}}, 0, n_s)
\end{align*}
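
Given implementations of $P_{INSIDE}$ and $P_{ROOT}$, the sentence probability is a direct sum over word tokens. A minimal sketch, with hypothetical \texttt{inside} and \texttt{p\_root} helpers standing in for the chart lookup and the root parameter, is:

\begin{verbatim}
def sentence_probability(tags, inside, p_root):
    """P_s = sum over tokens w in s of P_ROOT(w) * P_INSIDE(sealed w, 0, n_s).
    inside(tag, loc, 'SEAL', i, j) and p_root(tag) are hypothetical
    stand-ins for our chart lookup and root parameter."""
    n = len(tags)
    return sum(p_root(w) * inside(w, loc, 'SEAL', 0, n)
               for loc, w in enumerate(tags))
\end{verbatim}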

\subsection{Outside probabilities}

\begin{align*}
P_{OUTSIDE_s}(ROOT, i, j) = \begin{cases}
1.0 & \text{ if $i = 0$ and $j = n_s$,}\\
0.0 & \text{ otherwise}
\end{cases}
\end{align*}

For $P_{OUTSIDE}(\SEAL{w}, i, j)$, $w$ is attached to by something
else ($\SEAL{w}$ is what we elsewhere call $\SEAL{a}$). Adjacency is
thus calculated on the basis of $h$, the head of the rule. If we are
attached to from the left we have $i \leq loc_l(\LOC{w}) < j \leq loc_l(\LOC{h}) < k$, while
from the right we have $k \leq loc_l(\LOC{h}) < i \leq loc_l(\LOC{w}) < j$:
\begin{align*}
P_{OUTSIDE}&(\SEAL{\LOC{w}}, i, j) = \\
& P_{ROOT}(w) P_{OUTSIDE}(ROOT, i, j) + \\
& [ \sum_{k > j} ~ \sum_{\LOC{h}:j\leq loc_l(\LOC{h})<k} \sum_{\SMTR{\LOC{h}} \in \{\RGOL{\LOC{h}},\GOL{\LOC{h}}\}} P_{STOP}(\neg stop|h, left, adj(j, \LOC{h})) P_{ATTACH}(w|h, left) \\
& \qquad \qquad \qquad \qquad \qquad P_{OUTSIDE}(\SMTR{\LOC{h}}, i, k) P_{INSIDE}(\SMTR{\LOC{h}}, j, k) ] ~ + \\
& [ \sum_{k < i} ~ \sum_{\LOC{h}:k\leq loc_l(\LOC{h})<i} \sum_{\SMTR{\LOC{h}} \in \{\LGOR{\LOC{h}},\GOR{\LOC{h}}\}} P_{STOP}(\neg stop|h, right, adj(i, \LOC{h})) P_{ATTACH}(w|h, right) \\
& \qquad \qquad \qquad \qquad \qquad P_{INSIDE}(\SMTR{\LOC{h}}, k, i) P_{OUTSIDE}(\SMTR{\LOC{h}}, k, j) ]
\end{align*}

For $\RGOL{w}$ we know it is either under a left stop rule or it is
the right daughter of a left attachment rule ($k \leq loc_l(\LOC{a}) <
i \leq loc_l(\LOC{w}) < j$), and these are adjacent if the start point
($i$) equals $loc_l(\LOC{w})$:
\begin{align*}
P_{OUTSIDE}(\RGOL{\LOC{w}}, i, j) = & P_{STOP}(stop|w, left, adj(i,
\LOC{w}))P_{OUTSIDE}(\SEAL{\LOC{w}}, i, j) ~ + \\
& [ \sum_{k < i} ~ \sum_{\LOC{a}:k\leq loc_l(\LOC{a})<i} P_{STOP}(\neg stop|w, left, adj(i, \LOC{w})) P_{ATTACH}(a|w, left) \\
& ~~~~~~~~~~~~~~~~~~~~~~~~~ P_{INSIDE}(\SEAL{\LOC{a}}, k, i) P_{OUTSIDE}(\RGOL{\LOC{w}}, k, j) ]
\end{align*}

For $\GOR{w}$ we are either under a right stop or the left daughter of
a right attachment rule ($i \leq loc_l(\LOC{w}) < j \leq
loc_l(\LOC{a}) < k$), adjacent iff the end point ($j$) equals
$loc_r(\LOC{w})$:
\begin{align*}
P_{OUTSIDE}(\GOR{\LOC{w}}, i, j) = & P_{STOP}(stop|w, right, adj(j,
\LOC{w}))P_{OUTSIDE}(\RGOL{\LOC{w}}, i, j) ~ + \\
& [ \sum_{k > j} ~ \sum_{\LOC{a}:j\leq loc_l(\LOC{a})<k} P_{STOP}(\neg stop|w, right, adj(j, \LOC{w})) P_{ATTACH}(a|w, right) \\
& ~~~~~~~~~~~~~~~~~~~~~~~~~ P_{OUTSIDE}(\GOR{\LOC{w}}, i, k) P_{INSIDE}(\SEAL{\LOC{a}}, j, k) ]
\end{align*}

$\GOL{w}$ is just like $\RGOL{w}$, except for the outside probability
of having a stop above, where we use $\LGOR{w}$:
\begin{align*}
P_{OUTSIDE}(\GOL{\LOC{w}}, i, j) = & P_{STOP}(stop|w, left, adj(i,
\LOC{w}))P_{OUTSIDE}(\LGOR{\LOC{w}}, i, j) ~ + \\
& [ \sum_{k < i} ~ \sum_{\LOC{a}:k\leq loc_l(\LOC{a})<i} P_{STOP}(\neg stop|w, left, adj(i, \LOC{w})) P_{ATTACH}(a|w, left) \\
& ~~~~~~~~~~~~~~~~~~~~~~~~~ P_{INSIDE}(\SEAL{\LOC{a}}, k, i) P_{OUTSIDE}(\GOL{\LOC{w}}, k, j) ]
\end{align*}

$\LGOR{w}$ is just like $\GOR{w}$, except for the outside probability
of having a stop above, where we use $\SEAL{w}$:
\begin{align*}
P_{OUTSIDE}(\LGOR{\LOC{w}}, i, j) = & P_{STOP}(stop|w, right, adj(j,
\LOC{w}))P_{OUTSIDE}(\SEAL{\LOC{w}}, i, j) ~ + \\
& [ \sum_{k > j} ~ \sum_{\LOC{a}:j\leq loc_l(\LOC{a})<k} P_{STOP}(\neg stop|w, right, adj(j, \LOC{w})) P_{ATTACH}(a|w, right) \\
& ~~~~~~~~~~~~~~~~~~~~~~~~~ P_{OUTSIDE}(\LGOR{\LOC{w}}, i, k) P_{INSIDE}(\SEAL{\LOC{a}}, j, k) ]
\end{align*}

\subsection{DMV Reestimation}
$P_{INSIDE}$ and $P_{OUTSIDE}$ (computed over the corpus given a certain
probability distribution) give us the expected counts we need to reestimate our
model parameters.

\subsubsection{$c$ and $w$}
First we need some helper functions.
$c_s(\SMTR{\LOC{w}} : i, j)$ is ``the expected fraction of parses of
$s$ with a node labeled $\SMTR{w}$ extending from position $i$ to
position $j$'' \citep[p.~88]{klein-thesis}, here defined to equal
$v_{q}$ of \citet[p.~41]{lari-csl90}\footnote{In terms of regular EM,
this is the count of trees ($f_{T_q}(x)$ in
\citet[p.~46]{prescher-em}) in which the node extended from $i$ to
$j$.}:
\begin{align*}
c_s(\SMTR{\LOC{w}} : i, j) = P_{INSIDE_s}(\SMTR{\LOC{w}}, i, j) P_{OUTSIDE_s}(\SMTR{\LOC{w}}, i, j) / P_s
\end{align*}

$w_s$ is $w_{q}$ from \citet[p.~41]{lari-csl90}, generalized to $\SMTR{h}$ and $dir$:
\begin{align*}
w_s(\SEAL{a} & : \SMTR{\LOC{h}}, left, i, j) = \\
& 1/P_s \sum_{k:i<k<j} ~ \sum_{\LOC{a}:i\leq loc_l(\LOC{a})<k}
& P_{STOP}(\neg stop|h, left, adj(k, \LOC{h})) P_{ATTACH}(a|h, left) \\
& & P_{INSIDE_s}(\SEAL{\LOC{a}}, i, k) P_{INSIDE_s}(\SMTR{\LOC{h}}, k, j) P_{OUTSIDE_s}(\SMTR{\LOC{h}}, i, j)
\end{align*}
\begin{align*}
w_s(\SEAL{a} & : \SMTR{\LOC{h}}, right, i, j) = \\
& 1/P_s \sum_{k:i<k<j} ~ \sum_{\LOC{a}:k\leq loc_l(\LOC{a})<j}
& P_{STOP}(\neg stop|h, right, adj(k, \LOC{h})) P_{ATTACH}(a|h, right) \\
& & P_{INSIDE_s}(\SMTR{\LOC{h}}, i, k) P_{INSIDE_s}(\SEAL{\LOC{a}}, k, j) P_{OUTSIDE_s}(\SMTR{\LOC{h}}, i, j)
\end{align*}

Let $\hat{P}$ be the new STOP/ATTACH-probabilities (since the old $P$
are used in $P_{INSIDE}$ and $P_{OUTSIDE}$).

\subsubsection{Attachment reestimation}

This is based on $\hat{a}$, given in \citet[p.~41]{lari-csl90}.

When attaching to the left,
$i<loc_l(\LOC{h})$ since we want trees with at least one attachment:
\begin{align*}
\hat{P}_{ATTACH} (a | h, left) = \frac
{ \sum_{s \in S} \sum_{\LOC{h} \in s} \sum_{i<loc_l(\LOC{h})} \sum_{j\geq loc_r(\LOC{h})} [w_s(\SEAL{a} : \GOL{\LOC{h}}, left, i, j) + w_s(\SEAL{a} : \RGOL{\LOC{h}}, left, i, j) ]}
{ \sum_{s \in S} \sum_{\LOC{h} \in s} \sum_{i<loc_l(\LOC{h})} \sum_{j\geq loc_r(\LOC{h})} [c_s(\GOL{\LOC{h}} : i, j) + c_s(\RGOL{\LOC{h}} : i, j) ]}
\end{align*}

Below, $j>loc_r(\LOC{h})$ since we want at least one attachment to the right:
\begin{align*}
\hat{P}_{ATTACH} (a | h, right) = \frac
{ \sum_{s \in S} \sum_{\LOC{h} \in s} \sum_{i\leq loc_l(\LOC{h})} \sum_{j>loc_r(\LOC{h})} [w_s(\SEAL{a} : \LGOR{\LOC{h}}, right, i, j) + w_s(\SEAL{a} : \GOR{\LOC{h}}, right, i, j) ]}
{ \sum_{s \in S} \sum_{\LOC{h} \in s} \sum_{i\leq loc_l(\LOC{h})} \sum_{j>loc_r(\LOC{h})} [c_s(\LGOR{\LOC{h}} : i, j) + c_s(\GOR{\LOC{h}} : i, j) ]}
\end{align*}

For the first/lowest attachments, $w_s$ and $c_s$ have zero probability
where $i<loc_l(\LOC{h})$ (for $\GOR{h}$) or $j>loc_r(\LOC{h})$ (for $\GOL{h}$);
this is implicit in $P_{INSIDE}$.

\subsubsection{Stop reestimation}
The following is based on \citet[p.~88]{klein-thesis}. For the
non-adjacent rules, $i<loc_l(\LOC{h})$ on the left and
$j>loc_r(\LOC{h})$ on the right, while for the adjacent rules these
are equal (respectively)\footnote{For left-stopping with right-first
attachments, $j \geq loc_r(\LOC{h})$ since we may have
right-attachments below; for left-stopping with left-first
attachments we only need $j=loc_r(\LOC{h})$.}.

To avoid some redundancy below, define a helper function $\hat{d}$ as follows:
\begin{align*}
\hat{d}(\SMTR{h},\XI,\XJ) =
{ \sum_{s \in S} \sum_{\SMTR{\LOC{h}}:\LOC{h} \in s} \sum_{i:i \XI loc_l(\LOC{h})} \sum_{j:j \XJ loc_r(\LOC{h})} c_s(\SMTR{\LOC{h}} : i, j) }
\end{align*}

Then these are our reestimated stop probabilities:
\begin{align*}
\hat{P}_{STOP} (STOP|h, left, non\text{-}adj) = \frac
{ \hat{d}(\SEAL{h},<,\geq) + \hat{d}(\LGOR{h},<,=) }
{ \hat{d}(\RGOL{h},<,\geq) + \hat{d}(\GOL{h},<,=) }
\end{align*}

\begin{align*}
\hat{P}_{STOP} (STOP|h, left, adj) = \frac
{ \hat{d}(\SEAL{h},=,\geq) + \hat{d}(\LGOR{h},=,=) }
{ \hat{d}(\RGOL{h},=,\geq) + \hat{d}(\GOL{h},=,=) }
\end{align*}

\begin{align*}
\hat{P}_{STOP} (STOP|h, right, non\text{-}adj) = \frac
{ \hat{d}(\RGOL{h},=,>) + \hat{d}(\SEAL{h},\leq,>) }
{ \hat{d}(\GOR{h},=,>) + \hat{d}(\LGOR{h},\leq,>) }
\end{align*}

\begin{align*}
\hat{P}_{STOP} (STOP|h, right, adj) = \frac
{ \hat{d}(\RGOL{h},=,=) + \hat{d}(\SEAL{h},\leq,=) }
{ \hat{d}(\GOR{h},=,=) + \hat{d}(\LGOR{h},\leq,=) }
\end{align*}

\subsubsection{Root reestimation}
Following \citet[p.~46]{prescher-em}, to find the reestimated
probability of a PCFG rule, we first find the new treebank frequencies
$f_{T_P}(tree)=P(tree)/P_s$, then for a rule $X' \rightarrow X$ we
divide the new frequencies of the trees which use this rule by
those of the trees containing the node $X'$. $ROOT$ appears once per
tree, meaning we divide by $1$ per sentence\footnote{Assuming each
tree has frequency $1$.}, so $\hat{P}_{ROOT}(h)=\sum_{tree:ROOT
\rightarrow \SEAL{h} \text{ used in } tree} f_{T_P}(tree)=\sum_{tree:ROOT
\rightarrow \SEAL{h} \text{ used in } tree} P(tree)/P_s$, which turns into:

\begin{align*}
\hat{P}_{ROOT} (h) = \frac
{\sum_{s\in S} 1 / P_s \cdot \sum_{\LOC{h}\in s} P_{ROOT}(\LOC{h}) P_{INSIDE_s}(\SEAL{h}, 0, n_s)}
{|S|}
\end{align*}
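
Concretely, this can be accumulated in a single pass over the corpus; a sketch under the same assumptions as the sentence-probability sketch above (hypothetical \texttt{inside} and \texttt{p\_root} helpers) is:

\begin{verbatim}
from collections import defaultdict

def reestimate_root(corpus, inside, p_root):
    """New P_ROOT(h) = (1/|S|) * sum_s (1/P_s) *
       sum over tokens of h in s of P_ROOT(h) * P_INSIDE(sealed h, 0, n_s).
    `inside` and `p_root` are the same hypothetical helpers as before."""
    totals = defaultdict(float)
    for tags in corpus:
        n = len(tags)
        P_s = sentence_probability(tags, inside, p_root)
        for loc, h in enumerate(tags):
            totals[h] += p_root(h) * inside(h, loc, 'SEAL', 0, n) / P_s
    return {h: total / len(corpus) for h, total in totals.items()}
\end{verbatim}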

\subsection{Alternate CNF-like rules}
Since the IO algorithm as described in \citet{lari-csl90} is made for
rules in Chomsky Normal Form (CNF), we have an alternate grammar
(table \ref{tab:cnf-like}) for running tests, where we don't have to sum
over the different $loc(h)$ in IO. This is not yet generalized to
include left-first attachment. It is also not quite CNF, since it
includes some unary rewrite rules.

\begin{table*}[ht]
\centering
\begin{tabular} % four left-aligned math tabs, one vertical line
{ >{$}l<{$} >{$}l<{$} >{$}l<{$} | >{$}l<{$} }
\multicolumn{3}{c}{Rule} & \multicolumn{1}{c}{$P_{RULE}$ ($a[i,j,k]$ in \citet{lari-csl90})}\\
\hline{}
\RN{\GOR{h}} \rightarrow& \GOR{h} &\SEAL{a} &P_{STOP}(\neg stop|h, right, adj) \cdot P_{ATTACH}(a|h, right) \\
&&&\\
\RN{\GOR{h}} \rightarrow& \RN{\GOR{h}} &\SEAL{a} &P_{STOP}(\neg stop|h, right, non\text{-}adj) \cdot P_{ATTACH}(a|h, right) \\
&&&\\
\RGOL{h} \rightarrow& \GOR{h} &STOP &P_{STOP}(stop|h, right, adj) \\
&&&\\
\RGOL{h} \rightarrow& \RN{\GOR{h}} &STOP &P_{STOP}(stop|h, right, non\text{-}adj) \\
&&&\\
\LN{\RGOL{h}} \rightarrow& \SEAL{a} &\RGOL{h} &P_{STOP}(\neg stop|h, left, adj) \cdot P_{ATTACH}(a|h, left) \\
&&&\\
\LN{\RGOL{h}} \rightarrow& \SEAL{a} &\LN{\RGOL{h}} &P_{STOP}(\neg stop|h, left, non\text{-}adj) \cdot P_{ATTACH}(a|h, left) \\
&&&\\
\SEAL{h} \rightarrow& STOP &\RGOL{h} &P_{STOP}(stop|h, left, adj) \\
&&&\\
\SEAL{h} \rightarrow& STOP &\LN{\RGOL{h}} &P_{STOP}(stop|h, left, non\text{-}adj) \\
\end{tabular}
\caption{Alternate CFG rules (where a child node has an arrow below,
we use non-adjacent probabilities), defined for all words/POS-tags
$h$.}\label{tab:cnf-like}
\end{table*}

The inside probabilities are the same as those given in
\citet{lari-csl90}, with the following exceptions:

When calculating $P_{INSIDE}(\SMTR{h}, i, j)$ and summing through
possible rules which rewrite $\SMTR{h}$, if a rule is of the form
$\SMTR{h} \rightarrow STOP ~ \SDTR{h}$ or $\SMTR{h} \rightarrow
\SDTR{h} ~ STOP$, we add $P_{RULE}\cdot P_{INSIDE}(\SDTR{h}, i, j)$
(that is, rewrite for the same sentence range); and, as a consequence
of these unary rules: for ``terminal rules'' ($P_{ORDER}$) to be
applicable, not only must $i = j-1$, but also the left-hand side
symbol of the rule must be of the form $\GOR{h}$.

Similarly, the outside probabilities are the same as those for pure
CNF rules, with the exception that we add the unary rewrite
probabilities
\begin{align*}
\sum_{\SMTR{h}} [&P_{OUTSIDE}(\SMTR{h},i,j)\cdot P_{RULE}(\SMTR{h} \rightarrow \SDTR{h} ~ STOP) \\
+ &P_{OUTSIDE}(\SMTR{h},i,j)\cdot P_{RULE}(\SMTR{h} \rightarrow STOP ~ \SDTR{h})]
\end{align*}
to $P_{OUTSIDE}(\SDTR{h},i,j)$ (e.g.\ $f(s,t,i)$).

This grammar gave the same results for inside and outside
probabilities when run over our corpus.

\subsection{Initialization}
\citet{klein-thesis} describes DMV initialization using a ``harmonic
distribution'' for the initial probabilities, where the probability of
one word heading another is higher if they appear closer to one
another.

There are several ways this could be implemented. We initialized attachment
probabilities with the following formula:

\begin{align*}
P_{ATTACH}(a|h,right) = \frac
{\sum_{s \in S}\sum_{\LOC{h} \in s} \sum_{\LOC{a} \in s:loc(\LOC{a})>loc(\LOC{h})} 1/(loc(\LOC{a})-loc(\LOC{h})) + C_A}
{\sum_{s \in S}\sum_{\LOC{h} \in s} \sum_{\LOC{w} \in s:loc(\LOC{w})>loc(\LOC{h})} 1/(loc(\LOC{w})-loc(\LOC{h})) + C_A}
\end{align*}
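
A sketch of how such harmonic attachment counts could be collected before normalization is given below; note that the placement of $C_A$ relative to the sums is ambiguous in the formula above, and this sketch simply adds it once per head--argument pair:

\begin{verbatim}
from collections import defaultdict

def harmonic_attach_counts(corpus, C_A=0.0):
    """Weighted counts for the harmonic right-attachment initializer:
    each (head, argument) pair to the right contributes 1/distance.
    Adding C_A once per pair is an assumption of this sketch."""
    num = defaultdict(float)   # (head tag, argument tag) -> weighted count
    den = defaultdict(float)   # head tag -> total weighted count
    for tags in corpus:
        for h_loc, h in enumerate(tags):
            for a_loc in range(h_loc + 1, len(tags)):
                weight = 1.0 / (a_loc - h_loc) + C_A
                num[h, tags[a_loc]] += weight
                den[h] += weight
    return num, den

def p_attach_right(a, h, num, den):
    """Initial P_ATTACH(a | h, right) from the harmonic counts."""
    return num[h, a] / den[h] if den[h] > 0 else 0.0
\end{verbatim}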

The probability of stopping adjacently (left or right) was increased
whenever a word occurred at a (left or right) sentence
border\footnote{For non-adjacent stopping we checked for occurrence at
the second(-to-last) position.}:

\begin{align*}
f(stop:\LOC{h},left,adj)=\begin{cases}
C_S \text{, if } loc(\LOC{h}) = 0,\\
0 \text{, otherwise}
\end{cases}
\end{align*}

\begin{align*}
P_{STOP}(stop|h,left,adj) = \frac
{C_{M} + \sum_{s \in S}\sum_{\LOC{h} \in s} f(stop:\LOC{h},left,adj)}
{C_{M} + \sum_{s \in S}\sum_{\LOC{h} \in s} C_S+C_N}
\end{align*}

Tweaking the initialization constants $C_A, C_M, C_S$ and $C_N$
allowed us to easily try different initial distributions (e.g.\ to modify
the initial grammar in the direction of a uniform distribution). The
next section discusses the effect of these.

\subsection{Results}
We compared the results of the implementation with a dependency-parsed
version of the WSJ-10 corpus (converted from the manually annotated
version using a Perl script by Valentin Jijkoun). Since single-word
sentences were not POS-tagged there, these were skipped. Also, the
dependency-parsed WSJ-10 did not have ROOT nodes, so we calculated
precision and recall both without counting our ROOT links, and with
ROOT links counted, by adding these to the gold parses where
possible\footnote{221 parses in the dependency-parsed WSJ-10 had
several tokens appearing as heads without appearing as dependents in
the same parse; we skipped these parses when calculating with
ROOT links. Our gold standard also sometimes (for 1249 sentences)
had one dependent with two heads; these were skipped in
evaluation. We have not yet had time to run an evaluation of undirected
dependencies.}.
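
For reference, directed dependency precision, recall and F1 can be computed by comparing sets of (head position, dependent position) links per sentence; the following simplified sketch (not our exact evaluation script) shows the idea:

\begin{verbatim}
def directed_f1(gold_links, guess_links):
    """Directed dependency F1 for one sentence.  Both arguments are sets of
    (head position, dependent position) pairs; ROOT links can be included
    or left out on both sides, as discussed above."""
    correct = len(gold_links & guess_links)
    precision = correct / len(guess_links) if guess_links else 0.0
    recall = correct / len(gold_links) if gold_links else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
\end{verbatim}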

\begin{table*}[hb]
\centering
\begin{tabular}{l|cc}
Model & \multicolumn{2}{c}{F1, directed} \\
\hline
LBRANCH/RHEAD & \multicolumn{2}{c}{33.6} \\
RANDOM & \multicolumn{2}{c}{30.1} \\
RBRANCH/LHEAD & \multicolumn{2}{c}{24.0} \\
K\&M's DMV & \multicolumn{2}{c}{43.2} \\[3pt]
Our DMV: & no ROOT & ROOT added \\
\hline
Uniform initial distribution & 24.9 & 22.1 \\
$C_A=0; C_S=1;C_N=0.1;C_M=1$ & 25.2 & 24.9 \\
$C_A=0; C_S=1;C_N=0.1;C_M=10$ & 26.9 & 31.9 \\
% $C_A=10;C_S=1;C_N=0.1;C_M=10$ & 25.0 & 30.1 \\
$C_A=10;C_S=1;C_N=3 ;C_M=10$ & 26.3 & 31.4 \\
$C_A=15;C_S=3;C_N=1 ;C_M=20$ & 27.2 & 32.2 \\
\end{tabular}
\caption{DMV results on the WSJ-10 for various initialization values (the right-hand column counts ROOT links added to the gold parses).}
\label{tab:dmv-wsj}
\end{table*}

We tried various values for the initialization constants, but it was
hard to find any clear pattern for what worked best.
%$C_N$ values of 10 or less worked well,
% todo

Table \ref{tab:dmv-wsj} shows the results of running our
implementation on the WSJ-10 corpus, compared with the dependency-parsed
version, for given values of the initialization constants. The
EM hilltop seemed to be reached after about 30 iterations for our best
results (sooner for the worse results). As the table indicates
(especially the numbers for ``rooted'' evaluation), different initial
settings can lead to very different results\footnote{One alternative
idea we still haven't tried is basing initial DMV stopping
probabilities on \emph{distance} from sentence start/end, instead of
the ``binary'' test given above.}.

The numbers are not completely comparable though, since our corpus was
6268 sentences, while \citet{km-dmv} report 7422 sentences; also, it is
not clear whether their DMV experiment was run using automatically
induced word classes \citep[Schütze, 1995, in][p.~8]{km-dmv} or on the
tagset used to manually annotate the WSJ-10.

% underproposed = # $C_A=10;C_S=1;C_N=3 ;C_M=10$
% {(('NN', 1), ('DT', 0)): 347,
% (('NN', 2), ('DT', 0)): 148,
% (('NN', 3), ('DT', 2)): 136,
% (('NN', 4), ('DT', 3)): 144,
% (('NN', 5), ('DT', 4)): 128,
% (('NN', 6), ('DT', 5)): 124,
% (('NNP', 1), ('NNP', 0)): 358,
% (('NNP', 2), ('NNP', 0)): 125,
% (('NNP', 2), ('NNP', 1)): 174,
% (('NNS', 1), ('JJ', 0)): 124,
% (('ROOT', -1), ('NN', 0)): 100,
% (('ROOT', -1), ('NN', 1)): 106,
% (('ROOT', -1), ('NNP', 1)): 140,
% (('ROOT', -1), ('NNP', 2)): 104,
% (('VBD', 2), ('NN', 1)): 145,
% (('VBP', 2), ('NNS', 1)): 111,
% (('VBZ', 2), ('NN', 1)): 152,
% (('VBZ', 2), ('NNP', 1)): 107,
% (('VBZ', 3), ('NN', 2)): 109, }

In the stochastic grammars given by the model after EM, the POS class
$VBD$ (past tense verb) had the highest probability of being attached
to by ROOT, followed by $VBZ$ (3sg present) and $VBP$ (non-3sg
present), which looks promising. Interestingly, in fourth place we see
the tag $:$, and in sixth, $``$. Results might have been different
with such tokens stripped from the corpus and the evaluation standard;
we have not tried this yet\footnote{A ROOT attachment to $:$ does make
sense though, in the same way that a comma might be the head of a
conjunction. Our evaluation standard almost always
% 97 per cent of the time..
puts colons as dependents of the phrase head (often an NP
fragment).}.

Just like \citet[p.~89]{klein-thesis}, our algorithm chose determiners
to be the heads of nouns, rather than the opposite.
% Also, in general
% non-adjacent stopping was a lot more likely than adjacent stopping, a
% situation given by the DMV model.

% \section{The combined model (?)}
% \subsection{Results (?)}

\section{Conclusion}
From our implementations it seems that these processes capture some of
the statistical regularities in language, e.g.\ the strong likelihood of
verbal sentence heads; however, our evaluations show that we did not
manage to replicate Klein \& Manning's parsing results, let alone
improve on them.

\nocite{lari-csl90}
\nocite{klein-thesis}
\nocite{km-dmv}
\bibliography{./statistical.bib}
\bibliographystyle{plainnat}

\end{document}