% merged Kevin's & Emily's reports 2008-09-21 Sun 19:04 with Emerge

\documentclass[11pt,a4paper]{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{hyperref}
\usepackage{natbib}

\usepackage{pslatex}
\usepackage{pdfsync}
\pdfoutput=1

\usepackage{qtree}
\usepackage{amsmath}
\usepackage{amssymb}

\usepackage{avm}
\avmfont{\sc}
\avmoptions{sorted,active}
\avmvalfont{\rm}
\avmsortfont{\scriptsize\it}

\usepackage{array} % for a tabular with math in it

\title{The DMV and CCM models}
\author{Emily Morgan \& Kevin Brubeck Unhammer}
\date{15 September 2008}

\begin{document}

\maketitle

\tableofcontents
\section{Introduction}
\citet{klein-thesis} describes two models for the unsupervised
learning of syntactic structure. One is the Constituent Context Model
(CCM), which uses the context as well as the content of a substring to determine whether it
is a constituent. The context of a substring is defined to be the words
immediately preceding and following the substring. The intuition is that context can be a reliable cue for constituentness because when a long phrase is a constituent, there is often a similar case where a single word will serve as a constituent in the same context. Thus, to determine the
likelihood that a substring is a constituent, this model considers both
the likelihood of the substring itself as a constituent and the
likelihood of its context as the context of a constituent.

The other is a Dependency Model with Valence (DMV), which uses a form
of Dependency Grammar. Here sentence structure is represented as pairs
of heads and dependents (arguments), although it is also possible to
represent this as a CFG (see \citet[p.~106]{klein-thesis}).

Apart from the identity of the argument, probabilities are conditioned on
the direction of attachment. A valence condition is introduced by using
stopping probabilities: a head may take zero, one or more dependents
before ``stopping'', and there are different probabilities for
stopping without taking arguments (\emph{adjacently}) versus after
taking one or more (\emph{non-adjacently}).

Versions of the inside-outside algorithm for PCFGs \citep{lari-csl90}
are used for running expectation maximization on both these probabilistic
models. We tried implementing both the CCM and the DMV. Below we give
the details of how this was done, along with the results.
\section{The Constituent Context Model}
\subsection{Basics}
The CCM model \citep{klein-thesis, km-ccm} is a generative model over a set of Part Of Speech (POS) tags $\mathbb{T}$ and a beginning/end of sentence symbol $\diamond$.
The CCM model has two parameters:
\begin{itemize}
\item $P_{SPAN}(\alpha | c)$ is the probability of generating a span $\alpha$ given that it is/is not a constituent; $\alpha \in \mathbb{T}^*, c \in \{\text{true}, \text{false}\}$.
\item $P_{CONTEXT}(\beta | c)$ is the probability of generating a context $\beta$ given that the span it surrounds is/is not a constituent; $\beta \in (\mathbb{T}\cup\diamond)^2,c \in \{\text{true}, \text{false}\}$.
\end{itemize}
The probability of a sentence $s$ with bracketing $B$ is the probability of that bracketing times the probability of the span and the context for each substring of $s$. The probability of a bracketing is given by the fixed distribution $P_\text{bin}$, which assigns equal weight to all bracketings that correspond to binary trees. For a sentence $s$ of length $n$, we index the spaces between words, including the leading space ($0$) and the trailing space ($n$). Let $\langle i,j\rangle_s$ be the substring from $i$ to $j$, and $\rangle i,j\langle_s$ be the context of that substring. Then we have: $$P(s,B)=P_{\text{bin}}(B)\displaystyle\prod_{\{\langle i,j\rangle_s\}}P_{\text{SPAN}}(\langle i,j\rangle_s|B_{ij})P_{\text{CONTEXT}}(\rangle i,j\langle_s|B_{ij})$$
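
As an illustration of the span and context bookkeeping (not of the model itself), the following Python sketch enumerates every span of a POS-tagged sentence together with its context. The boundary constant stands in for $\diamond$, and the helper name is our own:

\begin{verbatim}
BOUNDARY = '<>'   # stands in for the diamond boundary symbol

def spans_and_contexts(tags):
    """Enumerate all spans <i,j> of a POS sequence with their contexts >i,j<."""
    n = len(tags)
    result = []
    for i in range(n + 1):
        for j in range(i, n + 1):
            span = tuple(tags[i:j])
            left = tags[i - 1] if i > 0 else BOUNDARY   # tag just before the span
            right = tags[j] if j < n else BOUNDARY      # tag just after the span
            result.append(((i, j), span, (left, right)))
    return result

# For the tag sequence DT NN VBD, the span <0,2> is ('DT', 'NN')
# and its context >0,2< is (BOUNDARY, 'VBD').
\end{verbatim}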
\subsection{Reestimation}
We use the EM algorithm to induce our model parameters. Given the parameters $P_{\text{SPAN}}$ and $P_{\text{CONTEXT}}$, we wish to first find the likelihoods of completions for all sentences in a corpus given the current parameter values, and then find new parameters that maximize the corpus likelihood given the previously predicted completions. Because enumerating all bracketings for all sentences is computationally infeasible, we use a dynamic programming algorithm similar to the inside-outside algorithm.

We begin by rewriting the probability $P(s,B)$ given above:
$$P(s,B)=P_{\text{bin}}(B)\displaystyle\prod_{\{\langle i,j\rangle_s\}}P_{\text{SPAN}}(\langle i,j\rangle_s|\text{false})P_{\text{CONTEXT}}(\rangle i,j\langle_s|\text{false}) \times$$$$ \prod_{\{\langle i,j\rangle_s\in B\}}\frac{P_{\text{SPAN}}(\langle i,j\rangle_s|\text{true})P_{\text{CONTEXT}}(\rangle i,j\langle_s|\text{true})}{P_{\text{SPAN}}(\langle i,j\rangle_s|\text{false})P_{\text{CONTEXT}}(\rangle i,j\langle_s|\text{false})}$$
We can then condense this by letting $$\phi(i,j,s)=\frac{P_{\text{SPAN}}(\langle i,j\rangle_s|\text{true})P_{\text{CONTEXT}}(\rangle i,j\langle_s|\text{true})}{P_{\text{SPAN}}(\langle i,j\rangle_s|\text{false})P_{\text{CONTEXT}}(\rangle i,j\langle_s|\text{false})}$$
Further, note that $P_{\text{bin}}$ is constant for all bracketings of a sentence. We can thus also write $$K(s)=P_{\text{bin}}(B)\displaystyle\prod_{\{\langle i,j\rangle_s\}}P_{\text{SPAN}}(\langle i,j\rangle_s|\text{false})P_{\text{CONTEXT}}(\rangle i,j\langle_s|\text{false})$$ and $K(s)$ is constant for a given sentence.
We thus have a condensed formulation of $P(s,B)$:
$$P(s,B)=K(s) \prod_{\{\langle i,j\rangle_s\in B\}}\phi(i,j,s)$$

In the spirit of the inside-outside algorithm, we can use a dynamic programming algorithm to calculate the product of $\phi(i,j,s)$ for a sentence. We wish to calculate the products of the $\phi(i,j,s)$ both inside and outside a given interval $\langle i,j\rangle$. Let $\mathcal{T}(m)$ be the set of binary tree structures with $m$ leaves. We then define $$I(i,j,s)=\displaystyle\sum_{T\in \mathcal{T}(j-i)}\prod_{\langle a,b\rangle:\langle a-i,b-i\rangle\in T}\phi(a,b,s)$$
To calculate $I$, we want to find a recursive decomposition. In the case where $j=i$, there can be no constituents between $i$ and $j$, so we have $I(i,i,s)=0$. In the case where $j=i+1$, there is necessarily exactly one constituent between $i$ and $j$, because all one-word substrings are constituents, so $I(i,i+1,s)=\phi(i,i+1,s)$. In the case where $j>i+1$, any constituent spanning from $i$ to $j$ must also contain subconstituents from $i$ to $k$ and from $k$ to $j$. Thus
$$I(i,j,s)=\displaystyle\sum_{i<k<j}\sum_{T\in\mathcal{T}(k-i)}\sum_{T'\in\mathcal{T}(j-k)}\phi(i,j,s)\prod_{\langle a,b\rangle:\langle a-i,b-i\rangle\in T}\phi(a,b,s)\prod_{\langle a',b'\rangle:\langle a'-k,b'-k\rangle\in T'}\phi(a',b',s)$$
$$=\phi(i,j,s)\displaystyle\sum_{i<k<j}\sum_{T\in\mathcal{T}(k-i)}\sum_{T'\in\mathcal{T}(j-k)}\prod_{\langle a,b\rangle:\langle a-i,b-i\rangle\in T}\phi(a,b,s)\prod_{\langle a',b'\rangle:\langle a'-k,b'-k\rangle\in T'}\phi(a',b',s)$$
$$=\phi(i,j,s)\displaystyle\sum_{i<k<j}I(i,k,s)I(k,j,s)$$

In summary, we have a recursive decomposition for $I$:
$$I(i,j,s) = \begin{cases}
\phi(i,j,s)\displaystyle\sum_{i<k<j}I(i,k,s)I(k,j,s) & \text{if } j-i>1 \\
\phi(i,j,s) & \text{if } j-i=1 \\
0 & \text{if } j-i = 0
\end{cases}$$
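
The recursion for $I$ translates directly into a bottom-up dynamic program. A minimal Python sketch, assuming a hypothetical helper \texttt{phi(i, j)} that returns $\phi(i,j,s)$ for the sentence at hand, might look as follows:

\begin{verbatim}
def inside_chart(n, phi):
    """Compute I(i,j) for all 0 <= i <= j <= n from the recursion above,
    shortest spans first.  phi(i, j) is a hypothetical helper returning
    phi(i,j,s) for the sentence at hand."""
    I = {}
    for length in range(0, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            if length == 0:
                I[i, j] = 0.0
            elif length == 1:
                I[i, j] = phi(i, j)
            else:
                I[i, j] = phi(i, j) * sum(I[i, k] * I[k, j]
                                          for k in range(i + 1, j))
    return I
\end{verbatim}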

Similarly, we define the outside probability
$$O(i,j,s) = \begin{cases}
\displaystyle\sum_{0\leq k<i}I(k,i,s)\phi(k,j,s)O(k,j,s) + \sum_{j<k\leq n} I(j,k,s)\phi(i,k,s)O(i,k,s) &
\text{if } j-i < n \\
1 & \text{if } j-i = n
\end{cases}$$

To reestimate the parameters $P_{\text{SPAN}}$ and $P_{\text{CONTEXT}}$, we wish to know $P_{\text{BRACKET}}(i,j,s)$, the fraction of bracketings of $s$ that contain $\langle i,j\rangle$ as a constituent:
$$P_{\text{BRACKET}}(i,j,s)=\frac{\displaystyle\sum_{\{B:B_{ij}=\text{true}\}}P(s,B)}{\displaystyle\sum_{B}P(s,B)}$$
We can rewrite this in terms of our inside and outside probabilities:
$$\displaystyle\sum_{\{B:B_{ij}=\text{true}\}}P(s,B)=K(s)I(i,j,s)O(i,j,s)$$

$$\displaystyle\sum_{B}P(s,B)=K(s)I(0,n,s)O(0,n,s)$$
But the $K(s)$ terms cancel, and $O(0,n,s)=1$, so we find that $$P_{\text{BRACKET}}(i,j,s)=\frac{I(i,j,s)O(i,j,s)}{I(0,n,s)}$$
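
The outside chart and $P_{\text{BRACKET}}$ can be computed in the same way; the sketch below reuses the \texttt{inside\_chart} and the hypothetical \texttt{phi} from above:

\begin{verbatim}
def outside_chart(n, phi, I):
    """Compute O(i,j) from the recursion above, longest spans first.
    I is the chart returned by inside_chart, phi as before."""
    O = {}
    for length in range(n, -1, -1):
        for i in range(0, n - length + 1):
            j = i + length
            if length == n:
                O[i, j] = 1.0
            else:
                O[i, j] = (sum(I[k, i] * phi(k, j) * O[k, j]
                               for k in range(0, i)) +
                           sum(I[j, k] * phi(i, k) * O[i, k]
                               for k in range(j + 1, n + 1)))
    return O

def p_bracket(i, j, I, O, n):
    """P_BRACKET(i,j,s) = I(i,j,s) O(i,j,s) / I(0,n,s)."""
    return I[i, j] * O[i, j] / I[0, n]
\end{verbatim}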

From $P_{\text{BRACKET}}$ we can calculate the next iteration of the model parameters $P_{\text{SPAN}}$ and $P_{\text{CONTEXT}}$. First, for a sentence $s$ of length $n_s$, note that the number of spans in the sentence is $N_s=(n_s+1)(n_s+2)/2$, which includes the spans of length 0. Furthermore, let $\#_s(\gamma)$ be the number of times that $\gamma$ appears in $s$, where $\gamma$ can be either a span or a context.

We begin with
$$P_{\text{SPAN}}(\alpha|c)=\displaystyle\sum_{s\in S}P(\alpha|c,s)P(s)=\frac{1}{|S|}\sum_{s\in S}P(\alpha|c,s)$$
This reduction works for $P_{\text{CONTEXT}}$ as well, so we have reduced the problem of calculating $P_{\text{SPAN}}$ and $P_{\text{CONTEXT}}$ on the corpus as a whole to calculating them for each sentence.

Let us begin by calculating $P_{\text{SPAN}}(\alpha|\text{true},s)$. By Bayes' Theorem, $$P_{\text{SPAN}}(\alpha|\text{true},s)=\frac{P_{\text{SPAN}}(\text{true}|\alpha,s)P_{\text{SPAN}}(\alpha|s)}{P_{\text{SPAN}}(\text{true}|s)}$$
We can compute each of the quantities on the right hand side:
$$P_{\text{SPAN}}(\alpha|s)=\frac{\#_s(\alpha)}{N_s}$$
$$P_{\text{SPAN}}(\text{true}|s)=\frac{2n_s-1}{N_s}$$
$$P_{\text{SPAN}}(\text{true}|\alpha,s)=\frac{1}{\#_s(\alpha)}\displaystyle\sum_{\langle i,j\rangle=\alpha}P_{\text{BRACKET}}(i,j,s)$$

Thus we find that
$$P_{\text{SPAN}}(\alpha|\text{true},s)=\frac{1}{\#_s(\alpha)}\displaystyle\sum_{\langle i,j\rangle=\alpha}P_{\text{BRACKET}}(i,j,s)\frac{\#_s(\alpha)}{N_s}\frac{N_s}{2n_s-1}=\frac{1}{2n_s-1}\displaystyle\sum_{\langle i,j\rangle=\alpha}P_{\text{BRACKET}}(i,j,s)$$

A similar calculation shows that $$P_{\text{CONTEXT}}(\beta|\text{true},s)=\frac{1}{2n_s-1}\displaystyle\sum_{\rangle i,j\langle=\beta}P_{\text{BRACKET}}(i,j,s)$$

Now let us consider $P_{\text{SPAN}}(\alpha|\text{false},s)$. Again by Bayes' Theorem we have $$P_{\text{SPAN}}(\alpha|\text{false},s)=\frac{P_{\text{SPAN}}(\text{false}|\alpha,s)P_{\text{SPAN}}(\alpha|s)}{P_{\text{SPAN}}(\text{false}|s)}$$

$P_{\text{SPAN}}(\alpha|s)$ is as before. $P_{\text{SPAN}}(\text{false}|s)=1-P_{\text{SPAN}}(\text{true}|s)$. And $P_{\text{SPAN}}(\text{false}|\alpha,s)=1-P_{\text{SPAN}}(\text{true}|\alpha,s)$. Thus
$$P_{\text{SPAN}}(\alpha|\text{false},s)=\frac
{\left(1-\frac{1}{\#_s(\alpha)}
\displaystyle\sum_{\langle i,j\rangle=\alpha}
P_{\text{BRACKET}}(i,j,s)\right)\frac
{\#_s(\alpha)}
{N_s}}
{1-\frac{2n_s-1}{N_s}}$$
$$=\frac
{\#_s(\alpha)-\displaystyle\sum_{\langle i,j\rangle=\alpha}P_{\text{BRACKET}}(i,j,s)}
{N_s-2n_s+1}$$

Similarly, we have $$P_{\text{CONTEXT}}(\beta|\text{false},s) = \frac{\#_s(\beta)-\displaystyle\sum_{\rangle i,j\langle=\beta}P_{\text{BRACKET}}(i,j,s)}{N_s-2n_s+1}$$

From these, we can calculate the new parameters $P_{\text{SPAN}}$ and $P_{\text{CONTEXT}}$ as above.
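
In the implementation, this reestimation is a loop over the corpus that accumulates the sums above. A per-sentence sketch for the constituent case, with a hypothetical helper \texttt{p\_bracket\_s} returning $P_{\text{BRACKET}}(i,j,s)$, is:

\begin{verbatim}
from collections import defaultdict

def span_counts_true(tags, p_bracket_s):
    """Per-sentence P_SPAN(alpha | true, s): sum P_BRACKET over the
    occurrences of each span type alpha, divided by 2*n_s - 1.
    p_bracket_s(i, j) is a hypothetical helper returning P_BRACKET(i,j,s);
    averaging the returned distributions over all sentences (plus the
    smoothing counts) gives the new P_SPAN(alpha | true)."""
    n = len(tags)
    dist = defaultdict(float)
    for i in range(n + 1):
        for j in range(i, n + 1):
            alpha = tuple(tags[i:j])
            dist[alpha] += p_bracket_s(i, j) / (2 * n - 1)
    return dist
\end{verbatim}

The context case is analogous, summing over contexts $\rangle i,j\langle$ instead of spans.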
\subsection{Initialization}
We initialize the model with a distribution $P_{\text{SPLIT}}$ over trees. $P_{\text{SPLIT}}$ is calculated as follows: for a string of $n$ words, choose a split point at random, then recursively build trees on either side of that split by further random splitting. We initialize the model using $P_{\text{BRACKET}}(i,j,s)=P_{\text{SPLIT}}(i,j,n_s)$. However, we also apply smoothing, adding small constant counts $m_\text{true}$ and $m_\text{false}$ of each span and context as a constituent and distituent, respectively. $m_\text{true}$ should be smaller than $m_\text{false}$, as there are more distituent spans than constituent spans. However, \citet{klein-thesis} and \citet{km-ccm} give different values for these constants, and in any case they must vary in relation to the size of the corpus used. Let $\#_\alpha$ and $\#_\beta$ be the number of unique spans and contexts in the corpus. We initialize the model as follows (a sketch of the $P_{\text{SPLIT}}$ computation is given after the list):
\begin{itemize}
\item $P_{\text{SPAN}}(\alpha|\text{true})= \frac {(\displaystyle\sum_{s\in S}\displaystyle\sum_{\langle i,j\rangle=\alpha}P_{\text{SPLIT}}(i,j,n_s))+m_\text{true}}
{(\displaystyle\sum_{s\in S}2n_s-1)+m_\text{true} \#_\alpha}$
\item $P_{\text{SPAN}}(\alpha|\text{false})= \frac {(\displaystyle\sum_{s\in S}\displaystyle\sum_{\langle i,j\rangle=\alpha}(1-P_{\text{SPLIT}}(i,j,n_s)))+m_\text{false}}
{(\displaystyle\sum_{s\in S}N_s-(2n_s-1))+m_\text{false} \#_\alpha}$
\item $P_{\text{CONTEXT}}(\beta|\text{true})= \frac {(\displaystyle\sum_{s\in S}\displaystyle\sum_{\rangle i,j\langle=\beta}P_{\text{SPLIT}}(i,j,n_s))+m_\text{true}}
{(\displaystyle\sum_{s\in S}2n_s-1)+m_\text{true} \#_\beta}$
\item $P_{\text{CONTEXT}}(\beta|\text{false})= \frac {(\displaystyle\sum_{s\in S}\displaystyle\sum_{\rangle i,j\langle=\beta}(1-P_{\text{SPLIT}}(i,j,n_s)))+m_\text{false}}
{(\displaystyle\sum_{s\in S}N_s-(2n_s-1))+m_\text{false} \#_\beta}$
\end{itemize}
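
As a sanity check on the initializer, $P_{\text{SPLIT}}(i,j,n_s)$ can be approximated by sampling from the random-splitting process just described; the following Monte Carlo sketch (our own, not part of the model) does exactly that:

\begin{verbatim}
import random
from collections import defaultdict

def sample_p_split(n, samples=10000):
    """Monte Carlo estimate of P_SPLIT(i,j,n): the fraction of randomly
    split binary trees over n words in which (i,j) occurs as a node."""
    counts = defaultdict(int)

    def split(i, j):
        counts[i, j] += 1
        if j - i <= 1:
            return
        k = random.randint(i + 1, j - 1)   # uniform split point
        split(i, k)
        split(k, j)

    for _ in range(samples):
        split(0, n)
    return {span: count / samples for span, count in counts.items()}
\end{verbatim}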
\subsection{Smoothing}
In addition to the smoothing used during initialization, some further smoothing is necessary during reestimation. In our initial calculations, we made the critical step of factoring out $\displaystyle\prod_{\{\langle i,j\rangle_s\}}P_{\text{SPAN}}(\langle i,j\rangle_s|\text{false})P_{\text{CONTEXT}}(\rangle i,j\langle_s|\text{false})$. This factorization assumes that $P_{\text{SPAN}}(\langle i,j\rangle_s|\text{false})$ and $P_{\text{CONTEXT}}(\rangle i,j\langle_s|\text{false})$ are nonzero for all spans and contexts. Particularly in the case of small corpora, this assumption is not always valid. For example, if a span $\alpha$ always appears as a full sentence--never as a proper substring--then its $P_{\text{SPAN}}(\alpha|\text{false})$ will be zero because a full sentence is never a distituent. Or if a word $a$ only appears as the second word of a sentence, then the context $(\diamond,a)$ will have $P_{\text{CONTEXT}}((\diamond,a)|\text{false})=0$ because the one-word span that it contains will always be a constituent.

To avoid this problem, we can use further smoothing during reestimation. During each reestimation of $P_{\text{SPAN}}$ and $P_{\text{CONTEXT}}$, we add to each span and context small constant counts of $m_\text{true}$ as a constituent and $m_\text{false}$ as a distituent.

\subsection{Results}
Our implementation of the CCM model has not yet been tested on the Wall Street Journal corpus. On small toy corpora, it behaves unpredictably. For instance, on the corpus consisting of the sentences $a b c$ and $b c a$, the model gives higher probabilities to the spans $a b$ and $c a$ as constituents than it does to $b c$. However, this is the opposite of what we would expect. We have not yet been able to identify the source of this problem.
\section{A Dependency Model with Valence}
\newcommand{\LOC}[1]{\textbf{#1}}
\newcommand{\GOR}[1]{\overrightarrow{#1}}
\newcommand{\RGOL}[1]{\overleftarrow{\overrightarrow{#1}}}
\newcommand{\SEAL}[1]{\overline{#1}}
\newcommand{\LGOR}[1]{\overrightarrow{\overleftarrow{#1}}}
\newcommand{\GOL}[1]{\overleftarrow{#1}}
\newcommand{\LN}[1]{\underleftarrow{#1}}
\newcommand{\RN}[1]{\underrightarrow{#1}}
\newcommand{\XI}{\lessdot}
\newcommand{\XJ}{\gtrdot}
\newcommand{\SMTR}[1]{\dddot{#1}}
\newcommand{\SDTR}[1]{\ddot{#1}}

The DMV is a \emph{head-outward process}. Given a certain head (e.g.\ a
verb), we first attach possible arguments in one direction (e.g.\ nouns
to the right), then stop in that direction, then possibly attach
arguments in the other direction, then stop. The probabilities of a
certain head attaching any arguments in a direction sum to one
\citep[see~$P(D(h))$~in][p.~87]{klein-thesis}; when modelled as a PCFG,
however \citep[p.~83,~figure~6.4]{klein-thesis}, the probabilities for
a given left-hand side symbol may sum to less than one.

In the following sections we try to flesh out the details of the
inside-outside algorithm \citep{lari-csl90} applied to the DMV model
of \citet{klein-thesis}. We have three parameters which are
reestimated using the inside-outside algorithm:

\begin{itemize}
\item $P_{ATTACH}(a|h,dir)$ is the probability of attaching an
argument with tag $a$, given that we are attaching in direction
$dir$ from head $h$. Here $a,h \in \mathbb{T},
dir\in\{left,right\}$.
\item $P_{STOP}(stop|h,dir,adj)$ is the probability of stopping (which
happens after argument attachment), given that we are attaching from
head $h$ in direction $dir$. Adjacency $adj$ is whether or not we
have attached anything yet in that direction.
\item $P_{ROOT}(h)$ is the probability of the special ROOT symbol
heading $h$; in the original model this is modelled using a right
stop, a left stop, and an attachment
\citep[p.~105,~figure~A.2]{klein-thesis}; we conflated these
into one additional parameter.
\end{itemize}

There is also a parameter $P_{ORDER}$, signifying the probability of a
head $w$ attaching to the right first or to the left first. In the
results given below,
\[ \forall{w}[P_{ORDER}(right\text{-}first|w)=1.0] \]
(this parameter is not reestimated).
\subsection{Note on notation}
$i, j, k$ are sentence positions (between words), where $i$ and $j$
are always the start and end, respectively, of what we are calculating
($k$ is between $i$ and $j$ for $P_{INSIDE}$, to their right or left
for $P_{OUTSIDE}$). $s \in S$ are sentences in the corpus. $\LOC{w}$
is a word token (actually a POS token) of type $w$ at a certain sentence
location. If $\LOC{w}$ is between $i$ and $i+1$, then $loc(\LOC{w})=i$,
following \citet{klein-thesis}, meaning $i$ is adjacent to $\LOC{w}$
on the left, while $j=loc(\LOC{w})+1$ means that $j$ is adjacent to
$\LOC{w}$ on the right. To simplify, $loc_l(\LOC{w}):=loc(\LOC{w})$ and
$loc_r(\LOC{w}):=loc(\LOC{w})+1$. We write $\LOC{h}$ if this is a head
in the rule being used, and $\LOC{a}$ if it is an attached argument.

There are some notational differences between the thesis
\citep{klein-thesis} and the ACL paper \citep{km-dmv}:

\begin{tabular}{cc}
Paper: & Thesis: \\
$w$ & $\GOR{w}$ \\
$w\urcorner$ & $\RGOL{w}$ \\
$\ulcorner{w}\urcorner$ & $\SEAL{w}$ \\
\end{tabular}

We use $\SMTR{w}$ (or $\SDTR{w}$) to signify one of either $w, \GOR{w},
\RGOL{w}, \LGOR{w}, \GOL{w}$ or $\SEAL{w}$\footnote{This means that
$\SMTR{\LOC{w}}$ is the triplet of the actual POS tag, its sentence
location as a token, and the ``level of seals''.}.

\subsection{Inside probabilities}
$P_{INSIDE}$ is defined in \citet[pp.~106--108]{klein-thesis}; the only
thing we need to add is that for right attachments,
$i \leq loc_l(\LOC{w})<k \leq loc_l(\LOC{a})<j$, while for left attachments,
$i \leq loc_l(\LOC{a})<k \leq loc_l(\LOC{w})<j$.

\subsubsection{Sentence probability}
$P_s$ is the sentence probability, based on
\citet[p.~38]{lari-csl90}. Since the ROOT rules are different from the
rest, we sum over them explicitly in this definition:
\begin{align*}
P_s = \sum_{\LOC{w} \in s} P_{ROOT}(\LOC{w}) P_{INSIDE}(\SEAL{\LOC{w}}, 0, n_s)
\end{align*}
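
Given implementations of $P_{INSIDE}$ and $P_{ROOT}$, the sentence probability is a direct sum over word tokens. A minimal sketch, with hypothetical \texttt{inside} and \texttt{p\_root} helpers standing in for the chart lookup and the root parameter, is:

\begin{verbatim}
def sentence_probability(tags, inside, p_root):
    """P_s = sum over tokens w in s of P_ROOT(w) * P_INSIDE(sealed w, 0, n_s).
    inside(tag, loc, 'SEAL', i, j) and p_root(tag) are hypothetical
    stand-ins for our chart lookup and root parameter."""
    n = len(tags)
    return sum(p_root(w) * inside(w, loc, 'SEAL', 0, n)
               for loc, w in enumerate(tags))
\end{verbatim}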

\subsection{Outside probabilities}

\begin{align*}
P_{OUTSIDE_s}(ROOT, i, j) = \begin{cases}
1.0 & \text{ if $i = 0$ and $j = n_s$,}\\
0.0 & \text{ otherwise}
\end{cases}
\end{align*}

For $P_{OUTSIDE}(\SEAL{w}, i, j)$, $w$ is attached to by something
else ($\SEAL{w}$ is what we elsewhere call $\SEAL{a}$). Adjacency is
thus calculated on the basis of $h$, the head of the rule. If we are
attached to from the left we have $i \leq loc_l(\LOC{w}) < j \leq loc_l(\LOC{h}) < k$, while
from the right we have $k \leq loc_l(\LOC{h}) < i \leq loc_l(\LOC{w}) < j$:
\begin{align*}
P_{OUTSIDE}&(\SEAL{\LOC{w}}, i, j) = \\
& P_{ROOT}(w) P_{OUTSIDE}(ROOT, i, j) + \\
& [ \sum_{k > j} ~ \sum_{\LOC{h}:j\leq loc_l(\LOC{h})<k} \sum_{\SMTR{\LOC{h}} \in \{\RGOL{\LOC{h}},\GOL{\LOC{h}}\}} P_{STOP}(\neg stop|h, left, adj(j, \LOC{h})) P_{ATTACH}(w|h, left) \\
& \qquad \qquad \qquad \qquad \qquad P_{OUTSIDE}(\SMTR{\LOC{h}}, i, k) P_{INSIDE}(\SMTR{\LOC{h}}, j, k) ] ~ + \\
& [ \sum_{k < i} ~ \sum_{\LOC{h}:k\leq loc_l(\LOC{h})<i} \sum_{\SMTR{\LOC{h}} \in \{\LGOR{\LOC{h}},\GOR{\LOC{h}}\}} P_{STOP}(\neg stop|h, right, adj(i, \LOC{h})) P_{ATTACH}(w|h, right) \\
& \qquad \qquad \qquad \qquad \qquad P_{INSIDE}(\SMTR{\LOC{h}}, k, i) P_{OUTSIDE}(\SMTR{\LOC{h}}, k, j) ]
\end{align*}

For $\RGOL{w}$ we know it is either under a left stop rule or it is
the right daughter of a left attachment rule ($k \leq loc_l(\LOC{a}) <
i \leq loc_l(\LOC{w}) < j$), and these are adjacent if the start point
($i$) equals $loc_l(\LOC{w})$:
\begin{align*}
P_{OUTSIDE}(\RGOL{\LOC{w}}, i, j) = & P_{STOP}(stop|w, left, adj(i,
\LOC{w}))P_{OUTSIDE}(\SEAL{\LOC{w}}, i, j) ~ + \\
& [ \sum_{k < i} ~ \sum_{\LOC{a}:k\leq loc_l(\LOC{a})<i} P_{STOP}(\neg stop|w, left, adj(i, \LOC{w})) P_{ATTACH}(a|w, left) \\
& ~~~~~~~~~~~~~~~~~~~~~~~~~ P_{INSIDE}(\SEAL{\LOC{a}}, k, i) P_{OUTSIDE}(\RGOL{\LOC{w}}, k, j) ]
\end{align*}

For $\GOR{w}$ we are either under a right stop or the left daughter of
a right attachment rule ($i \leq loc_l(\LOC{w}) < j \leq
loc_l(\LOC{a}) < k$), adjacent iff the end point ($j$) equals
$loc_r(\LOC{w})$:
\begin{align*}
P_{OUTSIDE}(\GOR{\LOC{w}}, i, j) = & P_{STOP}(stop|w, right, adj(j,
\LOC{w}))P_{OUTSIDE}(\RGOL{\LOC{w}}, i, j) ~ + \\
& [ \sum_{k > j} ~ \sum_{\LOC{a}:j\leq loc_l(\LOC{a})<k} P_{STOP}(\neg stop|w, right, adj(j, \LOC{w})) P_{ATTACH}(a|w, right) \\
& ~~~~~~~~~~~~~~~~~~~~~~~~~ P_{OUTSIDE}(\GOR{\LOC{w}}, i, k) P_{INSIDE}(\SEAL{\LOC{a}}, j, k) ]
\end{align*}

$\GOL{w}$ is just like $\RGOL{w}$, except for the outside probability
of having a stop above, where we use $\LGOR{w}$:
\begin{align*}
P_{OUTSIDE}(\GOL{\LOC{w}}, i, j) = & P_{STOP}(stop|w, left, adj(i,
\LOC{w}))P_{OUTSIDE}(\LGOR{\LOC{w}}, i, j) ~ + \\
& [ \sum_{k < i} ~ \sum_{\LOC{a}:k\leq loc_l(\LOC{a})<i} P_{STOP}(\neg stop|w, left, adj(i, \LOC{w})) P_{ATTACH}(a|w, left) \\
& ~~~~~~~~~~~~~~~~~~~~~~~~~ P_{INSIDE}(\SEAL{\LOC{a}}, k, i) P_{OUTSIDE}(\GOL{\LOC{w}}, k, j) ]
\end{align*}

$\LGOR{w}$ is just like $\GOR{w}$, except for the outside probability
of having a stop above, where we use $\SEAL{w}$:
\begin{align*}
P_{OUTSIDE}(\LGOR{\LOC{w}}, i, j) = & P_{STOP}(stop|w, right, adj(j,
\LOC{w}))P_{OUTSIDE}(\SEAL{\LOC{w}}, i, j) ~ + \\
& [ \sum_{k > j} ~ \sum_{\LOC{a}:j\leq loc_l(\LOC{a})<k} P_{STOP}(\neg stop|w, right, adj(j, \LOC{w})) P_{ATTACH}(a|w, right) \\
& ~~~~~~~~~~~~~~~~~~~~~~~~~ P_{OUTSIDE}(\LGOR{\LOC{w}}, i, k) P_{INSIDE}(\SEAL{\LOC{a}}, j, k) ]
\end{align*}

\subsection{DMV Reestimation}
$P_{INSIDE}$ and $P_{OUTSIDE}$ (computed over the corpus given a certain
probability distribution) give us the expected counts we need to reestimate our
model parameters.

\subsubsection{$c$ and $w$}
First we need some helper functions.
$c_s(\SMTR{\LOC{w}} : i, j)$ is ``the expected fraction of parses of
$s$ with a node labeled $\SMTR{w}$ extending from position $i$ to
position $j$'' \citep[p.~88]{klein-thesis}, here defined to equal
$v_{q}$ of \citet[p.~41]{lari-csl90}\footnote{In terms of regular EM,
this is the count of trees ($f_{T_q}(x)$ in
\citet[p.~46]{prescher-em}) in which the node extended from $i$ to
$j$.}:
\begin{align*}
c_s(\SMTR{\LOC{w}} : i, j) = P_{INSIDE_s}(\SMTR{\LOC{w}}, i, j) P_{OUTSIDE_s}(\SMTR{\LOC{w}}, i, j) / P_s
\end{align*}

$w_s$ is $w_{q}$ from \citet[p.~41]{lari-csl90}, generalized to $\SMTR{h}$ and $dir$:
\begin{align*}
w_s(\SEAL{a} & : \SMTR{\LOC{h}}, left, i, j) = \\
& 1/P_s \sum_{k:i<k<j} ~ \sum_{\LOC{a}:i\leq loc_l(\LOC{a})<k}
& P_{STOP}(\neg stop|h, left, adj(k, \LOC{h})) P_{ATTACH}(a|h, left) \\
& & P_{INSIDE_s}(\SEAL{\LOC{a}}, i, k) P_{INSIDE_s}(\SMTR{\LOC{h}}, k, j) P_{OUTSIDE_s}(\SMTR{\LOC{h}}, i, j)
\end{align*}
\begin{align*}
w_s(\SEAL{a} & : \SMTR{\LOC{h}}, right, i, j) = \\
& 1/P_s \sum_{k:i<k<j} ~ \sum_{\LOC{a}:k\leq loc_l(\LOC{a})<j}
& P_{STOP}(\neg stop|h, right, adj(k, \LOC{h})) P_{ATTACH}(a|h, right) \\
& & P_{INSIDE_s}(\SMTR{\LOC{h}}, i, k) P_{INSIDE_s}(\SEAL{\LOC{a}}, k, j) P_{OUTSIDE_s}(\SMTR{\LOC{h}}, i, j)
\end{align*}

Let $\hat{P}$ be the new STOP/ATTACH-probabilities (since the old $P$
are used in $P_{INSIDE}$ and $P_{OUTSIDE}$).

\subsubsection{Attachment reestimation}

This is based on $\hat{a}$, given in \citet[p.~41]{lari-csl90}.

When attaching to the left,
$i<loc_l(\LOC{h})$ since we want trees with at least one attachment:
\begin{align*}
\hat{P}_{ATTACH} (a | h, left) = \frac
{ \sum_{s \in S} \sum_{\LOC{h} \in s} \sum_{i<loc_l(\LOC{h})} \sum_{j\geq loc_r(\LOC{h})} [w_s(\SEAL{a} : \GOL{\LOC{h}}, left, i, j) + w_s(\SEAL{a} : \RGOL{\LOC{h}}, left, i, j) ]}
{ \sum_{s \in S} \sum_{\LOC{h} \in s} \sum_{i<loc_l(\LOC{h})} \sum_{j\geq loc_r(\LOC{h})} [c_s(\GOL{\LOC{h}} : i, j) + c_s(\RGOL{\LOC{h}} : i, j) ]}
\end{align*}

Below, $j>loc_r(\LOC{h})$ since we want at least one attachment to the right:
\begin{align*}
\hat{P}_{ATTACH} (a | h, right) = \frac
{ \sum_{s \in S} \sum_{\LOC{h} \in s} \sum_{i\leq loc_l(\LOC{h})} \sum_{j>loc_r(\LOC{h})} [w_s(\SEAL{a} : \LGOR{\LOC{h}}, right, i, j) + w_s(\SEAL{a} : \GOR{\LOC{h}}, right, i, j) ]}
{ \sum_{s \in S} \sum_{\LOC{h} \in s} \sum_{i\leq loc_l(\LOC{h})} \sum_{j>loc_r(\LOC{h})} [c_s(\LGOR{\LOC{h}} : i, j) + c_s(\GOR{\LOC{h}} : i, j) ]}
\end{align*}

For the first/lowest attachments, $w_s$ and $c_s$ have zero probability
where $i<loc_l(\LOC{h})$ (for $\GOR{h}$) or $j>loc_r(\LOC{h})$ (for $\GOL{h}$);
this is implicit in $P_{INSIDE}$.

\subsubsection{Stop reestimation}
The following is based on \citet[p.~88]{klein-thesis}. For the
non-adjacent rules, $i<loc_l(\LOC{h})$ on the left and
$j>loc_r(\LOC{h})$ on the right, while for the adjacent rules these
are equal (respectively)\footnote{For left-stopping with right-first
attachments, $j \geq loc_r(\LOC{h})$ since we may have
right-attachments below; for left-stopping with left-first
attachments we only need $j=loc_r(\LOC{h})$.}.

To avoid some redundancy below, define a helper function $\hat{d}$ as follows:
\begin{align*}
\hat{d}(\SMTR{h},\XI,\XJ) =
{ \sum_{s \in S} \sum_{\SMTR{\LOC{h}}:\LOC{h} \in s} \sum_{i:i \XI loc_l(\LOC{h})} \sum_{j:j \XJ loc_r(\LOC{h})} c_s(\SMTR{\LOC{h}} : i, j) }
\end{align*}

Then these are our reestimated stop probabilities:
\begin{align*}
\hat{P}_{STOP} (STOP|h, left, non\text{-}adj) = \frac
{ \hat{d}(\SEAL{h},<,\geq) + \hat{d}(\LGOR{h},<,=) }
{ \hat{d}(\RGOL{h},<,\geq) + \hat{d}(\GOL{h},<,=) }
\end{align*}

\begin{align*}
\hat{P}_{STOP} (STOP|h, left, adj) = \frac
{ \hat{d}(\SEAL{h},=,\geq) + \hat{d}(\LGOR{h},=,=) }
{ \hat{d}(\RGOL{h},=,\geq) + \hat{d}(\GOL{h},=,=) }
\end{align*}

\begin{align*}
\hat{P}_{STOP} (STOP|h, right, non\text{-}adj) = \frac
{ \hat{d}(\RGOL{h},=,>) + \hat{d}(\SEAL{h},\leq,>) }
{ \hat{d}(\GOR{h},=,>) + \hat{d}(\LGOR{h},\leq,>) }
\end{align*}

\begin{align*}
\hat{P}_{STOP} (STOP|h, right, adj) = \frac
{ \hat{d}(\RGOL{h},=,=) + \hat{d}(\SEAL{h},\leq,=) }
{ \hat{d}(\GOR{h},=,=) + \hat{d}(\LGOR{h},\leq,=) }
\end{align*}

\subsubsection{Root reestimation}
Following \citet[p.~46]{prescher-em}, to find the reestimated
probability of a PCFG rule, we first find the new treebank frequencies
$f_{T_P}(tree)=P(tree)/P_s$, then for a rule $X' \rightarrow X$ we
divide the new frequencies of the trees which use this rule by
those of the trees containing the node $X'$. $ROOT$ appears once per
tree, meaning we divide by $1$ per sentence\footnote{Assuming each
tree has frequency $1$.}, so $\hat{P}_{ROOT}(h)=\sum_{tree:ROOT
\rightarrow \SEAL{h} \text{ used in } tree} f_{T_P}(tree)=\sum_{tree:ROOT
\rightarrow \SEAL{h} \text{ used in } tree} P(tree)/P_s$, which turns into:

\begin{align*}
\hat{P}_{ROOT} (h) = \frac
{\sum_{s\in S} 1 / P_s \cdot \sum_{\LOC{h}\in s} P_{ROOT}(\LOC{h}) P_{INSIDE_s}(\SEAL{h}, 0, n_s)}
{|S|}
\end{align*}
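
Concretely, this can be accumulated in a single pass over the corpus; a sketch under the same assumptions as the sentence-probability sketch above (hypothetical \texttt{inside} and \texttt{p\_root} helpers) is:

\begin{verbatim}
from collections import defaultdict

def reestimate_root(corpus, inside, p_root):
    """New P_ROOT(h) = (1/|S|) * sum_s (1/P_s) *
       sum over tokens of h in s of P_ROOT(h) * P_INSIDE(sealed h, 0, n_s).
    `inside` and `p_root` are the same hypothetical helpers as before."""
    totals = defaultdict(float)
    for tags in corpus:
        n = len(tags)
        P_s = sentence_probability(tags, inside, p_root)
        for loc, h in enumerate(tags):
            totals[h] += p_root(h) * inside(h, loc, 'SEAL', 0, n) / P_s
    return {h: total / len(corpus) for h, total in totals.items()}
\end{verbatim}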

\subsection{Alternate CNF-like rules}
Since the IO algorithm as described in \citet{lari-csl90} is made for
rules in Chomsky Normal Form (CNF), we have an alternate grammar
(table \ref{tab:cnf-like}) for running tests, where we don't have to sum
over the different $loc(h)$ in IO. This is not yet generalized to
include left-first attachment. It is also not quite CNF, since it
includes some unary rewrite rules.

\begin{table*}[ht]
\centering
\begin{tabular} % four left-aligned math tabs, one vertical line
{ >{$}l<{$} >{$}l<{$} >{$}l<{$} | >{$}l<{$} }
\multicolumn{3}{c}{Rule} & \multicolumn{1}{c}{$P_{RULE}$ ($a[i,j,k]$ in \citet{lari-csl90})}\\
\hline{}
\RN{\GOR{h}} \rightarrow& \GOR{h} &\SEAL{a} &P_{STOP}(\neg stop|h, right, adj) \cdot P_{ATTACH}(a|h, right) \\
&&&\\
\RN{\GOR{h}} \rightarrow& \RN{\GOR{h}} &\SEAL{a} &P_{STOP}(\neg stop|h, right, non\text{-}adj) \cdot P_{ATTACH}(a|h, right) \\
&&&\\
\RGOL{h} \rightarrow& \GOR{h} &STOP &P_{STOP}(stop|h, right, adj) \\
&&&\\
\RGOL{h} \rightarrow& \RN{\GOR{h}} &STOP &P_{STOP}(stop|h, right, non\text{-}adj) \\
&&&\\
\LN{\RGOL{h}} \rightarrow& \SEAL{a} &\RGOL{h} &P_{STOP}(\neg stop|h, left, adj) \cdot P_{ATTACH}(a|h, left) \\
&&&\\
\LN{\RGOL{h}} \rightarrow& \SEAL{a} &\LN{\RGOL{h}} &P_{STOP}(\neg stop|h, left, non\text{-}adj) \cdot P_{ATTACH}(a|h, left) \\
&&&\\
\SEAL{h} \rightarrow& STOP &\RGOL{h} &P_{STOP}(stop|h, left, adj) \\
&&&\\
\SEAL{h} \rightarrow& STOP &\LN{\RGOL{h}} &P_{STOP}(stop|h, left, non\text{-}adj) \\
\end{tabular}
\caption{Alternate CFG rules (where a child node has an arrow below,
we use non-adjacent probabilities), defined for all words/POS-tags
$h$.}\label{tab:cnf-like}
\end{table*}

The inside probabilities are the same as those given in
\citet{lari-csl90}, with the following exceptions:

When calculating $P_{INSIDE}(\SMTR{h}, i, j)$ and summing through
possible rules which rewrite $\SMTR{h}$, if a rule is of the form
$\SMTR{h} \rightarrow STOP ~ \SDTR{h}$ or $\SMTR{h} \rightarrow
\SDTR{h} ~ STOP$, we add $P_{RULE}\cdot P_{INSIDE}(\SDTR{h}, i, j)$
(that is, rewrite for the same sentence range); and, as a consequence
of these unary rules: for ``terminal rules'' ($P_{ORDER}$) to be
applicable, not only must $i = j-1$, but also the left-hand side
symbol of the rule must be of the form $\GOR{h}$.

Similarly, the outside probabilities are the same as those for pure
CNF rules, with the exception that we add the unary rewrite
probabilities
\begin{align*}
\sum_{\SMTR{h}} [&P_{OUTSIDE}(\SMTR{h},i,j)\cdot P_{RULE}(\SMTR{h} \rightarrow \SDTR{h} ~ STOP) \\
+ &P_{OUTSIDE}(\SMTR{h},i,j)\cdot P_{RULE}(\SMTR{h} \rightarrow STOP ~ \SDTR{h})]
\end{align*}
to $P_{OUTSIDE}(\SDTR{h},i,j)$ (e.g.\ $f(s,t,i)$).

This grammar gave the same results for inside and outside
probabilities when run over our corpus.

\subsection{Initialization}
\citet{klein-thesis} describes DMV initialization using a ``harmonic
distribution'' for the initial probabilities, where the probability of
one word heading another is higher if they appear closer to one
another.

There are several ways this could be implemented. We initialized attachment
probabilities with the following formula:

\begin{align*}
P_{ATTACH}(a|h,right) = \frac
{\sum_{s \in S}\sum_{\LOC{h} \in s} \sum_{\LOC{a} \in s:loc(\LOC{a})>loc(\LOC{h})} 1/(loc(\LOC{a})-loc(\LOC{h})) + C_A}
{\sum_{s \in S}\sum_{\LOC{h} \in s} \sum_{\LOC{w} \in s:loc(\LOC{w})>loc(\LOC{h})} 1/(loc(\LOC{w})-loc(\LOC{h})) + C_A}
\end{align*}
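
A sketch of how such harmonic attachment counts could be collected before normalization is given below; note that the placement of $C_A$ relative to the sums is ambiguous in the formula above, and this sketch simply adds it once per head--argument pair:

\begin{verbatim}
from collections import defaultdict

def harmonic_attach_counts(corpus, C_A=0.0):
    """Weighted counts for the harmonic right-attachment initializer:
    each (head, argument) pair to the right contributes 1/distance.
    Adding C_A once per pair is an assumption of this sketch."""
    num = defaultdict(float)   # (head tag, argument tag) -> weighted count
    den = defaultdict(float)   # head tag -> total weighted count
    for tags in corpus:
        for h_loc, h in enumerate(tags):
            for a_loc in range(h_loc + 1, len(tags)):
                weight = 1.0 / (a_loc - h_loc) + C_A
                num[h, tags[a_loc]] += weight
                den[h] += weight
    return num, den

def p_attach_right(a, h, num, den):
    """Initial P_ATTACH(a | h, right) from the harmonic counts."""
    return num[h, a] / den[h] if den[h] > 0 else 0.0
\end{verbatim}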

The probability of stopping adjacently (left or right) was increased
whenever a word occurred at a (left or right) sentence
border\footnote{For non-adjacent stopping we checked for occurrence at
the second(-to-last) position.}:

\begin{align*}
f(stop:\LOC{h},left,adj)=\begin{cases}
C_S \text{, if } loc(\LOC{h}) = 0,\\
0 \text{, otherwise}
\end{cases}
\end{align*}

\begin{align*}
P_{STOP}(stop|h,left,adj) = \frac
{C_{M} + \sum_{s \in S}\sum_{\LOC{h} \in s} f(stop:\LOC{h},left,adj)}
{C_{M} + \sum_{s \in S}\sum_{\LOC{h} \in s} C_S+C_N}
\end{align*}

Tweaking the initialization constants $C_A, C_M, C_S$ and $C_N$
allowed us to easily try different initial distributions (e.g.\ to modify
the initial grammar in the direction of a uniform distribution). The
next section discusses the effect of these.

\subsection{Results}
We compared the results of the implementation with a dependency-parsed
version of the WSJ-10 corpus (converted from the manually annotated
version using a Perl script by Valentin Jijkoun). Since single-word
sentences were not POS-tagged there, these were skipped. Also, the
dependency-parsed WSJ-10 did not have ROOT nodes, so we calculated
precision and recall both without counting our ROOT links, and with
ROOT links counted, by adding these to the gold parses where
possible\footnote{221 parses in the dependency-parsed WSJ-10 had
several tokens appearing as heads without appearing as dependents in
the same parse; we skipped these parses when calculating with
ROOT links. Our gold standard also sometimes (for 1249 sentences)
had one dependent with two heads; these were skipped in
evaluation. We have not yet had time to run an evaluation of undirected
dependencies.}.
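
For reference, directed dependency precision, recall and F1 can be computed by comparing sets of (head position, dependent position) links per sentence; the following simplified sketch (not our exact evaluation script) shows the idea:

\begin{verbatim}
def directed_f1(gold_links, guess_links):
    """Directed dependency F1 for one sentence.  Both arguments are sets of
    (head position, dependent position) pairs; ROOT links can be included
    or left out on both sides, as discussed above."""
    correct = len(gold_links & guess_links)
    precision = correct / len(guess_links) if guess_links else 0.0
    recall = correct / len(gold_links) if gold_links else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
\end{verbatim}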

\begin{table*}[hb]
\centering
\begin{tabular}{l|cc}
Model & \multicolumn{2}{c}{F1, directed} \\
\hline
LBRANCH/RHEAD & \multicolumn{2}{c}{33.6} \\
RANDOM & \multicolumn{2}{c}{30.1} \\
RBRANCH/LHEAD & \multicolumn{2}{c}{24.0} \\
K\&M's DMV & \multicolumn{2}{c}{43.2} \\[3pt]
Our DMV: & no ROOT & ROOT added \\
\hline
Uniform initial distribution & 24.9 & 22.1 \\
$C_A=0; C_S=1;C_N=0.1;C_M=1$ & 25.2 & 24.9 \\
$C_A=0; C_S=1;C_N=0.1;C_M=10$ & 26.9 & 31.9 \\
% $C_A=10;C_S=1;C_N=0.1;C_M=10$ & 25.0 & 30.1 \\
$C_A=10;C_S=1;C_N=3 ;C_M=10$ & 26.3 & 31.4 \\
$C_A=15;C_S=3;C_N=1 ;C_M=20$ & 27.2 & 32.2 \\
\end{tabular}
\caption{DMV results on the WSJ-10 for various initialization values (the right-hand column counts ROOT links added to the gold parses).}
\label{tab:dmv-wsj}
\end{table*}

We tried various values for the initialization constants, but it was
hard to find any clear pattern for what worked best.
%$C_N$ values of 10 or less worked well,
% todo

Table \ref{tab:dmv-wsj} shows the results of running our
implementation on the WSJ-10 corpus, compared with the dependency-parsed
version, for given values of the initialization constants. The
EM hilltop seemed to be reached after about 30 iterations for our best
results (sooner for the worse results). As the table indicates
(especially the numbers for ``rooted'' evaluation), different initial
settings can lead to very different results\footnote{One alternative
idea we still haven't tried is basing initial DMV stopping
probabilities on \emph{distance} from sentence start/end, instead of
the ``binary'' test given above.}.

The numbers are not completely comparable though, since our corpus was
6268 sentences, while \citet{km-dmv} report 7422 sentences; also, it is
not clear whether their DMV experiment was run using automatically
induced word classes \citep[Schütze, 1995, in][p.~8]{km-dmv} or on the
tagset used to manually annotate the WSJ-10.

% underproposed = # $C_A=10;C_S=1;C_N=3 ;C_M=10$
% {(('NN', 1), ('DT', 0)): 347,
% (('NN', 2), ('DT', 0)): 148,
% (('NN', 3), ('DT', 2)): 136,
% (('NN', 4), ('DT', 3)): 144,
% (('NN', 5), ('DT', 4)): 128,
% (('NN', 6), ('DT', 5)): 124,
% (('NNP', 1), ('NNP', 0)): 358,
% (('NNP', 2), ('NNP', 0)): 125,
% (('NNP', 2), ('NNP', 1)): 174,
% (('NNS', 1), ('JJ', 0)): 124,
% (('ROOT', -1), ('NN', 0)): 100,
% (('ROOT', -1), ('NN', 1)): 106,
% (('ROOT', -1), ('NNP', 1)): 140,
% (('ROOT', -1), ('NNP', 2)): 104,
% (('VBD', 2), ('NN', 1)): 145,
% (('VBP', 2), ('NNS', 1)): 111,
% (('VBZ', 2), ('NN', 1)): 152,
% (('VBZ', 2), ('NNP', 1)): 107,
% (('VBZ', 3), ('NN', 2)): 109, }

In the stochastic grammars given by the model after EM, the POS class
$VBD$ (past tense verb) had the highest probability of being attached
to by ROOT, followed by $VBZ$ (3sg present) and $VBP$ (non-3sg
present), which looks promising. Interestingly, in fourth place we see
the tag $:$, and in sixth, $``$. Results might have been different
with such tokens stripped from the corpus and the evaluation standard;
we have not tried this yet\footnote{A ROOT attachment to $:$ does make
sense though, in the same way that a comma might be the head of a
conjunction. Our evaluation standard almost always
% 97 per cent of the time..
puts colons as dependents of the phrase head (often an NP
fragment).}.

Just like \citet[p.~89]{klein-thesis}, our algorithm chose determiners
to be the heads of nouns, rather than the opposite.
% Also, in general
% non-adjacent stopping was a lot more likely than adjacent stopping, a
% situation given by the DMV model.

% \section{The combined model (?)}
% \subsection{Results (?)}

\section{Conclusion}
From our implementations it seems that these processes capture some of
the statistical regularities in language, e.g.\ the strong likelihood of
verbal sentence heads; however, our evaluations show that we did not
manage to replicate Klein \& Manning's parsing results, let alone
improve on them.

\nocite{lari-csl90}
\nocite{klein-thesis}
\nocite{km-dmv}
\bibliography{./statistical.bib}
\bibliographystyle{plainnat}

\end{document}