% merged Kevin's & Emily's reports 2008-09-21 Sun 19:04 with Emerge

\documentclass[11pt,a4paper]{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{hyperref}
\usepackage{natbib}

\usepackage{pslatex}
\usepackage{pdfsync}
\pdfoutput=1

\usepackage{qtree}
\usepackage{amsmath}
\usepackage{amssymb}

\usepackage{avm}
\avmfont{\sc}
\avmoptions{sorted,active}
\avmvalfont{\rm}
\avmsortfont{\scriptsize\it}

\usepackage{array} % for a tabular with math in it

\title{The DMV and CCM models}
\author{Emily Morgan \& Kevin Brubeck Unhammer}
\date{15 September 2008}

\begin{document}

\maketitle

\tableofcontents

\section{Introduction}
\citet{klein-thesis} describes two models for the unsupervised
learning of syntactic structure. One is the Constituent Context Model
(CCM), which uses the context as well as the content of a substring
to determine whether it is a constituent. The context of a substring
is defined to be the words immediately preceding and following the
substring. The intuition is that context can be a reliable cue for
constituency: when a long phrase is a constituent, there is often a
similar case where a single word serves as a constituent in the same
context. Thus, to determine the likelihood that a substring is a
constituent, this model considers both the likelihood of the
substring itself as a constituent and the likelihood of its context
as the context of a constituent.

The other is the Dependency Model with Valence (DMV), which uses a
form of Dependency Grammar. Here sentence structure is represented as
pairs of heads and dependents (arguments), although it is also
possible to represent this as a CFG (see \citet[p.~106]{klein-thesis}).

Apart from the identity of the argument, probabilities are
conditional on the direction of attachment. A valence condition is
introduced by using stopping probabilities: a head may take zero, one
or more dependents before ``stopping'', and there are different
probabilities for stopping without taking arguments
(\emph{adjacently}) versus after taking one or more
(\emph{non-adjacently}).

Versions of the inside-outside algorithm for PCFGs \citep{lari-csl90}
are used for running expectation maximization (EM) on both of these
probabilistic models. We tried implementing both the CCM and the
DMV. Below we give the details of how this was done, along with the
results.

\section{The Constituent Context Model}
\subsection{Basics}
The CCM \citep{klein-thesis, km-ccm} is a generative model over a set of part-of-speech (POS) tags $\mathbb{T}$ and a beginning/end-of-sentence symbol $\diamond$.
The model has two parameters:
\begin{itemize}
\item $P_{SPAN}(\alpha | c)$ is the probability of generating a span $\alpha$ given that it is/is not a constituent; $\alpha \in \mathbb{T}^*, c \in \{\text{true}, \text{false}\}$.
\item $P_{CONTEXT}(\beta | c)$ is the probability of generating a context $\beta$ given that the span it surrounds is/is not a constituent; $\beta \in (\mathbb{T}\cup\diamond)^2,c \in \{\text{true}, \text{false}\}$.
\end{itemize}
The probability of a sentence $s$ with bracketing $B$ is the probability of that bracketing times the probability of the span and the context for each substring of $s$. The probability of a bracketing is given by the fixed distribution $P_\text{bin}$, which assigns equal weight to all bracketings that correspond to binary trees. For a sentence $s$ of length $n$, we index the spaces between words, including the leading space (0) and the trailing space ($n$). Let $\langle i,j\rangle_s$ be the substring from $i$ to $j$, and $\rangle i,j\langle_s$ be the context of that substring. Then we have: $$P(s,B)=P_{\text{bin}}(B)\displaystyle\prod_{\{\langle i,j\rangle_s\}}P_{\text{SPAN}}(\langle i,j\rangle_s|B_{ij})P_{\text{CONTEXT}}(\rangle i,j\langle_s|B_{ij})$$

\subsection{Reestimation}
We use the EM algorithm to induce our model parameters. Given the parameters $P_{\text{SPAN}}$ and $P_{\text{CONTEXT}}$, we first find the likelihoods of completions for all sentences in the corpus under the current parameter values, and then find new parameters that maximize the corpus likelihood given the predicted completions. Because enumerating all bracketings of all sentences is computationally infeasible, we use a dynamic programming algorithm similar to the inside-outside algorithm.

We begin by rewriting the probability $P(s,B)$ given above:
$$P(s,B)=P_{\text{bin}}(B)\displaystyle\prod_{\{\langle i,j\rangle_s\}}P_{\text{SPAN}}(\langle i,j\rangle_s|\text{false})P_{\text{CONTEXT}}(\rangle i,j\langle_s|\text{false}) \times$$$$ \prod_{\{\langle i,j\rangle_s\in B\}}\frac{P_{\text{SPAN}}(\langle i,j\rangle_s|\text{true})P_{\text{CONTEXT}}(\rangle i,j\langle_s|\text{true})}{P_{\text{SPAN}}(\langle i,j\rangle_s|\text{false})P_{\text{CONTEXT}}(\rangle i,j\langle_s|\text{false})}$$
We can then condense this by letting $$\phi(i,j,s)=\frac{P_{\text{SPAN}}(\langle i,j\rangle_s|\text{true})P_{\text{CONTEXT}}(\rangle i,j\langle_s|\text{true})}{P_{\text{SPAN}}(\langle i,j\rangle_s|\text{false})P_{\text{CONTEXT}}(\rangle i,j\langle_s|\text{false})}$$
Further, note that $P_{\text{bin}}$ is constant for all bracketings of a sentence. We can thus also write $$K(s)=P_{\text{bin}}(B)\displaystyle\prod_{\{\langle i,j\rangle_s\}}P_{\text{SPAN}}(\langle i,j\rangle_s|\text{false})P_{\text{CONTEXT}}(\rangle i,j\langle_s|\text{false})$$ and $K(s)$ is constant for a given sentence.
We thus have a condensed formulation of $P(s,B)$:
$$P(s,B)=K(s) \prod_{\{\langle i,j\rangle_s\in B\}}\phi(i,j,s)$$

In the spirit of the inside-outside algorithm, we can use a dynamic programming algorithm to calculate the product of $\phi(i,j,s)$ for a sentence. We wish to calculate the products of the $\phi(i,j,s)$ both inside and outside a given interval $\langle i,j\rangle$. Let $\mathcal{T}(m)$ be the set of binary tree structures with $m$ leaves. We then define $$I(i,j,s)=\displaystyle\sum_{T\in \mathcal{T}(j-i)}\prod_{\langle a,b\rangle:\langle a-i,b-i\rangle\in T}\phi(a,b,s)$$
To calculate $I$, we want to find a recursive decomposition. In the case where $j=i$, there can be no constituents between $i$ and $j$, so we have $I(i,i,s)=0$. In the case where $j=i+1$, there is necessarily exactly one constituent between $i$ and $j$, because all one-word substrings are constituents, so $I(i,i+1,s)=\phi(i,i+1,s)$. In the case where $j>i+1$, any constituent spanning from $i$ to $j$ must also contain subconstituents from $i$ to $k$ and from $k$ to $j$. Thus
$$I(i,j,s)=\displaystyle\sum_{i<k<j}\sum_{T\in\mathcal{T}(k-i)}\sum_{T'\in\mathcal{T}(j-k)}\phi(i,j,s)\prod_{\langle a,b\rangle:\langle a-i,b-i\rangle\in T}\phi(a,b,s)\prod_{\langle a',b'\rangle:\langle a'-k,b'-k\rangle\in T'}\phi(a',b',s)$$
$$=\phi(i,j,s)\displaystyle\sum_{i<k<j}\sum_{T\in\mathcal{T}(k-i)}\sum_{T'\in\mathcal{T}(j-k)}\prod_{\langle a,b\rangle:\langle a-i,b-i\rangle\in T}\phi(a,b,s)\prod_{\langle a',b'\rangle:\langle a'-k,b'-k\rangle\in T'}\phi(a',b',s)$$
$$=\phi(i,j,s)\displaystyle\sum_{i<k<j}I(i,k,s)I(k,j,s)$$

In summary, we have a recursive decomposition for $I$:
$$I(i,j,s) = \begin{cases}
\phi(i,j,s)\displaystyle\sum_{i<k<j}I(i,k,s)I(k,j,s) & \text{if } j-i>1 \\
\phi(i,j,s) & \text{if } j-i=1 \\
0 & \text{if } j-i = 0
\end{cases}$$
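
As a minimal sketch of how this recursion might be implemented
(assuming a hypothetical per-sentence helper \texttt{phi(i, j)}
standing in for $\phi(i,j,s)$; this is an illustration, not
necessarily our actual implementation), the chart can be filled
bottom-up over span lengths:

\begin{verbatim}
def inside_chart(phi, n):
    """Fill I(i,j) for all 0 <= i <= j <= n, given phi(i, j) for one sentence.

    phi: hypothetical helper returning the constituent/distituent ratio
         phi(i, j, s) for the current sentence.
    n:   sentence length (number of words)."""
    I = {}
    for length in range(0, n + 1):          # shorter spans first
        for i in range(0, n - length + 1):
            j = i + length
            if length == 0:
                I[i, j] = 0.0
            elif length == 1:
                I[i, j] = phi(i, j)
            else:
                I[i, j] = phi(i, j) * sum(I[i, k] * I[k, j]
                                          for k in range(i + 1, j))
    return I
\end{verbatim}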

Similarly, we define the outside probability
$$O(i,j,s) = \begin{cases}
\displaystyle\sum_{0\leq k<i}I(k,i,s)\phi(k,j,s)O(k,j,s) + \sum_{j<k\leq n} I(j,k,s)\phi(i,k,s)O(i,k,s) &
\text{if } j-i < n \\
1 & \text{if } j-i = n
\end{cases}$$

To reestimate the parameters $P_{\text{SPAN}}$ and $P_{\text{CONTEXT}}$, we wish to know $P_{\text{BRACKET}}(i,j,s)$, the fraction of bracketings of $s$ that contain $\langle i,j\rangle$ as a constituent:
$$P_{\text{BRACKET}}(i,j,s)=\frac{\displaystyle\sum_{\{B:B_{ij}=\text{true}\}}P(s,B)}{\displaystyle\sum_{B}P(s,B)}$$
We can rewrite this in terms of our inside and outside probabilities:
$$\displaystyle\sum_{\{B:B_{ij}=\text{true}\}}P(s,B)=K(s)I(i,j,s)O(i,j,s)$$

$$\displaystyle\sum_{B}P(s,B)=K(s)I(0,n,s)O(0,n,s)$$
But the $K(s)$ terms cancel, and $O(0,n,s)=1$, so we find that $$P_{\text{BRACKET}}(i,j,s)=\frac{I(i,j,s)O(i,j,s)}{I(0,n,s)}$$
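
Continuing the sketch above (again assuming the hypothetical
\texttt{phi(i, j)} and the chart returned by \texttt{inside\_chart}),
the outside chart and $P_{\text{BRACKET}}$ could then be computed as:

\begin{verbatim}
def outside_chart(phi, I, n):
    """Fill O(i,j) top-down, widest spans first, following the case
    definition above."""
    O = {}
    for length in range(n, -1, -1):
        for i in range(0, n - length + 1):
            j = i + length
            if length == n:
                O[i, j] = 1.0
            else:
                O[i, j] = (sum(I[k, i] * phi(k, j) * O[k, j]
                               for k in range(0, i)) +
                           sum(I[j, k] * phi(i, k) * O[i, k]
                               for k in range(j + 1, n + 1)))
    return O

def p_bracket(I, O, n):
    """P_BRACKET(i,j) = I(i,j) O(i,j) / I(0,n)."""
    return {(i, j): I[i, j] * O[i, j] / I[0, n] for (i, j) in I}
\end{verbatim}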

From $P_{\text{BRACKET}}$ we can calculate the next iteration of the model parameters $P_{\text{SPAN}}$ and $P_{\text{CONTEXT}}$. First, for a sentence $s$ of length $n_s$, note that the number of spans in the sentence is $N_s=(n_s+1)(n_s+2)/2$, which includes the spans of length 0. Furthermore, let $\#_s(\gamma)$ be the number of times that $\gamma$ appears in $s$, where $\gamma$ can be either a span or a context.

We begin with
$$P_{\text{SPAN}}(\alpha|c)=\displaystyle\sum_{s\in S}P(\alpha|c,s)P(s)=\frac{1}{|S|}\sum_{s\in S}P(\alpha|c,s)$$
This reduction works for $P_{\text{CONTEXT}}$ as well, so we have reduced the problem of calculating $P_{\text{SPAN}}$ and $P_{\text{CONTEXT}}$ on the corpus as a whole to calculating them for each sentence.

Let us begin by calculating $P_{\text{SPAN}}(\alpha|\text{true},s)$. By Bayes' Theorem, $$P_{\text{SPAN}}(\alpha|\text{true},s)=\frac{P_{\text{SPAN}}(\text{true}|\alpha,s)P_{\text{SPAN}}(\alpha|s)}{P_{\text{SPAN}}(\text{true}|s)}$$
We can compute each of the quantities on the right-hand side:
$$P_{\text{SPAN}}(\alpha|s)=\frac{\#_s(\alpha)}{N_s}$$
$$P_{\text{SPAN}}(\text{true}|s)=\frac{2n_s-1}{N_s}$$
$$P_{\text{SPAN}}(\text{true}|\alpha,s)=\frac{1}{\#_s(\alpha)}\displaystyle\sum_{\langle i,j\rangle=\alpha}P_{\text{BRACKET}}(i,j,s)$$

Thus we find that
$$P_{\text{SPAN}}(\alpha|\text{true},s)=\frac{1}{\#_s(\alpha)}\displaystyle\sum_{\langle i,j\rangle=\alpha}P_{\text{BRACKET}}(i,j,s)\frac{\#_s(\alpha)}{N_s}\frac{N_s}{2n_s-1}=\frac{1}{2n_s-1}\displaystyle\sum_{\langle i,j\rangle=\alpha}P_{\text{BRACKET}}(i,j,s)$$

A similar calculation shows that $$P_{\text{CONTEXT}}(\beta|\text{true},s)=\frac{1}{2n_s-1}\displaystyle\sum_{\rangle i,j\langle=\beta}P_{\text{BRACKET}}(i,j,s)$$

Now let us consider $P_{\text{SPAN}}(\alpha|\text{false},s)$. Again by Bayes' Theorem we have $$P_{\text{SPAN}}(\alpha|\text{false},s)=\frac{P_{\text{SPAN}}(\text{false}|\alpha,s)P_{\text{SPAN}}(\alpha|s)}{P_{\text{SPAN}}(\text{false}|s)}$$

$P_{\text{SPAN}}(\alpha|s)$ is as before, $P_{\text{SPAN}}(\text{false}|s)=1-P_{\text{SPAN}}(\text{true}|s)$, and $P_{\text{SPAN}}(\text{false}|\alpha,s)=1-P_{\text{SPAN}}(\text{true}|\alpha,s)$. Thus
$$P_{\text{SPAN}}(\alpha|\text{false},s)=\frac{\left(1-\frac{1}{\#_s(\alpha)}\displaystyle\sum_{\langle i,j\rangle=\alpha}P_{\text{BRACKET}}(i,j,s)\right)\frac{\#_s(\alpha)}{N_s}}{1-\frac{2n_s-1}{N_s}}=\frac{\#_s(\alpha)-\displaystyle\sum_{\langle i,j\rangle=\alpha}P_{\text{BRACKET}}(i,j,s)}{N_s-2n_s+1}$$

Similarly, we have $$P_{\text{CONTEXT}}(\beta|\text{false},s) = \frac{\#_s(\beta)-\displaystyle\sum_{\rangle i,j\langle=\beta}P_{\text{BRACKET}}(i,j,s)}{N_s-2n_s+1}$$

From these, we can calculate the new parameters $P_{\text{SPAN}}$ and $P_{\text{CONTEXT}}$ as above.
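
Putting these pieces together, a corpus-level reestimation of
$P_{\text{SPAN}}$ might look as follows (a sketch under the same
assumptions as the charts above; $P_{\text{CONTEXT}}$ is analogous,
and the smoothing counts discussed below are omitted):

\begin{verbatim}
from collections import defaultdict

def reestimate_p_span(sentences, brackets):
    """Reestimate P_SPAN(alpha | true) and P_SPAN(alpha | false).

    sentences: list of POS-tag sequences.
    brackets:  per-sentence dicts {(i, j): P_BRACKET(i, j, s)}.
    P_CONTEXT is handled analogously; smoothing omitted here."""
    p_true, p_false = defaultdict(float), defaultdict(float)
    for s, pb in zip(sentences, brackets):
        n = len(s)
        N = (n + 1) * (n + 2) // 2           # number of spans, incl. length 0
        sum_pb, count = defaultdict(float), defaultdict(int)
        for i in range(0, n + 1):
            for j in range(i, n + 1):
                alpha = tuple(s[i:j])
                sum_pb[alpha] += pb[i, j]
                count[alpha] += 1
        for alpha in count:
            p_true[alpha] += sum_pb[alpha] / (2 * n - 1)
            p_false[alpha] += ((count[alpha] - sum_pb[alpha])
                               / (N - 2 * n + 1))
    S = len(sentences)
    return ({a: p / S for a, p in p_true.items()},
            {a: p / S for a, p in p_false.items()})
\end{verbatim}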

\subsection{Initialization}
We initialize the model with a distribution $P_{\text{SPLIT}}$ over trees. $P_{\text{SPLIT}}$ is calculated as follows: for a string of $n$ words, choose a split point at random, then recursively build trees on either side of that split by further random splitting. We initialize the model using $P_{\text{BRACKET}}(i,j,s)=P_{\text{SPLIT}}(i,j,n_s)$ (a sketch of computing $P_{\text{SPLIT}}$ is given after the list below). However, we also apply smoothing, adding small constant counts $m_\text{true}$ and $m_\text{false}$ of each span and context as a constituent and distituent, respectively. $m_\text{true}$ should be smaller than $m_\text{false}$, as there are more distituent spans than constituent spans. However, \citet{klein-thesis} and \citet{km-ccm} give different values for these constants, and in any case they must vary in relation to the size of the corpus used. Let $\#_\alpha$ and $\#_\beta$ be the number of unique spans and contexts in the corpus. We initialize the model as follows:
\begin{itemize}
\item $P_{SPAN}(\alpha|\text{true})= \frac {(\displaystyle\sum_{s\in S}\displaystyle\sum_{\langle i,j\rangle=\alpha}P_{\text{SPLIT}}(i,j,n_s))+m_\text{true}}
{(\displaystyle\sum_{s\in S}2n_s-1)+m_\text{true} \#_\alpha}$
\item $P_{SPAN}(\alpha|\text{false})= \frac {(\displaystyle\sum_{s\in S}\displaystyle\sum_{\langle i,j\rangle=\alpha}(1-P_{\text{SPLIT}}(i,j,n_s)))+m_\text{false}}
{(\displaystyle\sum_{s\in S}N_s-(2n_s-1))+m_\text{false} \#_\alpha}$
\item $P_{CONTEXT}(\beta|\text{true})= \frac {(\displaystyle\sum_{s\in S}\displaystyle\sum_{\rangle i,j\langle=\beta}P_{\text{SPLIT}}(i,j,n_s))+m_\text{true}}
{(\displaystyle\sum_{s\in S}2n_s-1)+m_\text{true} \#_\beta}$
\item $P_{CONTEXT}(\beta|\text{false})= \frac {(\displaystyle\sum_{s\in S}\displaystyle\sum_{\rangle i,j\langle=\beta}(1-P_{\text{SPLIT}}(i,j,n_s)))+m_\text{false}}
{(\displaystyle\sum_{s\in S}N_s-(2n_s-1))+m_\text{false} \#_\beta}$
\end{itemize}
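
One way to obtain the $P_{\text{SPLIT}}$ weights used above is to
compute, with a small recursion, the probability that a given span
ends up as a node of a randomly split tree. This is an illustrative
sketch, not necessarily how \citet{klein-thesis} computes it:

\begin{verbatim}
from functools import lru_cache

def p_split(i, j, n):
    """Probability that the span (i, j) occurs as a constituent when a
    string of n words is bracketed by recursive uniform splitting."""
    @lru_cache(maxsize=None)
    def appears(a, b):
        # probability that (i, j) occurs somewhere inside the node (a, b)
        if (i, j) == (a, b):
            return 1.0
        if b - a <= 1:
            return 0.0
        total = 0.0
        for k in range(a + 1, b):            # uniform choice of split point
            if j <= k:
                total += appears(a, k)       # (i, j) must fall in the left part
            elif i >= k:
                total += appears(k, b)       # ... or in the right part
            # spans crossing the split point can no longer become nodes
        return total / (b - a - 1)
    return appears(0, n)
\end{verbatim}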

\subsection{Smoothing}
In addition to the smoothing used during initialization, some further smoothing is necessary during reestimation. In our initial calculations, we made the critical step of factoring out $\displaystyle\prod_{\{\langle i,j\rangle_s\}}P_{\text{SPAN}}(\langle i,j\rangle_s|\text{false})P_{\text{CONTEXT}}(\rangle i,j\langle_s|\text{false})$. This factorization assumes that $P_{\text{SPAN}}(\langle i,j\rangle_s|\text{false})$ and $P_{\text{CONTEXT}}(\rangle i,j\langle_s|\text{false})$ are nonzero for all spans and contexts. Particularly in the case of small corpora, this assumption is not always valid. For example, if a span $\alpha$ always appears as a full sentence (never as a proper substring), then its $P_{\text{SPAN}}(\alpha|\text{false})$ will be zero, because a full sentence is never a distituent. Or if a word $a$ only appears as the second word of a sentence, then the context $(\diamond,a)$ will have $P_{\text{CONTEXT}}((\diamond,a)|\text{false})=0$, because the one-word span that it surrounds will always be a constituent.

To avoid this problem, we use further smoothing during reestimation: during each reestimation of $P_{\text{SPAN}}$ and $P_{\text{CONTEXT}}$, we add to each span and context small constant counts of $m_\text{true}$ as a constituent and $m_\text{false}$ as a distituent.

\subsection{Results}
Our implementation of the CCM has not yet been tested on the Wall Street Journal corpus. On small toy corpora, it behaves unpredictably. For instance, on the corpus consisting of the sentences $a b c$ and $b c a$, the model gives higher probabilities to the spans $a b$ and $c a$ as constituents than it does to $b c$, which is the opposite of what we would expect. We have not yet been able to identify the source of this problem.

\section{A Dependency Model with Valence}
\newcommand{\LOC}[1]{\textbf{#1}}
\newcommand{\GOR}[1]{\overrightarrow{#1}}
\newcommand{\RGOL}[1]{\overleftarrow{\overrightarrow{#1}}}
\newcommand{\SEAL}[1]{\overline{#1}}
\newcommand{\LGOR}[1]{\overrightarrow{\overleftarrow{#1}}}
\newcommand{\GOL}[1]{\overleftarrow{#1}}
\newcommand{\LN}[1]{\underleftarrow{#1}}
\newcommand{\RN}[1]{\underrightarrow{#1}}
\newcommand{\XI}{\lessdot}
\newcommand{\XJ}{\gtrdot}
\newcommand{\SMTR}[1]{\dddot{#1}}
\newcommand{\SDTR}[1]{\ddot{#1}}

The DMV is a \emph{head-outward process}. Given a certain head (e.g.\ a
verb), we first attach possible arguments in one direction (e.g.\ nouns
to the right), then stop in that direction, then possibly attach
arguments in the other direction, then stop. The probabilities of a
certain head attaching any arguments in a direction sum to one
\citep[see~$P(D(h))$~in][p.~87]{klein-thesis}; when modelled as a PCFG,
however \citep[p.~83,~figure~6.4]{klein-thesis}, probabilities may sum
to less than one for a given left-hand side symbol.

In the following sections we try to flesh out the details of the
inside-outside algorithm \citep{lari-csl90} applied to the DMV model
of \citet{klein-thesis}. We have three parameters which are
reestimated using the inside-outside algorithm:

\begin{itemize}
\item $P_{ATTACH}(a|h,dir)$ is the probability of attaching an
argument with tag $a$, given that we are attaching in direction
$dir$ from head $h$. Here $a,h \in \mathbb{T},
dir\in\{left,right\}$.
\item $P_{STOP}(stop|h,dir,adj)$ is the probability of stopping (which
happens after argument attachment), given that we are attaching from
head $h$ in direction $dir$. Adjacency $adj$ is whether or not we
have attached anything yet in that direction.
\item $P_{ROOT}(h)$ is the probability of the special ROOT symbol
heading $h$; in the original model this is modelled using a right
stop, a left stop, and an attachment
\citep[p.~105,~figure~A.2]{klein-thesis}; we conflated these
into one additional parameter.
\end{itemize}

There is also a parameter $P_{ORDER}$, signifying the probability of a
head $w$ attaching to the right first, or to the left first. In the
results given below,
\[ \forall{w}[P_{ORDER}(right\text{-}first|w)=1.0] \]
(this parameter is not reestimated).
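
For concreteness, a minimal sketch of how these parameters might be
stored (the keys and container layout are our own illustration, not
necessarily the layout of the actual implementation):

\begin{verbatim}
# Hypothetical containers for the DMV parameters, keyed by POS tags.
p_attach = {}   # p_attach[a, h, d]       d in {'left', 'right'}
p_stop   = {}   # p_stop[h, d, adj]       adj in {True, False}
p_root   = {}   # p_root[h]
p_order  = {}   # p_order[w]              P(right-first | w), fixed to 1.0
\end{verbatim}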

\subsection{Note on notation}
$i, j, k$ are sentence positions (between words), where $i$ and $j$
are always the start and end, respectively, of what we are calculating
($k$ is between $i$ and $j$ for $P_{INSIDE}$, to their right or left
for $P_{OUTSIDE}$). $s \in S$ are sentences in the corpus. $\LOC{w}$
is a word token (actually a POS token) of type $w$ at a certain sentence
location. If $\LOC{w}$ is between $i$ and $i+1$, then $loc(\LOC{w})=i$,
following \citet{klein-thesis}, meaning $i$ is adjacent to $\LOC{w}$
on the left, while $j=loc(\LOC{w})+1$ means that $j$ is adjacent to
$\LOC{w}$ on the right. To simplify, $loc_l(\LOC{w}):=loc(\LOC{w})$ and
$loc_r(\LOC{w}):=loc(\LOC{w})+1$. We write $\LOC{h}$ if this is a head
in the rule being used, and $\LOC{a}$ if it is an attached argument.

There are some notational differences between the thesis
\citep{klein-thesis} and the ACL paper \citep{km-dmv}:

\begin{tabular}{cc}
Paper: & Thesis: \\
$w$ & $\GOR{w}$ \\
$w\urcorner$ & $\RGOL{w}$ \\
$\ulcorner{w}\urcorner$ & $\SEAL{w}$ \\
\end{tabular}

We use $\SMTR{w}$ (or $\SDTR{w}$) to signify one of either $w, \GOR{w},
\RGOL{w}, \LGOR{w}, \GOL{w}$ or $\SEAL{w}$\footnote{This means that
$\SMTR{\LOC{w}}$ is the triplet of the actual POS tag, its sentence
location as a token, and the ``level of seals''.}.

\subsection{Inside probabilities}
$P_{INSIDE}$ is defined in \citet[pp.~106-108]{klein-thesis}; the only
thing we need to add is that for right attachments,
$i \leq loc_l(\LOC{w})<k \leq loc_l(\LOC{a})<j$, while for left attachments,
$i \leq loc_l(\LOC{a})<k \leq loc_l(\LOC{w})<j$.

\subsubsection{Sentence probability}
$P_s$ is the sentence probability, based on
\citet[p.~38]{lari-csl90}. Since the ROOT rules are different from the
rest, we sum them explicitly in this definition:
\begin{align*}
P_s = \sum_{\LOC{w} \in s} P_{ROOT}(\LOC{w}) P_{INSIDE}(\SEAL{\LOC{w}}, 0, n_s)
\end{align*}
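
As a small illustration (assuming a hypothetical per-sentence chart
function \texttt{p\_inside(node, i, j)} and a dictionary
\texttt{p\_root}; the node encoding is our own), the sentence
probability could be computed as:

\begin{verbatim}
def sentence_prob(sentence, p_root, p_inside):
    """P_s = sum over tokens w in s of P_ROOT(w) * P_INSIDE(sealed w, 0, n_s).

    sentence: list of POS tags; a token is identified by (tag, position).
    p_root:   dict mapping POS tag -> P_ROOT(tag).
    p_inside: hypothetical chart function; ('seal', w, loc) encodes
              the sealed node for the token at position loc."""
    n = len(sentence)
    return sum(p_root[w] * p_inside(('seal', w, loc), 0, n)
               for loc, w in enumerate(sentence))
\end{verbatim}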

\subsection{Outside probabilities}

\begin{align*}
P_{OUTSIDE_s}(ROOT, i, j) = \begin{cases}
1.0 & \text{ if $i = 0$ and $j = n_s$,}\\
0.0 & \text{ otherwise}
\end{cases}
\end{align*}

For $P_{OUTSIDE}(\SEAL{w}, i, j)$, $w$ is attached under something
else ($\SEAL{w}$ is what we elsewhere call $\SEAL{a}$). Adjacency is
thus calculated on the basis of $h$, the head of the rule. If we are
attached as a left argument (the head lies to our right) we have $i \leq loc_l(\LOC{w}) < j \leq loc_l(\LOC{h}) < k$, while
if we are attached as a right argument we have $k \leq loc_l(\LOC{h}) < i \leq loc_l(\LOC{w}) < j$:
\begin{align*}
P_{OUTSIDE}&(\SEAL{\LOC{w}}, i, j) = \\
& P_{ROOT}(w) P_{OUTSIDE}(ROOT, i, j) + \\
& [ \sum_{k > j} ~ \sum_{\LOC{h}:j\leq loc_l(\LOC{h})<k} \sum_{\SMTR{\LOC{h}} \in \{\RGOL{\LOC{h}},\GOL{\LOC{h}}\}} P_{STOP}(\neg stop|h, left, adj(j, \LOC{h})) P_{ATTACH}(w|h, left) \\
& \qquad \qquad \qquad \qquad \qquad P_{OUTSIDE}(\SMTR{\LOC{h}}, i, k) P_{INSIDE}(\SMTR{\LOC{h}}, j, k) ] ~ + \\
& [ \sum_{k < i} ~ \sum_{\LOC{h}:k\leq loc_l(\LOC{h})<i} \sum_{\SMTR{\LOC{h}} \in \{\LGOR{\LOC{h}},\GOR{\LOC{h}}\}} P_{STOP}(\neg stop|h, right, adj(i, \LOC{h})) P_{ATTACH}(w|h, right) \\
& \qquad \qquad \qquad \qquad \qquad P_{INSIDE}(\SMTR{\LOC{h}}, k, i) P_{OUTSIDE}(\SMTR{\LOC{h}}, k, j) ]
\end{align*}

For $\RGOL{w}$ we know it is either under a left stop rule or it is
the right daughter of a left attachment rule ($k \leq loc_l(\LOC{a}) <
i \leq loc_l(\LOC{w}) < j$), and these are adjacent if the start point
($i$) equals $loc_l(\LOC{w})$:
\begin{align*}
P_{OUTSIDE}(\RGOL{\LOC{w}}, i, j) = & P_{STOP}(stop|w, left, adj(i,
\LOC{w}))P_{OUTSIDE}(\SEAL{\LOC{w}}, i, j) ~ + \\
& [ \sum_{k < i} ~ \sum_{\LOC{a}:k\leq loc_l(\LOC{a})<i} P_{STOP}(\neg stop|w, left, adj(i, \LOC{w})) P_{ATTACH}(a|w, left) \\
& ~~~~~~~~~~~~~~~~~~~~~~~~~ P_{INSIDE}(\SEAL{\LOC{a}}, k, i) P_{OUTSIDE}(\RGOL{\LOC{w}}, k, j) ]
\end{align*}

For $\GOR{w}$ we are either under a right stop or the left daughter of
a right attachment rule ($i \leq loc_l(\LOC{w}) < j \leq
loc_l(\LOC{a}) < k$), adjacent iff the end point ($j$) equals
$loc_r(\LOC{w})$:
\begin{align*}
P_{OUTSIDE}(\GOR{\LOC{w}}, i, j) = & P_{STOP}(stop|w, right, adj(j,
\LOC{w}))P_{OUTSIDE}(\RGOL{\LOC{w}}, i, j) ~ + \\
& [ \sum_{k > j} ~ \sum_{\LOC{a}:j\leq loc_l(\LOC{a})<k} P_{STOP}(\neg stop|w, right, adj(j, \LOC{w})) P_{ATTACH}(a|w, right) \\
& ~~~~~~~~~~~~~~~~~~~~~~~~~ P_{OUTSIDE}(\GOR{\LOC{w}}, i, k) P_{INSIDE}(\SEAL{\LOC{a}}, j, k) ]
\end{align*}

$\GOL{w}$ is just like $\RGOL{w}$, except for the outside probability
of having a stop above, where we use $\LGOR{w}$:
\begin{align*}
P_{OUTSIDE}(\GOL{\LOC{w}}, i, j) = & P_{STOP}(stop|w, left, adj(i,
\LOC{w}))P_{OUTSIDE}(\LGOR{\LOC{w}}, i, j) ~ + \\
& [ \sum_{k < i} ~ \sum_{\LOC{a}:k\leq loc_l(\LOC{a})<i} P_{STOP}(\neg stop|w, left, adj(i, \LOC{w})) P_{ATTACH}(a|w, left) \\
& ~~~~~~~~~~~~~~~~~~~~~~~~~ P_{INSIDE}(\SEAL{\LOC{a}}, k, i) P_{OUTSIDE}(\GOL{\LOC{w}}, k, j) ]
\end{align*}

$\LGOR{w}$ is just like $\GOR{w}$, except for the outside probability
of having a stop above, where we use $\SEAL{w}$:
\begin{align*}
P_{OUTSIDE}(\LGOR{\LOC{w}}, i, j) = & P_{STOP}(stop|w, right, adj(j,
\LOC{w}))P_{OUTSIDE}(\SEAL{\LOC{w}}, i, j) ~ + \\
& [ \sum_{k > j} ~ \sum_{\LOC{a}:j\leq loc_l(\LOC{a})<k} P_{STOP}(\neg stop|w, right, adj(j, \LOC{w})) P_{ATTACH}(a|w, right) \\
& ~~~~~~~~~~~~~~~~~~~~~~~~~ P_{OUTSIDE}(\LGOR{\LOC{w}}, i, k) P_{INSIDE}(\SEAL{\LOC{a}}, j, k) ]
\end{align*}

\subsection{DMV Reestimation}
$P_{INSIDE}$ and $P_{OUTSIDE}$ (computed over the corpus given the
current probability distribution) give us the counts we need to
reestimate our model parameters.

\subsubsection{$c$ and $w$}
First we need some helper functions.
$c_s(\SMTR{\LOC{w}} : i, j)$ is ``the expected fraction of parses of
$s$ with a node labeled $\SMTR{w}$ extending from position $i$ to
position $j$'' \citep[p.~88]{klein-thesis}, here defined to equal
$v_{q}$ of \citet[p.~41]{lari-csl90}\footnote{In terms of regular EM,
this is the count of trees ($f_{T_q}(x)$ in
\citet[p.~46]{prescher-em}) in which the node extended from $i$ to
$j$.}:
\begin{align*}
c_s(\SMTR{\LOC{w}} : i, j) = P_{INSIDE_s}(\SMTR{\LOC{w}}, i, j) P_{OUTSIDE_s}(\SMTR{\LOC{w}}, i, j) / P_s
\end{align*}

$w_s$ is $w_{q}$ from \citet[p.~41]{lari-csl90}, generalized to $\SMTR{h}$ and $dir$:
\begin{align*}
w_s(\SEAL{a} & : \SMTR{\LOC{h}}, left, i, j) = \\
& 1/P_s \sum_{k:i<k<j} ~ \sum_{\LOC{a}:i\leq loc_l(\LOC{a})<k}
& P_{STOP}(\neg stop|h, left, adj(k, \LOC{h})) P_{ATTACH}(a|h, left) \\
& & P_{INSIDE_s}(\SEAL{\LOC{a}}, i, k) P_{INSIDE_s}(\SMTR{\LOC{h}}, k, j) P_{OUTSIDE_s}(\SMTR{\LOC{h}}, i, j)
\end{align*}
\begin{align*}
w_s(\SEAL{a} & : \SMTR{\LOC{h}}, right, i, j) = \\
& 1/P_s \sum_{k:i<k<j} ~ \sum_{\LOC{a}:k\leq loc_l(\LOC{a})<j}
& P_{STOP}(\neg stop|h, right, adj(k, \LOC{h})) P_{ATTACH}(a|h, right) \\
& & P_{INSIDE_s}(\SMTR{\LOC{h}}, i, k) P_{INSIDE_s}(\SEAL{\LOC{a}}, k, j) P_{OUTSIDE_s}(\SMTR{\LOC{h}}, i, j)
\end{align*}

Let $\hat{P}$ be the new STOP/ATTACH probabilities (since the old $P$
are used in $P_{INSIDE}$ and $P_{OUTSIDE}$).

\subsubsection{Attachment reestimation}

This is based on $\hat{a}$, given in \citet[p.~41]{lari-csl90}.

When attaching to the left,
$i<loc_l(\LOC{h})$ since we want trees with at least one attachment:
\begin{align*}
\hat{P}_{ATTACH} (a | h, left) = \frac
{ \sum_{s \in S} \sum_{\LOC{h} \in s} \sum_{i<loc_l(\LOC{h})} \sum_{j\geq loc_r(\LOC{h})} [w_s(\SEAL{a} : \GOL{\LOC{h}}, left, i, j) + w_s(\SEAL{a} : \RGOL{\LOC{h}}, left, i, j) ]}
{ \sum_{s \in S} \sum_{\LOC{h} \in s} \sum_{i<loc_l(\LOC{h})} \sum_{j\geq loc_r(\LOC{h})} [c_s(\GOL{\LOC{h}} : i, j) + c_s(\RGOL{\LOC{h}} : i, j) ]}
\end{align*}

Below, $j>loc_r(\LOC{h})$ since we want at least one attachment to the right:
\begin{align*}
\hat{P}_{ATTACH} (a | h, right) = \frac
{ \sum_{s \in S} \sum_{\LOC{h} \in s} \sum_{i\leq loc_l(\LOC{h})} \sum_{j>loc_r(\LOC{h})} [w_s(\SEAL{a} : \LGOR{\LOC{h}}, right, i, j) + w_s(\SEAL{a} : \GOR{\LOC{h}}, right, i, j) ]}
{ \sum_{s \in S} \sum_{\LOC{h} \in s} \sum_{i\leq loc_l(\LOC{h})} \sum_{j>loc_r(\LOC{h})} [c_s(\LGOR{\LOC{h}} : i, j) + c_s(\GOR{\LOC{h}} : i, j) ]}
\end{align*}

For the first/lowest attachments, $w_s$ and $c_s$ have zero probability
where $i<loc_l(\LOC{h})$ (for $\GOR{h}$) or $j>loc_r(\LOC{h})$ (for $\GOL{h}$);
this is implicit in $P_{INSIDE}$.

\subsubsection{Stop reestimation}
The following is based on \citet[p.~88]{klein-thesis}. For the
non-adjacent rules, $i<loc_l(\LOC{h})$ on the left and
$j>loc_r(\LOC{h})$ on the right, while for the adjacent rules these
are equal (respectively)\footnote{For left-stopping with right-first
attachments, $j \geq loc_r(\LOC{h})$ since we may have
right-attachments below, for left-stopping with left-first
attachments we only need $j=loc_r(\LOC{h})$.}.

To avoid some redundancy below, define a helper function $\hat{d}$ as follows:
\begin{align*}
\hat{d}(\SMTR{h},\XI,\XJ) =
{ \sum_{s \in S} \sum_{\SMTR{\LOC{h}}:\LOC{h} \in s} \sum_{i:i \XI loc_l(\LOC{h})} \sum_{j:j \XJ loc_r(\LOC{h})} c_s(\SMTR{\LOC{h}} : i, j) }
\end{align*}

Then these are our reestimated stop probabilities:
\begin{align*}
\hat{P}_{STOP} (STOP|h, left, non\text{-}adj) = \frac
{ \hat{d}(\SEAL{h},<,\geq) + \hat{d}(\LGOR{h},<,=) }
{ \hat{d}(\RGOL{h},<,\geq) + \hat{d}(\GOL{h},<,=) }
\end{align*}

\begin{align*}
\hat{P}_{STOP} (STOP|h, left, adj) = \frac
{ \hat{d}(\SEAL{h},=,\geq) + \hat{d}(\LGOR{h},=,=) }
{ \hat{d}(\RGOL{h},=,\geq) + \hat{d}(\GOL{h},=,=) }
\end{align*}

\begin{align*}
\hat{P}_{STOP} (STOP|h, right, non\text{-}adj) = \frac
{ \hat{d}(\RGOL{h},=,>) + \hat{d}(\SEAL{h},\leq,>) }
{ \hat{d}(\GOR{h},=,>) + \hat{d}(\LGOR{h},\leq,>) }
\end{align*}

\begin{align*}
\hat{P}_{STOP} (STOP|h, right, adj) = \frac
{ \hat{d}(\RGOL{h},=,=) + \hat{d}(\SEAL{h},\leq,=) }
{ \hat{d}(\GOR{h},=,=) + \hat{d}(\LGOR{h},\leq,=) }
\end{align*}
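
As an illustration, the helper $\hat{d}$ could be realized by passing
the two comparison operators as parameters. This is a sketch:
\texttt{charts} is assumed to hold per-sentence dicts of the $c_s$
values defined above, keyed by (node, $i$, $j$), and the node encoding
(tag, position, ``level of seals'') is our own:

\begin{verbatim}
import operator

CMP = {'<': operator.lt, '=': operator.eq, '>': operator.gt,
       '<=': operator.le, '>=': operator.ge}

def d_hat(corpus, charts, tag, seal_level, cmp_i, cmp_j):
    """Sum c_s(node : i, j) over all sentences, all tokens of the given
    tag, and all spans (i, j) with i CMP_I loc_l(h) and j CMP_J loc_r(h)."""
    ci, cj = CMP[cmp_i], CMP[cmp_j]
    total = 0.0
    for s, c in zip(corpus, charts):
        n = len(s)
        for loc, h in enumerate(s):
            if h != tag:
                continue
            node = (seal_level, h, loc)      # e.g. ('rgol', tag, position)
            for i in range(0, n + 1):
                for j in range(i, n + 1):
                    if ci(i, loc) and cj(j, loc + 1):
                        total += c.get((node, i, j), 0.0)
    return total
\end{verbatim}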

\subsubsection{Root reestimation}
Following \citet[p.~46]{prescher-em}, to find the reestimated
probability of a PCFG rule, we first find the new treebank frequencies
$f_{T_P}(tree)=P(tree)/P_s$, then for a rule $X' \rightarrow X$ we
divide the new frequencies of the trees which use this rule by
those of the trees containing the node $X'$. $ROOT$ appears once per
tree, meaning we divide by $1$ per sentence\footnote{Assuming each
tree has frequency $1$.}, so $\hat{P}_{ROOT}(h)=\sum_{tree:ROOT
\rightarrow \SEAL{h} \text{ used in } tree} f_{T_P}(tree)=\sum_{tree:ROOT
\rightarrow \SEAL{h} \text{ used in } tree} P(tree)/P_s$, which turns into:

\begin{align*}
\hat{P}_{ROOT} (h) = \frac
{\sum_{s\in S} 1 / P_s \cdot \sum_{\LOC{h}\in s} P_{ROOT}(\LOC{h}) P_{INSIDE_s}(\SEAL{h}, 0, n_s)}
{|S|}
\end{align*}
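
As an illustration (using the same hypothetical node encoding
\texttt{('seal', tag, position)} as in the sentence-probability
sketch, but with per-sentence inside charts stored as dicts), this
reestimation could be written as:

\begin{verbatim}
from collections import defaultdict

def reestimate_root(corpus, p_root, inside_charts, sentence_probs):
    """hat{P}_ROOT(h): average over sentences of
    (1/P_s) * sum over tokens h of P_ROOT(h) * P_INSIDE_s(sealed h, 0, n_s).

    inside_charts[x] is a hypothetical per-sentence dict mapping
    (node, i, j) -> P_INSIDE_s; sentence_probs[x] is P_s."""
    new = defaultdict(float)
    for x, s in enumerate(corpus):
        n, chart, ps = len(s), inside_charts[x], sentence_probs[x]
        for loc, h in enumerate(s):
            new[h] += p_root[h] * chart[('seal', h, loc), 0, n] / ps
    return {h: v / len(corpus) for h, v in new.items()}
\end{verbatim}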

\subsection{Alternate CNF-like rules}
Since the IO algorithm as described in \citet{lari-csl90} is made for
rules in Chomsky Normal Form (CNF), we have an alternate grammar
(Table \ref{tab:cnf-like}) for running tests, where we do not have to sum
over the different $loc(h)$ in IO. This is not yet generalized to
include left-first attachment. It is also not quite CNF, since it
includes some unary rewrite rules.

\begin{table*}[ht]
\centering
\begin{tabular} % four left-aligned math tabs, one vertical line
{ >{$}l<{$} >{$}l<{$} >{$}l<{$} | >{$}l<{$} }
\multicolumn{3}{c}{Rule} & \multicolumn{1}{c}{$P_{RULE}$ ($a[i,j,k]$ in \citet{lari-csl90})}\\
\hline{}

\RN{\GOR{h}} \rightarrow& \GOR{h} &\SEAL{a} &P_{STOP}(\neg stop|h, right, adj) \cdot P_{ATTACH}(a|h, right) \\
&&&\\
\RN{\GOR{h}} \rightarrow& \RN{\GOR{h}} &\SEAL{a} &P_{STOP}(\neg stop|h, right, non\text{-}adj) \cdot P_{ATTACH}(a|h, right) \\
&&&\\
\RGOL{h} \rightarrow& \GOR{h} &STOP &P_{STOP}(stop|h, right, adj) \\
&&&\\
\RGOL{h} \rightarrow& \RN{\GOR{h}} &STOP &P_{STOP}(stop|h, right, non\text{-}adj) \\
&&&\\
\LN{\RGOL{h}} \rightarrow& \SEAL{a} &\RGOL{h} &P_{STOP}(\neg stop|h, left, adj) \cdot P_{ATTACH}(a|h, left) \\
&&&\\
\LN{\RGOL{h}} \rightarrow& \SEAL{a} &\LN{\RGOL{h}} &P_{STOP}(\neg stop|h, left, non\text{-}adj) \cdot P_{ATTACH}(a|h, left) \\
&&&\\
\SEAL{h} \rightarrow& STOP &\RGOL{h} &P_{STOP}(stop|h, left, adj) \\
&&&\\
\SEAL{h} \rightarrow& STOP &\LN{\RGOL{h}} &P_{STOP}(stop|h, left, non\text{-}adj) \\
\end{tabular}
\caption{Alternate CFG rules (where a child node has an arrow below,
we use non-adjacent probabilities), defined for all words/POS-tags
$h$.}\label{tab:cnf-like}
\end{table*}

The inside probabilities are the same as those given in
\citet{lari-csl90}, with the following exceptions:

When calculating $P_{INSIDE}(\SMTR{h}, i, j)$ and summing through
possible rules which rewrite $\SMTR{h}$, if a rule is of the form
$\SMTR{h} \rightarrow STOP ~ \SDTR{h}$ or $\SMTR{h} \rightarrow
\SDTR{h} ~ STOP$, we add $P_{RULE}\cdot P_{INSIDE}(\SDTR{h}, i, j)$
(that is, rewrite for the same sentence range); and, as a consequence
of these unary rules, for ``terminal rules'' ($P_{ORDER}$) to be
applicable, not only must $i = j-1$, but also the left-hand side
symbol of the rule must be of the form $\GOR{h}$.

Similarly, the outside probabilities are the same as those for pure
CNF rules, with the exception that we add the unary rewrite
probabilities
\begin{align*}
\sum_{\SMTR{h}} [&P_{OUTSIDE}(\SMTR{h},i,j)\cdot P_{RULE}(\SMTR{h} \rightarrow \SDTR{h} ~ STOP) \\
+ &P_{OUTSIDE}(\SMTR{h},i,j)\cdot P_{RULE}(\SMTR{h} \rightarrow STOP ~ \SDTR{h})]
\end{align*}
to $P_{OUTSIDE}(\SDTR{h},i,j)$ (i.e.\ $f(s,t,i)$).

This grammar gave the same results for inside and outside
probabilities as the original rules when run over our corpus.

\subsection{Initialization}
\citet{klein-thesis} describes DMV initialization using a ``harmonic
distribution'' for the initial probabilities, where the probability of
one word heading another is higher if they appear closer to one
another.

There are several ways this could be implemented. We initialized attachment
probabilities with the following formula:

\begin{align*}
P_{ATTACH}(a|h,right) = \frac
{\sum_{s \in S}\sum_{\LOC{h} \in s} \sum_{\LOC{a} \in s:loc(\LOC{a})>loc(\LOC{h})} 1/(loc(\LOC{a})-loc(\LOC{h})) + C_A}
{\sum_{s \in S}\sum_{\LOC{h} \in s} \sum_{\LOC{w} \in s:loc(\LOC{w})>loc(\LOC{h})} 1/(loc(\LOC{w})-loc(\LOC{h})) + C_A}
\end{align*}

The probability of stopping adjacently (left or right) was increased
whenever a word occurred at a (left or right) sentence
border\footnote{For non-adjacent stopping we checked for occurrence at
the second(-to-last) position.}:

\begin{align*}
f(stop:\LOC{h},left,adj)=\begin{cases}
C_S \text{, if } loc(\LOC{h}) = 0,\\
0 \text{, otherwise}
\end{cases}
\end{align*}

\begin{align*}
P_{STOP}(stop|h,left,adj) = \frac
{C_{M} + \sum_{s \in S}\sum_{\LOC{h} \in s} f(stop:\LOC{h},left,adj)}
{C_{M} + \sum_{s \in S}\sum_{\LOC{h} \in s} C_S+C_N}
\end{align*}

Tweaking the initialization constants $C_A, C_M, C_S$ and $C_N$
allowed us to easily try different initial distributions (e.g.\ to modify
the initial grammar in the direction of a uniform distribution). The
next section discusses the effect of these constants.

\subsection{Results}
We compared the results of the implementation with a dependency-parsed
version of the WSJ-10 corpus (converted from the manually annotated
version using a Perl script by Valentin Jijkoun). Since single-word
sentences were not POS-tagged there, these were skipped. Also, the
dependency-parsed WSJ-10 did not have ROOT nodes, so we calculated
precision and recall both without counting our ROOT links, and with
counting ROOT links, by adding these to the gold parses where
possible\footnote{221 parses in the dependency-parsed WSJ-10 had
several tokens appearing as heads without appearing as dependents in
the same parse; we skipped these parses when calculating with
ROOT links. Our gold standard also sometimes (for 1249 sentences)
had one dependent with two heads; these were excluded from
evaluation. We have not yet had time to run an evaluation of undirected
dependencies.}.

\begin{table*}[hb]
\centering
\begin{tabular}{l|ccc}
Model & P & R & F1 \\
\hline
LBRANCH/RHEAD & & & 33.6 \\
RANDOM & & & 30.1 \\
RBRANCH/LHEAD & & & 24.0 \\
K\&M's DMV & & & 43.2 \\
Our DMV: \\
Uniform initial distribution & 21.0 (18.7) & 20.1 (18.1) & 20.5 (18.4) \\
$C_A=0; C_S=1;C_N=0.1;C_M=1$ & 24.8 (24.5) & 23.7 (23.7) & 24.2 (24.1) \\ % local screen
$C_A=0; C_S=1;C_N=0.1;C_M=10$ & 26.7 (31.6) & 25.5 (30.5) & 26.1 (31.0) \\ % uib screen
% $C_A=10;C_S=1;C_N=0.1;C_M=10$ & 25.6 (30.6) & 24.5 (29.6) & 25.0 (30.1) \\
$C_A=10;C_S=1;C_N=3 ;C_M=10$ & & & 26.3 (31.4) \\
$C_A=15;C_S=3;C_N=1 ;C_M=20$ & & & 27.2 (32.2) \\
\end{tabular}
\caption{DMV results on the WSJ-10 for various initialization values
(numbers in parentheses are when counting added ROOT links).}
\label{tab:dmv-wsj}
\end{table*}

We tried various values for the initialization constants, but it was
hard to find any clear pattern for what worked best.
%$C_N$ values of 10 or less worked well,
% todo

Table \ref{tab:dmv-wsj} shows the results of running our
implementation on the WSJ-10 corpus, compared with the dependency-parsed
version, for given values of the initialization constants. The
EM hilltop seemed to be reached after about 30 iterations for our best
results (sooner for the worse results). As the table indicates
(especially the numbers for ``rooted'' evaluation), different initial
settings can lead to very different results\footnote{One alternative
idea we still have not tried is basing initial DMV stopping
probabilities on \emph{distance} from the sentence start/end, instead of
the ``binary'' test given above.}.

The numbers are not completely comparable, though, since our corpus had
6268 sentences, while \citet{km-dmv} report 7422 sentences; also, it is
not clear whether their DMV experiment was run using automatically
induced word classes \citep[Schütze, 1995, in][p.~8]{km-dmv} or on the
tagset used to manually annotate the WSJ-10.

% underproposed = # $C_A=10;C_S=1;C_N=3 ;C_M=10$
% {(('NN', 1), ('DT', 0)): 347,
% (('NN', 2), ('DT', 0)): 148,
% (('NN', 3), ('DT', 2)): 136,
% (('NN', 4), ('DT', 3)): 144,
% (('NN', 5), ('DT', 4)): 128,
% (('NN', 6), ('DT', 5)): 124,
% (('NNP', 1), ('NNP', 0)): 358,
% (('NNP', 2), ('NNP', 0)): 125,
% (('NNP', 2), ('NNP', 1)): 174,
% (('NNS', 1), ('JJ', 0)): 124,
% (('ROOT', -1), ('NN', 0)): 100,
% (('ROOT', -1), ('NN', 1)): 106,
% (('ROOT', -1), ('NNP', 1)): 140,
% (('ROOT', -1), ('NNP', 2)): 104,
% (('VBD', 2), ('NN', 1)): 145,
% (('VBP', 2), ('NNS', 1)): 111,
% (('VBZ', 2), ('NN', 1)): 152,
% (('VBZ', 2), ('NNP', 1)): 107,
% (('VBZ', 3), ('NN', 2)): 109, }

In the stochastic grammars given by the model after EM, the POS class
$VBD$ (past-tense verb) had the highest probability of being attached
to by ROOT, followed by $VBZ$ (3sg present) and $VBP$ (non-3sg
present), which looks promising. Interestingly, in fourth place we see
the tag $:$, and in sixth, $``$. Results might have been different
with such tokens stripped from the corpus and the evaluation standard;
we have not tried this yet\footnote{A ROOT attachment to $:$ does make
sense though, in the same way that a comma might be the head of a
conjunction. Our evaluation standard almost always
% 97 per cent of the time..
puts colons as dependents of the phrase head (often an NP
fragment).}.

Just like \citet[p.~89]{klein-thesis}, our algorithm chose determiners
to be the heads of nouns, rather than the opposite.
% Also, in general
% non-adjacent stopping was a lot more likely than adjacent stopping, a
% situation given by the DMV model.

% \section{The combined model (?)}
% \subsection{Results (?)}

\section{Conclusion}
From our implementations it seems that these processes capture some of
the statistical regularities in language, e.g.\ the strong likelihood of
verbal sentence heads; however, our evaluations show that we did not
manage to replicate Klein \& Manning's parsing results, let alone
improve on them.

\nocite{lari-csl90}
\nocite{klein-thesis}
\nocite{km-dmv}
\bibliography{./statistical.bib}
\bibliographystyle{plainnat}

\end{document}