1 \documentclass [12pt,a4paper]{book}
2 %\ifx\pdfoutput\undefined
3 %\usepackage[dvips]{graphicx}
4 %\else
5 %\usepackage[pdftex]{graphicx}
6 %\usepackage{type1cm}
7 %\fi
8 \usepackage[dvips]{graphicx}
9 \usepackage{rotating}
10 \usepackage{palatino,helvet}
11 \usepackage[english]{babel}
12 \usepackage[latin1]{inputenc}
13 \usepackage{sectsty}
14 \usepackage{alltt}
15 \usepackage[small,bf]{caption}
16 \usepackage{url}
18 \usepackage{longtable}
19 % \usepackage{tocvsec2}
21 % \allsectionsfont{\sffamily}
25 \setcounter{secnumdepth}{3}
26 %\setcounter{tocdepth}{3} %%(so that index reaches the third level, more specific)
28 % Line break after \paragraph
29 \makeatletter % so that '@' is recognized as a normal character
30 \renewcommand{\paragraph}{\@startsection{paragraph}{4}{\z@}{-3.25ex \@plus
31 -1ex \@minus -.2ex}{1.5ex \@plus .2ex}{\normalfont\normalsize\bfseries}}
32 \makeatother % so that '@' is again a special character
36 % \newcommand{\nota}[1]{ \begin{small}
37 % \begin{quote}
38 % \begin{sf}
39 % [Nota: #1]
40 % \end{sf}
41 % \end{quote}
42 % \end{small}
43 % }
45 \newcommand{\nota}[1]{}
48 \newcommand{\notavisible}[1]{
49 \begin{small}
50 \begin{quote}
51 \begin{sf}
52 [#1]
53 \end{sf}
54 \end{quote}
55 \end{small}
}
59 %% Project ``Open Source Machine Translation for the languages of Spain (FIT-340101-2004-3) \\[.5ex]
60 \frontmatter
62 \title{\sffamily\bfseries Documentation of the Open-Source
63 Shallow-Transfer Machine Translation Platform \emph{Apertium}}
64 %%\date{28 June 2005}
68 \author{\textbf{AUTHORS}:\\Mikel L. Forcada\\Boyan Ivanov
69 Bonev\\Sergio Ortiz Rojas\\ Juan Antonio Pérez Ortiz \\
70 Gema Ramírez Sánchez\\Felipe Sánchez
71 Martínez\\ Carme Armentano-Oller\\ Marco A.\ Montava \\ Francis M.\ Tyers\\\\\textbf{EDITOR}:\\Mireia Ginestí
72 Rosell\\[0.8cm]\\Departament de Llenguatges i Sistemes
73 Informàtics\\Universitat d'Alacant}
74 %% \textit{Eleka Ingeniaritza Linguistikoa} \\
75 %% \textit{Zelai Haundi Kalea, 3} \\
76 %% \textit{Osinalde Industrialdea} \\
77 %% \textit{20170 Usurbil}}
81 \begin{document}
82 \pagestyle{headings}
83 %\maxtocdepth{subsubsection}
84 %\maxtocdepth{paragraph}
86 %\settocdepth{subsubsection}
88 \maketitle
91 \newpage \thispagestyle{empty}
93 \bigskip
94 \begin{quote} Copyright \copyright 2007 Grup Transducens, Universitat
95 d'Alacant. Permission is granted to copy, distribute and/or modify
96 this document under the terms of the GNU Free Documentation License,
97 Version 1.2 or any later version published by the Free Software
98 Foundation; with no Invariant Sections, no Front-Cover Texts, and no
99 Back-Cover Texts. A copy of the license can be found in
100 \url{http://www.gnu.org/copyleft/fdl.html}.
102 % The unofficial
103 % translation of the license to Spanish can be found at
104 % \url{http://gugs.sindominio.net/licencias/gfdl-1.2-es.html}, the
105 % unofficial translation to Catalan can be found at
106 % \url{http://www.softcatala.org/llicencies/fdl-ca.html}, and the
107 % unofficial translation to Galician can be found at
108 % \url{http://members.tripod.com.br/RamonFlores/GNU/gpl.html}.
111 \notavisible{Shouldn't we license this under GPL or another license that is free in Debian terms?
113 Make sure 1.2 is the right license.
115 Perhaps we don't want ``or any later version''.
117 Check the author list. We might have forgotten someone.}
119 \end{quote}
122 \bigskip
125 \tableofcontents
127 \newpage
129 \mainmatter
130 \chapter*{Introduction}\addcontentsline{toc}{chapter}{Introduction}
This documentation describes the Apertium platform, one of the
open-source machine translation systems which originated within the
project ``Open-Source Machine Translation for the Languages of Spain''
(``Traducción automática de código abierto para las lenguas del estado
español'')\nota{Shall we put a summary of the project data (code,
duration, funding, participants, etc.) in an appendix and refer to it
here? - Mikel}. It is a shallow-transfer machine translation system,
initially designed for translation between related language pairs,
although some of its components have also been used in the
deep-transfer architecture (\emph{Matxin}) that has been developed in
the same project for the pair Spanish--Basque. At present,
\emph{Apertium} can translate between the pairs Spanish--Galician,
Spanish--Catalan,\footnote{With the name \emph{Catalan} we refer also
to the Valencian dialectal variant of this language.}
Catalan--Occitan and Catalan--French, and can be used to build
translators between other related language pairs, such as
Danish--Swedish, Czech--Slovak, etc. \notavisible{Update the
language-pair list!} \notavisible{I think it is very important to say in this paragraph that the system has been extended}
153 \notavisible{The next paragraph needs updating or generalizing:}
154 Existing machine translation systems available at present for the
155 pairs \texttt{es}--\texttt{ca} and \texttt{es}--\texttt{gl} are mostly
commercial or use proprietary technologies, which makes them very hard
to adapt to new uses; furthermore, they use different technologies
across language pairs, which makes it very difficult to integrate them
into a single multilingual content management system.
161 One of the main novelties of the architecture described here is that
162 it has been released under open-source licenses (in most cases, GNU
163 GPL; some data still have a Creative Commons license) and is
164 distributed free of charge. This means that anyone having the
165 necessary computational and linguistic skills will be able to adapt or
166 enhance the platform or the language-pair data to create a new machine
167 translation system, even for other pairs of related languages. The
168 licenses chosen make these improvements immediately available to
everyone. We therefore expect that the introduction of this
open-source machine translation architecture will solve some of the
problems mentioned above (having different technologies for different
pairs, closed-source architectures being hard to adapt to new uses,
etc.) and promote the exchange of existing linguistic data through the
use of the XML-based formats defined in this documentation. In
addition, we think that it will help shift the current business model
from a license-centred one to a services-centred one.
It is worth mentioning that ``Open-Source Machine Translation for the
Languages of Spain'' was the first large open-source machine
translation project funded by the central Spanish Government, although
the adoption of open-source software by the Spanish governments is not
new.
184 \notavisible{Don't forget about the other funding agencies supporting open source MT; this needs some contextualization, relating to funding, etc. Mention later funding and refer to the appropriate section.}
186 This documentation describes in detail the characteristics of the
187 Apertium platform, and is organized as follows:
190 \begin{itemize}
191 \item Chapter \ref{ss:descrarq}: \textbf{general description} of the
192 shallow-transfer machine translation system and of the modules that
193 make it up.
195 \item Chapter \ref{se:flujodatos}: description of the \textbf{format
196 of the data stream} that circulates from one module to the next one.
\item Chapter \ref{se:especificmodulos}: \textbf{specification of the
modules} of the system. For each module there is a description of the
\textit{program} and its characteristics, the \textit{format of the data}
that the module uses, and the \textit{compilers} used for it.
This chapter is divided into the following sections:
\begin{itemize}
\item [-]Section \ref{ss:modproclex}: \emph{Lexical processing
modules}, where the morphological analyser, the lexical transfer
module, the morphological generator and the post-generator are
described (Section \ref{ss:funcproclex}), along with the format of
the dictionaries used by these modules (Section
\ref{ss:diccionarios}) and their compilers (Section
\ref{se:compiladoresdic}).
\item [-]Section \ref{ss:tagger}: \emph{Part-of-speech tagger},
which describes the tagger (Section \ref{functagger}) and the
format of the linguistic data used by the tagger (Section
\ref{datostagger}).
% MLF 20060328 removed: and the corresponding compiler (Section
%\ref{ss:gentagger})

\nota{lextor still needs to be described, and added wherever the
modules of the system are mentioned}

\item [-]Section \ref{se:pretransfer}: \emph{Pre-transfer module},
which describes the module that runs before the structural
transfer module to perform some operations on multiword units.
\item [-]Section \ref{ss:transfer}: \emph{Structural transfer
module}, where there is a description of the program (Section
\ref{functransfer}) and of the format of the structural transfer
rules (Section \ref{formatotransfer}).
% MLF 20060328 removed: and the corresponding compiler (Section
% \ref{gentransfer})
\item [-]Section \ref{se:desformat}: \emph{De-formatter and
re-formatter}, which describes these modules (Section
\ref{ss:formato}), the rules for format processing (Section
\ref{ss:reglasformato}) and how these modules are generated
(Section \ref{se:gendeformat}).

\end{itemize}
240 \item Chapter \ref{se:instalacion}: it describes the way to
241 \textbf{install the system} and to \textbf{run the translator}.
243 \item Chapter \ref{se:datosling}: here you will find an explanation of
244 how to \textbf{modify the linguistic data} used by the translator,
245 that is, the dictionaries, the part-of-speech disambiguation data
246 and the structural transfer rules created in this project for
247 Spanish, Catalan and Galician. Furthermore, it contains a brief
248 description of the characteristics of the
249 available data for these three language pairs.
250 \notavisible{I would try to be more general, and perhaps remove this section or update with some other pairs. Any ideas on how to do this?}
\nota{Are the program names, and the package each one belongs to,
given everywhere?}
257 \end{itemize}
The files which this documentation refers to can be downloaded from
the project web page on SourceForge:
\url{http://apertium.sourceforge.net/}. From this page you can
download the packages needed for installation, as well as view the
individual files in the SVN (main) and CVS (residual) repositories of
the project. The machine translation systems for the different
language pairs can also be tested on the Internet at
\url{http://xixona.dlsi.ua.es/apertium/}.
269 \notavisible{Shouldn't we mention the debugging interfaces?}
270 \notavisible{Should we define SVN and CVS?}
%The present document has some sections that are incomplete or
%have not been written yet.
276 \paragraph*{Acknowledgements:} The present work has benefited from the
277 contribution of many people and institutions:
278 \begin{itemize}
279 \item The Spanish Ministry of Industry, Commerce and Tourism has
280 funded the development of this toolbox through the projects
281 ``Open-Source Machine Translation for the Languages of Spain'', code
282 FIT-340101-2004-3, and its extension FIT-340001-2005-2, and
283 ``EurOpenTrad: Open-Source Advanced Machine Translation for the
284 European Integration of the Languages of Spain'', code
285 FIT-350101-2006-5, all of them belonging to the PROFIT program.
289 \item Workers and scholars from other machine translation projects at
290 the Universitat d'Alacant: Míriam Antunes Scalco, Carme Armentano i
291 Oller, Raül Canals i Marote, Alicia Garrido Alenda, Patrícia Gilabert
292 i Zarco, Maribel Guardiola i Savall, Javier Herrero Vicente, Amaia
293 Iturraspe Bellver, Sandra Montserrat i Buendia, Hermínia Pastor Pina,
294 Antonio Pertusa Ibáñez, Francisco Javier Ramos Salas, Marcial Samper
295 Asensio and Miguel Sánchez Molina.
296 \item The companies and institutions that have funded these other
297 machine translation projects: Spanish Ministry of Science and
298 Technology, Caja de Ahorros del Mediterráneo, Universitat d'Alacant
299 and Portal Universia, S.A.
300 \item Iñaki Alegria, from the Ixa group of the Euskal Herriko
301 Unibertsitatea (University of the Basque Country), for his close
302 reading of previous versions of this document.
303 \end{itemize}
305 \vspace{12cm}
310 \chapter[The translation engine]{The shallow-transfer machine
311 translation engine }
312 \label{ss:descrarq}
315 This chapter describes briefly the structure of the shallow-transfer
316 machine translation engine, which is largely based on that of the
317 existing systems for Spanish--Catalan \textsf{interNOSTRUM}
318 \cite{canals01b,garridoalenda01p,garrido99j} and for
319 Spanish--Portuguese \textsf{Traductor Universia} \cite{garrido03p,
320 gilabert03j}, both developed by the Transducens group of the
321 Universitat d'Alacant. It is a classical indirect translation system
322 that uses a partial syntactic transfer strategy similar to the one
323 used by some commercial MT systems for personal computers.
The design of the system makes it possible to produce MT systems that
are \emph{fast} (translating tens of thousands of words per second on
ordinary desktop computers) and that achieve results that are, in spite of
the errors, reasonably intelligible and easily correctable. In the
case of related languages such as the ones involved in the project
(Spanish, Galician, Catalan), a mechanical word-for-word translation
(with a fixed equivalent) would produce errors that, in most
cases, can be solved with a quite rudimentary analysis (a
morphological analysis followed by a superficial, local and partial
syntactic analysis) and with an appropriate treatment of lexical
ambiguities (mainly due to homography). The design of our system
follows this approach with very interesting results. The Apertium
architecture uses finite-state transducers for lexical processing,
hidden Markov models for part-of-speech tagging and finite-state-based
chunking for structural transfer.
The translation engine consists of an 8-module \emph{assembly line},
which is represented in Figure \ref{fg:modules}. To ease diagnosis
and independent testing, modules communicate with each other using text
streams. This way, the input and output of the modules can be checked
at any moment and, when an error in the translation process is
detected, it is easy to test the output of each module separately to
track down the origin of the error. At the same time, communication
via text allows some of the modules to be used in isolation,
independently from the rest of the MT system, for other
natural-language processing tasks, and enables the construction of
prototypes with modified or additional modules.
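
The benefit of text-based communication can be illustrated with a
short sketch. The following Python fragment is not part of Apertium;
it is a minimal illustration which assumes that every module can be
modelled as a function from one text stream to another. The names
\verb!Stage!, \verb!run_pipeline!, \verb!lowercase! and
\verb!mark_words! are hypothetical and chosen only for this example.

\begin{verbatim}
from typing import Callable, List, Sequence, Tuple

Stage = Callable[[str], str]   # a module maps one text stream to another

def run_pipeline(stages: Sequence[Stage], text: str) -> Tuple[str, List[str]]:
    """Apply the stages in order; return the final stream and all
    intermediate streams."""
    trace = []
    for stage in stages:
        text = stage(text)
        trace.append(text)     # plain text, so it can be inspected as is
    return text, trace

# Two toy stages (hypothetical stand-ins, not real Apertium modules).
def lowercase(text: str) -> str:
    return text.lower()

def mark_words(text: str) -> str:
    # Delimit every word, loosely imitating a segmented stream.
    return " ".join("^" + word + "$" for word in text.split())

final, trace = run_pipeline([lowercase, mark_words], "Es una")
\end{verbatim}

Since every element of \verb!trace! is ordinary text, a faulty stage
can be located by reading the intermediate streams one by one, and a
stage can be replaced by a modified version without touching the
others.
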
We decided to encode linguistic data files in
XML\footnote{\url{http://www.w3.org/XML/}}-based formats because of their
interoperability, their independence of the character set and the
availability of many tools and libraries that make it easy to analyse
data in this format. As stated in \cite{ide00}, XML is the
emerging standard for data representation and exchange on the
Internet. Technologies around XML include very powerful mechanisms for
accessing and editing XML documents, which will probably have a
significant impact on the development of tools for natural language
processing and annotated corpora.
367 The modules Apertium consists of are the following:
369 \begin{figure*} {\footnotesize \setlength{\tabcolsep}{0.5mm}
370 \begin{center}
371 \begin{tabular}{cccccccc}
373 \parbox{0.7cm}{SL text} \\
374 $\downarrow$ \\
375 \framebox{\parbox{1.4cm}{de\-formatter}} $\rightarrow$ &
376 \framebox{\parbox{0.8cm}{morph. anal.}} $\rightarrow$ &
377 \framebox{\parbox{1.2cm}{PoS tagger}} $\rightarrow$ &
378 \framebox{\parbox{1.1cm}{struct.\ transf.}} $\rightarrow$ &
379 \framebox{\parbox{0.8cm}{morph. gen.}} $\rightarrow$ &
380 \framebox{\parbox{1.0cm}{post\-genera\-tor}} $\rightarrow$ &
381 \framebox{\parbox{1.2cm}{re-format\-ter}} \\ & & & $\updownarrow$ & &
382 & $\downarrow$ \\ & & & \framebox{\parbox{1.0cm}{lex.\ transfer}} & &
383 & \parbox{0.7cm}{TL text}\\\\
384 \end{tabular}
385 \end{center} }
386 \caption{The eight modules that build the assembly line of the
387 shallow-transfer machine translation system.}
388 \label{fg:modules}
389 \label{pg:modules}
390 \end{figure*}
394 \begin{itemize}
395 \item The \emph{de-formatter}, which separates the text to be
396 translated from the format information (RTF, HTML, etc.); its
397 specification can be found in Section \ref{ss:formato}. Format
398 information is encapsulated so that the rest of the modules treat it
399 as blanks between words. For example, for the HTML text in Spanish:
400 \begin{alltt}
401 es <em>una señal</em>
402 \end{alltt}
(``it is a sign'') the de-formatter encapsulates the HTML tags in
brackets and gives the output:
405 \begin{alltt}
406 es [<em>]una señal[</em>]
407 \end{alltt}
408 The character sequences in brackets are treated by the
409 rest of the modules as simple blanks between words.
410 \item \label{pg:FSFL} The \emph{morphological analyser}, which
411 tokenizes the text in \emph{surface forms} (SF) (lexical units as
412 they appear in texts) and delivers, for each SF, one or more
413 \emph{lexical forms} (LF) consisting of \emph{lemma} (the base form
414 commonly used in classic dictionary entries), the \emph{lexical
415 category} (noun, verb, preposition, etc.) and morphological
416 inflection information (number, gender, person, tense,
417 etc.). Tokenization of a text in SFs is not straightforward due to
418 the existence, on the one hand, of contractions (in Spanish,
419 \emph{del}, \emph{teniéndolo}, \emph{vámonos}; in English,
420 \emph{didn't}, \emph{can't}) and, on the other hand, of lexical
421 units made of more than one word (in Spanish, \emph{a pesar de},
422 \emph{echó de menos}; in English, \emph{in front of}, \emph{taken
423 into account}). The morphological analyser is able to analyse these
424 complex SFs and treat them appropriately so that they can be
425 processed by the next modules. In the case of contractions, the
426 system reads a single surface form and gives as output a sequence of
427 two or more lexical forms (for instance, the Spanish
428 preposition-article contraction \emph{del} would be analysed into
429 two lexical forms, one for the preposition \emph{de} and another one
for the article \emph{el}). Lexical units made of more than one word
(multiwords) are treated as single lexical forms and processed
specifically according to their type.\footnote{For more information
about the treatment of multiwords, please refer to
page~\pageref{ss:multipalabras}.}
436 Upon receiving as input the example text from the previous module, the
437 morphological analyser would deliver:
438 \begin{alltt}
439 ^es/ser<vbser><pri><p3><sg>\$[ <em>]
440 ^una/un<det><ind><f><sg>/unir<vblex><prs><1><sg>/unir
441 <vblex><prs><3><sg>\$
442 ^señal/señal<n><f><sg>\$[</em>]
443 \end{alltt}
where each surface form has been analysed into one or more lexical
forms: \emph{es} has been analysed into a single LF with lemma \emph{ser}
(``to be''), whereas \emph{una} receives three analyses: lemma \emph{un}
(``one''), determiner, indefinite, feminine, singular; lemma \emph{unir}
(``to join''), verb in present subjunctive, 1st person singular; and
lemma \emph{unir}, verb in present subjunctive, 3rd person singular.
This module is generated from a source language (SL) morphological
dictionary, the format of which is specified in Section
\ref{ss:diccionarios}.
455 \item The \emph{part-of-speech tagger} chooses, using a statistical
456 model (hidden Markov model), one of the analyses of an ambiguous word
457 according to its context; in the previous example, the ambiguous word
458 would be the surface form \emph{una}, which can have three different
459 analyses. A sizeable fraction of surface forms (in Romance languages,
460 for instance, around one out of every three words) are ambiguous, that
461 is, they can be analysed into more than one lemma, more than one
462 part-of-speech or have more than one inflection analysis, and are
463 therefore an important source of translation errors when the wrong
464 equivalent is chosen. The statistical model is trained on
465 representative source-language text corpora.
467 The result of processing the example text delivered by the
468 morphological analyser with the part-of-speech tagger would be:
470 \begin{alltt}
471 ^ser<vbser><pri><p3><sg>\$[ <em>]^un<det><ind><f><sg>\$
472 ^señal<n><f><sg>\$[</em>]
473 \end{alltt}
475 where the correct lexical form (determiner) has been selected for the
476 word \emph{una}.
479 The specification of the part-of-speech tagger is in section
480 \ref{ss:tagger}.
\item The \emph{lexical transfer module}, which uses a bilingual
dictionary and is called by the structural transfer module, reads each
SL lexical form and delivers the corresponding target language (TL)
486 lexical form. The dictionary contains a single equivalent for each SL
487 lexical form; that is, no word-sense disambiguation is performed
488 \nota{now not true: lextor}. Multiwords are translated as a single unit.
489 The lexical forms in the running example would be translated into
490 Catalan as follows:
492 \begin{alltt}
493 ser<vbser> \(\longrightarrow\) ser<vbser>
494 un<det> \(\longrightarrow\) un<det>
495 señal<n><f> \(\longrightarrow\) senyal<n><m>
496 \end{alltt}
498 This module is generated from a bilingual dictionary, which is
499 described in Section \ref{ss:diccionarios}.
501 \item The \emph{structural transfer module}, which detects and
502 processes patterns of words (chunks or phrases) that need special
503 processing due to grammatical divergences between the two languages
504 (gender and number changes, word reorderings, changes in prepositions,
505 etc.). This module is generated from a file containing rules which
506 describe the action to be taken for each pattern. In the running
507 example, the pattern formed by
508 \verb!^!\texttt{un<det><ind><f><sg>}\verb!$!
509 \verb!^!\texttt{señal<n><f><sg>}\verb!$! would be detected by a
510 determiner--noun rule, which in this case would change the gender of
511 the determiner so that it agrees with the noun; the result would be:
513 \begin{alltt}
514 ^ser<vbser><pri><p3><sg>\$[ <em>]^un<det><ind><m><sg>\$
515 ^senyal<n><m><sg>\$[</em>]
516 \end{alltt}
The format of the structural transfer rules file, inspired by the one
described in \cite{garridoalenda01p}, is specified in Section
\ref{ss:transfer}.
\item The \emph{morphological generator}, which, from a lexical form in
the target language, generates a suitably inflected surface form. The
result for the example phrase would be:
524 \begin{alltt}
525 és[ <em>]un senyal[</em>]
526 \end{alltt}
528 This module is generated from a morphological dictionary, which is
529 described in detail in Section \ref{ss:diccionarios}.
\item The \emph{post-generator}, which performs some orthographic
operations in the TL, such as contractions and apostrophations. It is
generated from a file of transformation rules whose format, specified
in Section \ref{ss:diccionarios}, is very similar to that of the
dictionaries mentioned above. In the example
text there is no need to perform any contraction or apostrophation.
536 \item The \emph{re-formatter}, which restores the original format
537 information into the translated text; the result for the running
538 example would be the correct conversion of the text into HTML format:
539 \begin{alltt}
540 és <em>un senyal</em>
541 \end{alltt}
544 The specification of the re-formatter is described in Section
545 \ref{ss:formato}.
546 \end{itemize}
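
The output of the morphological analyser shown in the list above can
be read back with very little code. The following Python fragment is
only a sketch, not the parser used by Apertium: it assumes that every
lexical unit is delimited by \verb!^! and \verb!$!, that the surface
form and its analyses are separated by \verb!/!, and that none of
these characters occurs in protected form or inside a superblank. The
names \verb!LexicalUnit!, \verb!parse_units! and
\verb!split_analysis! are hypothetical.

\begin{verbatim}
import re
from typing import List, NamedTuple, Tuple

class LexicalUnit(NamedTuple):
    surface: str           # the surface form (SF)
    analyses: List[str]    # one string per lexical form (LF)

# Everything between ^ and $ is one lexical unit.
UNIT = re.compile(r"\^([^$]*)\$")

def parse_units(stream: str) -> List[LexicalUnit]:
    """Extract the lexical units, ignoring the blanks between them."""
    units = []
    for match in UNIT.finditer(stream):
        surface, *analyses = match.group(1).split("/")
        units.append(LexicalUnit(surface, analyses))
    return units

def split_analysis(analysis: str) -> Tuple[str, List[str]]:
    """Split 'un<det><ind>' into the lemma and its list of tags."""
    lemma = analysis.split("<", 1)[0]
    tags = re.findall(r"<([^>]+)>", analysis)
    return lemma, tags

stream = ("^es/ser<vbser><pri><p3><sg>$[ <em>]"
          "^una/un<det><ind><f><sg>/unir<vblex><prs><1><sg>"
          "/unir<vblex><prs><3><sg>$ "
          "^señal/señal<n><f><sg>$[</em>]")
units = parse_units(stream)
\end{verbatim}

With the running example, \verb!units! holds three lexical units, and
the second one (\emph{una}) carries three analyses, which is exactly
the ambiguity that the part-of-speech tagger has to resolve.
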
548 The four lexical processing modules (morphological analyser, lexical
549 transfer module, morphological generator and post-generator) use a
550 single compiler, based on a class of \emph{finite-state transducers}
551 \cite{garrido99j}, in particular, letter transducers
552 \cite{roche97,ortiz05j}; its characteristics are described in Section
553 \ref{se:compiladoresdic}.
557 \chapter[Stream format specification]{Format specification of the
558 data stream between modules}
\label{se:flujodatos} \nota{Material duplicated in ``de-formatters and
re-formatters'': state it, remove it? - task for Gema}
562 \section{Introduction}
564 The format of the data that circulate between the engine's modules has
565 to be specified so that document processing is more effective and
transparent. The proposed system design (see Chapter
\ref{ss:descrarq}) imposes the need to use three different data stream
types, as shown in Figure \ref{fig:fdatos}.
570 The stream format is text-based to facilitate, among other things, the
571 diagnosis of possible system errors, since it is easy to manipulate
572 the stream in order to reproduce the phenomena that are to be tested,
573 and change it to see the result. Other benefits of using text streams
574 are that it is possible to test independently the output of each
575 module, and that it allows for fast building of prototypes to test the
576 system's global performance, the validity of linguistic data, etc.
580 \begin{figure}[h]
581 \begin{center}
582 \includegraphics[width=14cm]{fdatos}
583 \end{center}
\caption{The different data stream types in the machine translation
system. See the text for their description.}
586 \label{fig:fdatos}
587 \end{figure}
589 The data stream types are:
591 \begin{itemize}
592 \item \textit{Data stream with format:} It is the text in its original
593 format, with no further marks: XML, ANSI text, RTF, HTML, etc. Since
594 it is the original format of the documents, nothing needs to be
595 specified about it except the name of the format.
596 \item \textit{Data stream without format:} It is the text with
597 \textit{superblanks}, that is, with special characters that
598 encapsulate the format (see Section \ref{ss:formato}); superblanks are
599 treated by the linguistic modules as blanks between words (with some
600 exceptions). This is the format generated by the de-formatter and
601 used by the re-formatter when generating the final translated
602 document.
\item \textit{Segmented data stream:} In this format, apart from
superblanks, the lexical units that are to be translated are also
delimited with special characters. These characters are inserted by the
morphological analyser and deleted by the generator, which delivers
the final surface forms.
608 \end{itemize}
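
The relation between the first two stream types can be sketched in a
few lines of Python. The fragment below is a deliberately simplified
illustration, not the actual de-formatter or re-formatter: it assumes
HTML input, treats every tag as one superblank, and assumes that the
text itself contains no square brackets, so that no character
protection is needed. The function names \verb!deformat! and
\verb!reformat! are hypothetical.

\begin{verbatim}
import re

TAG = re.compile(r"<[^<>]+>")              # one HTML tag
SUPERBLANK = re.compile(r"\[([^\[\]]*)\]") # one bracketed superblank

def deformat(text: str) -> str:
    """Encapsulate every HTML tag in brackets (stream without format)."""
    return TAG.sub(lambda m: "[" + m.group(0) + "]", text)

def reformat(stream: str) -> str:
    """Remove the brackets, restoring the format information."""
    return SUPERBLANK.sub(lambda m: m.group(1), stream)

encapsulated = deformat("es <em>una señal</em>")
restored = reformat("és[ <em>]un senyal[</em>]")
\end{verbatim}

Here \verb!encapsulated! reproduces the de-formatter output of the
running example, and \verb!restored! the final HTML delivered by the
re-formatter. Applying \verb!reformat! to the result of
\verb!deformat! returns the original text unchanged, which is the
property the two modules must preserve.
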
We describe next the characteristics of the data stream used between
the modules of the translator, that is, the second and the third
stream types. In general terms, it is a plain-text format marked with
characters that have a special meaning. This format is intended for
processing on servers that translate large volumes of text.
Some of the formats that the engine can process may contain extensive
blocks of information in binary format (RTF documents, for instance,
may include bitmap images). To enable efficient processing of this
type of document, we designed a way to extract this information and
restore it after translation has been performed; see Section
\ref{ss:formato} for a complete description.
624 \section{Data stream without format}
Data stream without format is output by the de-formatter and by the
generator \nota{not quite: post-generator}, and is used as input by
the morphological analyser, the post-generator and the re-formatter.

The following subsection describes the method used to delimit
\textit{superblanks} and \textit{extensive superblanks}. As an example
we will use the HTML document in
Figure~\ref{fg:docorig}.
636 \begin{figure}[htbp]
637 \begin{small}
638 \begin{alltt}
639 <\textbf{html}>
640 <\textbf{head}>
641 <\textbf{title}>Title</\textbf{title}>
642 </\textbf{head}>
643 <\textbf{body}>
644 <\textbf{p}>Divided
645 sentence</\textbf{p}>
646 </\textbf{body}>
647 </\textbf{html}>
648 \end{alltt}
649 \end{small}
650 \caption{Example of HTML document}
651 \label{fg:docorig}
652 \end{figure}
The structural elements that this data stream type must include are
the following:
657 \begin{itemize}
658 \item \textit{Superblanks}. Blocks that contain segments of format
659 information included in the documents, when these are short.
660 \item \textit{Extensive superblanks}. Marks that are used to specify
661 external documents that include segments of format information for the
662 document being processed, when these segments are long.
663 \item \textit{Text}. The document text that can be translated.
\item \textit{Artificial sentence endings}. \label{finfrase} When the
format in the document suggests a sentence separation that is not
signalled by any punctuation mark (for instance, titles with no full
stop at the end, or the content of cells in a table), the format
processing must have a mechanism (invisible to the user) that enables
the marking of these sentence endings.
670 \item \textit{Special characters protection (for non-XML stream)}.
671 Characters that must be protected to avoid conflict with the ones
672 used in the data stream format.
673 \end{itemize}
675 % \subsection{XML format}
677 % En este tipo de flujo se usa el elemento \texttt{<\textbf{b}>} para definir los
678 % superblancos y los superblancos extensos. Para el caso de los
679 % \textbf{superblancos} la sintaxis es la siguiente:
681 % \begin{small}
682 % \begin{alltt} % <\textbf{b}>\textit{contenido del bloque de formato}</\textbf{b}>
683 % \end{alltt}
684 % \end{small}
686 % Hay que resaltar que para los formatos basados en SGML, es necesario
687 % incluir el formato en bloques \texttt{<![CDATA[\ldots]]>} dentro de
688 % las marcas indicadas. \nota{millor dir com són: prendre text de EAMT
689 % '05 - Gema} Por su parte, los \textit{superblancos extensos} se deben
690 % expresar, a modo de atributos, de la siguiente manera:
692 % \begin{small}
693 % \begin{alltt} % <\textbf{b} \textsl{filename}="\textit{nombre de fichero}"/>
694 % \end{alltt}
695 % \end{small}
697 % El \emph{texto} estará incluido entre los elementos \textbf{b} que se
698 % acaban de explicar sin ninguna marca de estructura particular.
700 % Los \emph{finales de frase artificiales} se expresan mediante un punto y un
701 % superblanco vacío inmediatamente a continuación.
703 % \begin{small}
704 % \begin{alltt} % .<\textbf{b}/>
705 % \end{alltt}
706 % \end{small}
708 % Resumiendo, el flujo de datos de un documento en cualquier formato de los que
709 % trata el traductor se reduce a otro documento XML que debe cumplir la
710 % siguiente DTD:
712 % \begin{small}
713 % \begin{alltt} % <!\textsl{ELEMENT} \textbf{document} (b|\textsl{#PCDATA})*>
714 % <!\textsl{ELEMENT} \textbf{b} (\textsl{#PCDATA}?)>
715 % <!\textsl{ATTLIST} b filename \textsl{CDATA} \textsl{#IMPLIED}>
716 % \end{alltt}
717 % \end{small}
719 % El resultado de encapsular el formato del fichero de la
720 % figura~\ref{fg:docorig} en el flujo con formato XML se ve en la
721 % figura~\ref{fg:docorigXML}. Si hubiese algún superblanco que por su longitud
722 % se convirtiese en un superblanco extenso, la forma de especificarlo sería como sigue:
723 % \begin{small}
724 % \begin{alltt} % <\textbf{b} \textsl{filename}="/tmp/ficherotemporal"/>
725 % \end{alltt}
726 % \end{small}donde \texttt{"/tmp/ficherotemporal"} es un fichero que
727 % contiene el superblanco extenso para que pueda ser recuperado por el reformateador.
729 % \begin{figure}
730 % \begin{small}
731 % \begin{alltt} % <?\textbf{xml} \textsl{version}="1.0" \textsl{encoding}="iso-8859-15"?>
732 % <\textbf{document}>
733 % <\textbf{b}><![CDATA[<html> % <head>
734 % <title>]]></\textbf{b}>Título.<\textbf{b}/><\textbf{b}><![CDATA[</title>
735 % </head> % <body> % <p>]]></\textbf{b}>Frase<\textbf{b}><![CDATA[
736 % ]]></\textbf{b}>dividida.<\textbf{b}/><\textbf{b}><![CDATA[ % </body>
737 % </html>]]></\textbf{b}> % </\textbf{document}>
738 % \end{alltt}
739 % \end{small}
740 % \caption{El documento de la figura \protect\ref{fg:docorig} con el
741 % formato encapsulado usando marcas en XML y segmentos
742 %\texttt{<![CDATA[\ldots]]>}}
743 % \label{fg:docorigXML}
744 % \end{figure}
746 %\subsection{Formato no XML}
747 \subsection{Stream format}
748 \label{se:noxml1} This format is based on the one used in the machine
749 translation systems \textsf{interNOSTRUM}
750 \cite{canals01b,garridoalenda01p,garrido99j} and \textsf{Traductor
751 Universia} \cite{garrido03p, gilabert03j}.
753 In this stream type, the characters \texttt{[} and \texttt{]} are used
754 to indicate \emph{superblanks}, as shown in the following example:
756 \begin{small}
757 \begin{alltt}
758 [\textit{superblank content}]
759 \end{alltt}
760 \end{small}
762 In the case of \emph{extensive superblanks}, the file name is
763 specified using the at sign \texttt{@}:
765 \begin{small}
766 \begin{alltt}
767 [@\textit{file name}]
768 \end{alltt}
769 \end{small}
771 The \emph{text} is outside the superblank marks.
773 \emph{Artificial sentence endings}
774 are expressed by a full stop and an empty superblank right after it.
776 \begin{small}
\begin{alltt}
.[]
\end{alltt}
780 \end{small}
782 The following table shows the \textbf{protected characters}:
784 \begin{center}
785 \begin{tabular}{|l|c|c|l|} \hline Name & Character & Protected form&
786 Meaning \\
787 \hline
788 At & \texttt{@} & \verb!\@! & External superblank\\
789 Slash & \texttt{/} & \verb!\/! & Divider of meanings\\
790 Backslash & \verb!\! & \verb!\\! & Protection character \\
791 Caret & \verb!^! & \verb!\^! & Beginning of LF\\
792 Opening square bracket & \texttt{[} & \verb!\[! & Beginning of blank\\
793 Closing square bracket & \texttt{]} & \verb!\]! & End of blank \\
794 Dollar & \verb!$! & \verb!\$! & End of LF\\
Less than & \texttt{<} & \verb!\<! & Begin. of morph. symbol\\
Greater than & \texttt{>} & \verb!\>! & End of morph. symbol \\
797 \hline
798 \end{tabular}
799 \end{center}
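
The protection rule in the table can be illustrated with a minimal
Python sketch. It is not part of Apertium: the names \texttt{protect}
and \texttt{unprotect} are hypothetical, and the sketch assumes that
only the nine characters listed above need protection.

\begin{small}
\begin{verbatim}
# The nine protected characters from the table (assumed complete).
PROTECTED = set("@/\\^[]$<>")

def protect(text):
    """Prefix every protected character with a backslash."""
    return "".join("\\" + ch if ch in PROTECTED else ch
                   for ch in text)

def unprotect(text):
    """Inverse of protect: drop the protecting backslashes."""
    out = []
    chars = iter(text)
    for ch in chars:
        if ch == "\\":
            # Keep the character that the backslash protects.
            ch = next(chars, "")
        out.append(ch)
    return "".join(out)
\end{verbatim}
\end{small}

Protecting a string and then unprotecting it is expected to give back
the original text, which is what allows the reformatter to restore it.
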
Figure~\ref{fg:docorigtext} shows the document in
Figure~\ref{fg:docorig} after encapsulation.
805 \begin{figure}[here]
806 \begin{small}
807 \begin{alltt}
808 [<html>
809 <head>
810 <title>]Title.[][</title>
811 </head>
812 <body>
813 <p>]Divided[
814 ]sentence.[][</p>
815 </body>
</html>]
817 \end{alltt}
818 \end{small}
819 \caption{The document in Figure \protect\ref{fg:docorig} with format
820 encapsulated using square brackets}
821 \label{fg:docorigtext}
822 \end{figure}
825 \section{Segmented data stream}
The segmented data stream is the stream that circulates between the
828 modules that handle linguistic information in the translation engine.
829 In this stream, words are delimited and labelled. There are two types
830 of segmented stream:
832 \begin{itemize}
833 \item \textit{Ambiguous segmented stream}. Its main characteristic is
834 that words have a surface form and potentially more than one lexical
835 form (lexical multiform). This stream type is the format in which
836 the morphological analyser provides the input data for the
part-of-speech tagger (see diagram \ref{eq:formaanalizada} on
page~\pageref{formaanalizada} for a detailed description of the
ambiguous segmented stream).
841 \item \textit{Unambiguous segmented stream}. It has only one lexical
842 form for each word and it does not include the surface form. This is
843 the format in which data circulate from the part-of-speech tagger to
844 the transfer module, and from this module to the generator (see
diagram \ref{eq:formaanalizada2} on page~\pageref{formaanalizada2} for
a detailed description of the format of the unambiguous segmented stream).
847 \end{itemize}
849 Furthermore, besides the information already marked in the data stream
850 without format, the new stream has to enable marking of the following
851 information:
853 \begin{itemize}
854 \item \textit{Lexical units}. A lexical unit is made of a surface
855 form (in the case of ambiguous segmented stream) plus one or more
856 lexical forms (the different possible analyses of the SF) with their
857 grammatical symbols.
858 \item \textit{Surface forms (ambiguous segmented stream)}. The word
859 as it appears in the original text.
860 \item \textit{Lexical forms}. The lemma of the word and its
861 grammatical symbols.
862 \item \textit{Grammatical symbols}. They describe the morphological
863 and grammatical attributes of a surface form.
864 \end{itemize}
866 % \subsection{XML format}
868 % Las \textit{palabras} se etiquetan de la forma que se muestra a
869 % continuación:
871 % \begin{small}
872 % \begin{alltt} % <\textbf{w}>\textit{información de la palabra}</\textbf{w}>
873 % \end{alltt}
874 % \end{small}
876 % Para el caso del \textit{flujo de datos segmentado ambiguo}, la
877 % \textit{forma superficial} se indica en el interior de un elemento
878 % \texttt{<\textbf{w}>} mediante el contenido de un único elemento
879 %\texttt{<\textbf{sf}>}. A continuación, se sitúan la forma o
880 %\textit{formas léxicas} que sean necesarias:
882 % \begin{small}
883 % \begin{alltt} % <\textbf{w}> % <\textbf{sf}>\textit{forma superficial}</\textbf{sf}>
884 % <\textbf{lf}>\textit{forma léxica 1}</\textbf{lf}>
885 % <\textbf{lf}>\textit{forma léxica 2 (opcional)}</\textbf{lf}>
886 % ... % </\textbf{w}>
887 % \end{alltt}
888 % \end{small}
890 % Para el caso del flujo no ambiguo, sólo se especifica una única forma léxica.
893 % \begin{small}
894 % \begin{alltt} % <\textbf{w}> % <\textbf{lf}>\textit{forma léxica}</\textbf{lf}> % </\textbf{w}>
895 % \end{alltt}
896 % \end{small}
898 % %% \pagebreak
900 % La DTD de este flujo de datos para textos \textit{sin desambiguar} es la % que se muestra en la figura~\ref{fg:ambdtd} a continuación.
902 % \begin{figure}[here]
903 % \begin{small}
904 % \begin{alltt}
905 % <!\textsl{ELEMENT} \textbf{document} (b|w|\textsl{#PCDATA})*>
906 % <!-- atención, el #PCDATA anterior sigue siendo necesario para los
907 % carácteres no etiquetados y que no forman parte del formato -->
908 % <!\textsl{ELEMENT} \textbf{b} (\textsl{#PCDATA}?)>
909 % <!\textsl{ATTLIST} b filename \textsl{CDATA} \textsl{#IMPLIED}>
910 % <!\textsl{ELEMENT} \textbf{w} (sf,lf+)>
911 % <!\textsl{ELEMENT} \textbf{sf} (\textsl{#PCDATA})>
912 % <!\textsl{ELEMENT} \textbf{lf} (\textsl{#PCDATA}|s)+>
913 % <!\textsl{ELEMENT} \textbf{s} \textsl{EMPTY}>
914 % <!\textsl{ATTLIST} s n \textsl{IDREF #REQUIRED}>
915 % \end{alltt}
916 % \end{small}
917 % \caption{DTD para textos no desambiguados con formato XML}
918 % \label{fg:ambdtd}
919 % \end{figure}
924 % Para los ya \textit{ desambiguados}, los textos deben cumplir la DTD de la figura~\ref{fg:desambdtd}.
926 % \begin{alltt}
927 % <!\textsl{ELEMENT} \textbf{document} (b|w|\textsl{#PCDATA})*>
928 % <!-- atención, el #PCDATA anterior sigue siendo necesario para los
929 % carácteres no etiquetados y que no forman parte del formato -->
930 % <!\textsl{ELEMENT} \textbf{b} (\textsl{#PCDATA}?)>
931 % <!\textsl{ATTLIST} b filename \textsl{CDATA} \textsl{#IMPLIED}>
932 % <!\textsl{ELEMENT} \textbf{w} (lf)>
933 % <!\textsl{ELEMENT} \textbf{lf} (\textsl{#PCDATA}|s)+>
934 % <!\textsl{ELEMENT} \textbf{s} \textsl{EMPTY}>
935 % <!\textsl{ATTLIST} s n \textsl{IDREF #REQUIRED}>
936 % \end{alltt}
937 % \end{small}
938 % \caption{DTD para textos desambiguados con formato XML}
939 % \label{fg:desambdtd}
940 % \end{figure}
942 % La figura~\ref{fg:docorigXML2} muestra un ejemplo de segmentación del flujo
943 % que incluye la forma de encapsular el formato y la información léxica. Este
944 % ejemplo es para el caso de flujo segmentado ambiguo y corresponde al texto
945 % HTML original de la figura~\ref{fg:docorig}.
947 % \begin{figure}[htbp]
948 % \begin{small}
949 % \begin{alltt}
950 % <?\textbf{xml} \textsl{version}="1.0" \textsl{encoding}="iso-8859-15"?>
951 % <document>
952 % <\textbf{b}><![CDATA[<html>
953 % <head>
954 % <title>]]></\textbf{b}>
955 % <\textbf{w}>
956 % <\textbf{sf}>Título<\textbf{sf}>
957 % <\textbf{lf}>Título<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="m"/><\textbf{s} \textsl{n}="sg"/></\textbf{lf}>
958 % </\textbf{w}>
959 % <\textbf{w}>
960 % <\textbf{sf}>.</\textbf{sf}>
961 % <\textbf{lf}>.<s n="sent"/></\textbf{lf}>
962 % </\textbf{w}><\textbf{b}/>
963 % <\textbf{b}><![CDATA[</title>
964 % </head>
965 % <body>
966 % <p>]]></\textbf{b}>
967 % <\textbf{w}>
968 % <\textbf{sf}>Frase</\textbf{sf}>
969 % <\textbf{lf}>Frase<s n="n"/><s n="f"/><s n="sg"/></\textbf{lf}>
970 % </\textbf{w}>
971 % <\textbf{b}><![CDATA[
972 % ]]></\textbf{b}>
973 % <\textbf{w}>
974 % <\textbf{sf}>dividida</\textbf{sf}>
975 % <\textbf{lf}>dividir<s n="vblex"/><s n="pp"/><s n="f"/><s n="sg"/></\textbf{lf}>
976 % </\textbf{w}>
977 % <\textbf{w}>
978 % <\textbf{sf}>.</\textbf{sf}>
979 % <\textbf{lf}>.<s n="sent"/></\textbf{lf}>
980 % </\textbf{w}><\textbf{b}/>
981 % <\textbf{b}><![CDATA[
982 % </body>
983 % <html>]]></\textbf{b}>
984 % </document>
985 % \end{alltt}
986 % \end{small}
987 % \caption{Ejemplo de flujo segmentado con el formato encapsulado en XML,
988 % correspondiente al documento HTML de la figura~\ref{fg:docorig}.}
989 % \label{fg:docorigXML2}
990 % \end{figure}
991 %\subsection{Formato no XML}
992 %\subsubsection{Formato de flujo}
993 \label{se:noxml2} The symbols '\verb!^!' for word beginning and
994 '\verb!$!' for word end are used to delimit \textit{words}, as shown
995 in this example:
997 \begin{small}
998 \begin{alltt}
999 \verb!^!\textit{word}\verb!$!
1000 \end{alltt}
1001 \end{small}
To separate the \textit{surface form} from the \textit{lexical forms}
that follow it, the symbol \texttt{/} is used. This separator only
makes sense in the ambiguous segmented stream, since the unambiguous
stream contains only the lexical form. It is used as follows:
1009 \begin{small}
1010 \begin{alltt}
1011 \verb!^!\textit{surface form}/\textit{lexical form 1}/...\verb!$!
1012 \end{alltt}
1013 \end{small}
1015 Lexical forms can include symbols (generally located at the end), as
1016 shown in the example of Figure \ref{fg:docorigtext2}.
1019 \begin{figure}
1020 \begin{small}
1021 \begin{alltt}
1022 [<html>
1023 <head>
1024 <title>]^Title/Title<n><m><sg>\$^./.<sent>\$[][</title>
1025 </head>
1026 <body>
<p>]^Divided/Divide<vblex><pp>/Divide<vblex><past>\$[
]^sentence/sentence<n><sg>/sentence<vblex><inf>\$^./.<sent>\$[][</p>
</body>
</html>]
1031 \end{alltt}
1032 \end{small}
1033 \caption{Example of segmented stream with format encapsulated in
1034 non-XML format, corresponding to the HTML document in Figure
1035 ~\ref{fg:docorig}.}
1036 \label{fg:docorigtext2}
1037 \end{figure}
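
As an illustration of how this format can be read, the following
Python sketch extracts the lexical units from a segmented stream. It
is only a sketch under stated assumptions, not the code used by the
Apertium modules: the names \texttt{TOKEN},
\texttt{split\_unprotected} and \texttt{lexical\_units} are
hypothetical, and superblanks are simply skipped.

\begin{small}
\begin{verbatim}
import re

# A protected character, a superblank, or a lexical unit ^...$
# (only the lexical unit is captured).
TOKEN = re.compile(
    r"\\.|\[(?:\\.|[^\\\]])*\]|\^((?:\\.|[^\\$])*)\$", re.S)

def split_unprotected(text, sep="/"):
    """Split on sep, ignoring separators protected by a backslash."""
    parts, current, chars = [], [], iter(text)
    for ch in chars:
        if ch == "\\":
            current.append(ch + next(chars, ""))
        elif ch == sep:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts

def lexical_units(stream):
    """Yield (surface form, [lexical forms]) for each unit."""
    for match in TOKEN.finditer(stream):
        if match.group(1) is not None:
            surface, *analyses = split_unprotected(match.group(1))
            yield surface, analyses
\end{verbatim}
\end{small}

Because protected characters and superblanks are consumed by the same
pattern, a caret or a slash that appears protected or inside a
superblank is not mistaken for a delimiter.
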
1043 \chapter{Modules specification}
1044 \label{se:especificmodulos}
1048 \section{Lexical processing modules}
1049 \label{ss:modproclex}
1051 \subsection{Module description }
1052 \label{ss:funcproclex}
1054 One of the most efficient approaches to lexical processing is based on
1055 the use of finite-state transducers (FST)
\cite{mohri97a,roche97b}. FSTs are a type of finite-state automaton
that can be used as one-pass morphological analysers and generators
and can be implemented very efficiently. In this project, we have used
a class of FSTs called letter transducers
1060 \cite{roche97b,garrido02a,garrido99j}; in fact, any finite-state
1061 transducer may always be turned into a letter-transducer. Garrido and
1062 collaborators \cite{garrido99j,garrido02a} give a formal definition of
1063 the letter transducers used in this project; describing them
1064 informally, a letter-transducer is an idealised machine consisting of:
1065 \begin{enumerate}
1067 \item A (finite) set of states, that is, of situations in which the
1068 transducer can be while it is reading, from left to right, the input
1069 letters or symbols. Among the states of the set, we can distinguish:
1071 \begin{enumerate}
1072 \item A single initial state: this is the state in which the
1073 transducer is before processing the first letter or the first symbol
1074 of the input.
1075 \item One or more acceptance states, which are only reached after
1076 having completely read a valid entry and, therefore, are used to
1077 detect valid words.
1078 \end{enumerate}
1079 \item A set (also finite) of state transitions consisting of:
1080 \begin{enumerate}
1081 \item the origin state
1082 \item the destination state
1083 \item the input letter or symbol
1084 \item the output letter or symbol
\end{enumerate} So that input and output can have different lengths
at any time, a transition may lack an input symbol, an output symbol,
or both. A missing symbol is generally represented using a special
symbol (the empty symbol).
1090 \end{enumerate}
Every time the transducer reads an input symbol, it creates a list of
\emph{live} or \emph{active} states, each of which has an
associated output (a sequence of symbols). The way the letter
transducer works is different for each type of lexical processing
operation. For example, in morphological analysis, the transducer
tries to read the longest input string recognised by the dictionary
(``left-to-right, longest-match'' mode), as follows:
1099 \begin{enumerate}
1100 \item Beginning: the set of live states is given a single live state:
1101 the initial state, with the empty word ("") as output associated to
1102 the state.
1103 \item When from one of the states in the current set of live states it
1104 is possible to reach other states through transitions that do not have
1105 input symbol, these states are added to the set of live states, and
1106 are associated to the output obtained when extending the associated
1107 outputs with the output symbol found in the corresponding
1108 transitions. This expansion operation of the set of live states
1109 continues until it is not possible to add more states.
1110 \item A symbol from the input word is read.
\item A new set of live states is created, consisting of the states
reached through transitions that have that symbol as input; these
states are associated with the outputs extended by adding the
corresponding output symbols found in the transitions.
\item If the current set contains any live state, the process
continues from step 2.
\item Otherwise, the sets of live states are read backwards until a
set is found which contains acceptance states. The morphological
analyses are the outputs associated with these states, and the reading
position is set to the position immediately after this set (so that
the input read beyond it can be processed again by the transducer in
the next pass).
1122 \end{enumerate} Not all acceptance states have the same
1123 characteristics, and this fact adds more conditions to the acceptance
1124 process, in order to be able to deal with unknown words or with words
1125 that are joined to other words, as will be explained later.
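
The six steps of the analysis procedure described above can be
sketched in Python. This is an illustrative sketch, not the Apertium
implementation. It assumes that the transducer is given as a mapping
from each state to a list of (input symbol, output string,
destination state) triples, that the empty string stands for the empty
symbol, and that transitions without input symbol do not form cycles.
The additional acceptance conditions for unknown or joined words are
ignored, and all names, including the toy compiler \texttt{build}, are
hypothetical.

\begin{small}
\begin{verbatim}
def build(pairs):
    """Toy compiler: one path per (surface, lexical) pair; the
    whole lexical form is emitted on the last input letter."""
    trans, finals, n = {}, set(), 0
    for surface, lexical in pairs:
        state = 0                       # 0 is the initial state
        for i, ch in enumerate(surface):
            n += 1
            out = lexical if i == len(surface) - 1 else ""
            trans.setdefault(state, []).append((ch, out, n))
            state = n
        finals.add(state)
    return trans, finals

def closure(live, trans):
    """Step 2: follow transitions without input symbol."""
    live = set(live)
    pending = list(live)
    while pending:
        state, out = pending.pop()
        for sym, osym, dest in trans.get(state, ()):
            new = (dest, out + osym)
            if sym == "" and new not in live:
                live.add(new)
                pending.append(new)
    return live

def step(live, symbol, trans):
    """Step 4: states reached by consuming one input symbol."""
    return {(dest, out + osym)
            for state, out in live
            for sym, osym, dest in trans.get(state, ())
            if sym == symbol}

def analyse_lrlm(text, start, trans, initial, finals):
    """Return (analyses, next position) for the longest match."""
    live = closure({(initial, "")}, trans)       # steps 1 and 2
    history = [live]          # history[k]: live set after k symbols
    pos = start
    while live and pos < len(text):              # step 5
        live = closure(step(live, text[pos], trans), trans)
        history.append(live)                     # steps 3, 4 and 2
        pos += 1
    for consumed in range(len(history) - 1, 0, -1):   # step 6
        analyses = sorted(out for state, out in history[consumed]
                          if state in finals)
        if analyses:
            return analyses, start + consumed
    return [], start          # no entry matches at this position
\end{verbatim}
\end{small}

With the toy entries \texttt{can} (two analyses) and \texttt{canal}
(one analysis), the input \texttt{canals} is read up to the failing
letter \texttt{s} and the sketch then backs up to the acceptance state
reached after \texttt{canal}, the longest match.
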
The transducer reads the input word only once on average, from left
to right and symbol by symbol, and keeps a tentative list of possible
1129 partial outputs that is updated and pruned as the input is being
1130 read. When letter transducers are used as morphological analysers or
1131 as lemmatizers, they read a surface form and write the resulting
1132 lexical form(s). In this case, input symbols are the letters of the
1133 surface form, and output symbols are the letters needed to write the
1134 lemmas, as well as the letters and special symbols needed to represent
1135 the morphological analysis, such as in \texttt{<n>}, \texttt{<f>},
1136 \texttt{<2p>}, etc.
1138 The transducers work in a similar way for other lexical processing
1139 tasks.
1141 \nota{La noció de LRLM (left-to-right, longest-match) (o ODSCML,
1142 "izquierda a derecha, recortando el segmento concordante más largo")
1143 ha de quedar clara en el funcionamient del morfològic i del trànsfer
1144 estructural. Afegir coses de l'article de EAMT 2005.}
1146 \subsubsection{Letter case handling in dictionaries}
1147 \label{mayusc}
1149 The same input word in a lexical processing module can be written
1150 differently regarding letter case. The most frequent cases are:
1152 \begin{itemize}
1153 \item The whole word is in lower case.
1154 \item The whole word is in upper case.
1155 \item The first letter is capitalised and the rest is in lower case
1156 (typical case for proper nouns).
1157 \end{itemize}
The transductions in the dictionary can also be written in any of
these three ways. The way in which a word is written in the
dictionary is used to discard possible analyses of the word,
according to the following rules:
\begin{itemize}
\item If the input letter is upper case and the current analysis
state has matching transitions in lower case, these transductions
are made.
\item If the input letter is lower case and the current state has no
matching transitions in lower case, no transduction is made.
\end{itemize}
Thanks to this policy, a surface form that is not capitalised cannot
be analysed as a proper noun.
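
The two rules can be summarised in a short Python sketch. It compares
whole words instead of following transitions one by one, which is a
simplification, and the function name \texttt{case\_compatible} is
hypothetical.

\begin{small}
\begin{verbatim}
def case_compatible(surface, entry):
    """True if the dictionary spelling 'entry' may analyse
    the input word 'surface'.

    An upper-case input letter may follow a lower-case
    transition; a lower-case input letter may only follow a
    lower-case transition.
    """
    if len(surface) != len(entry):
        return False
    return all(s == e or (s.isupper() and s.lower() == e)
               for s, e in zip(surface, entry))
\end{verbatim}
\end{small}

Under this sketch, an input such as \texttt{HOUSE} is compatible with
a dictionary entry \texttt{house}, whereas an input \texttt{paris} is
not compatible with an entry \texttt{Paris}.
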
1176 The case of an input word will be maintained in the output of the
1177 translator unless it is decided not to do so. The case can be changed
1178 in the structural transfer module; this option is useful, for example,
1179 when there is a reordering of words or when a word is added before a
1180 capitalised word at the beginning of a sentence, such as in the
1181 translation of the Catalan phrase \emph{Vindran} into
1182 English: \emph{They will come}.
1185 \subsection{Data format: the dictionaries}
1186 \label{ss:diccionarios}
1187 \subsubsection{General criteria for dictionary design}
1189 The experience of the Transducens group at the Universitat d'Alacant
1190 in the creation of machine translation systems between Romance
1191 languages (\texttt{es}, \texttt{ca} and \texttt{pt}) already operative
1192 and publicly accessible has inspired the main characteristics of the
1193 whole shallow-transfer machine translation system described in this
1194 document, as well as its application to the Romance languages of Spain
1195 (\texttt{es}, \texttt{ca} and \texttt{gl}). In some sense, it could be
1196 stated that in the present project the only work was to adapt (rewrite
1197 in a standardised and interoperable format) the specifications and
1198 programs used in already operative projects.
In particular, the design of the dictionaries is based on an
architecture that aims to separate, as far as possible, the source
language from the target language, even though these
dictionaries are translation-oriented and, therefore, it is not
advisable to build them completely separately. The chosen format
1205 is used for the specification of both morphological dictionaries
1206 (monolingual) and bilingual dictionaries.
The format for dictionaries, as well as for the rest of the linguistic
data (definition file for the part-of-speech tagger and structural
transfer rules), is XML\footnote{\url{http://www.w3.org/XML/}}, an
international standard used in numerous natural language processing
projects which, thanks to the availability of many utilities and
libraries, is becoming a very powerful tool for linguistic data
representation and exchange (see \cite{ide00}).
Dictionaries are designed so that they can be compiled into
\textit{letter transducers}, for efficiency reasons. For more
1220 information on letter transducers as a particular case of finite-state
1221 transducers, see Section \ref{ss:funcproclex} or the article
1222 \cite{garrido02a}.
1224 The letter transducers that are generated from the system dictionaries
1225 (morphological, bilingual and post-generation dictionaries) process
1226 input character strings to produce output strings. According to this,
1227 dictionaries are made of entries consisting of string pairs that
1228 correspond to the inputs and outputs of the transducer.
1231 The most powerful tool in these dictionaries is the definition and use
1232 of \emph{paradigms}. Since in Romance languages a lot of lemmas share
1233 the same inflection pattern (there are regularities in their
1234 inflection), it is useful and straightforward to group these
1235 regularities in inflection paradigms to avoid having to write all the
1236 forms of every word. Paradigms allow the representation of dictionary
1237 entries compactly and help optimise the speed for building a
1238 dictionary. Once the most frequent paradigms in a dictionary are
1239 defined, the linguist does not need to bother, in most of the cases,
1240 with the whole inflection of a new term, since entering an inflective
1241 word is generally limited to writing the lemma and choosing one
1242 inflection pattern among the previously defined paradigms.
Furthermore, the use of paradigms reduces memory requirements,
facilitates the construction of efficient letter transducers and
speeds up the compilation process \cite{ortiz05j}. We did not use
paradigms in bilingual dictionaries (although it is possible to do so)
because most of the inflection information is processed implicitly in
these dictionaries, as explained on page~\pageref{ss:bil}.
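
The idea behind paradigms can be sketched in a few lines of Python.
The sketch is an assumption-laden illustration, not the dictionary
compiler: the paradigm name, the table \texttt{PARADIGMS} and the
function \texttt{expand} are hypothetical. A paradigm is modelled as a
list of pairs (surface ending, lexical ending with symbols), and an
entry only supplies the stem and the paradigm name.

\begin{small}
\begin{verbatim}
# Hypothetical paradigm for Spanish masculine nouns in -o.
PARADIGMS = {
    "o__n": [("o", "o<n><m><sg>"),
             ("os", "o<n><m><pl>")],
}

def expand(stem, paradigm):
    """List the (surface form, lexical form) pairs of an entry."""
    return [(stem + surface_end, stem + lexical_end)
            for surface_end, lexical_end in PARADIGMS[paradigm]]
\end{verbatim}
\end{small}

Adding a new noun with the same inflection then amounts to calling
\texttt{expand} with a different stem, which mirrors the way the
linguist writes only the lemma and chooses a paradigm.
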
1252 \subsubsection{Dictionary types}
1254 In our system there are three types of dictionaries: morphological
1255 (monolingual) dictionaries for each of the languages involved
1256 (Spanish, Catalan and Galician); bilingual dictionaries for the
1257 different translation pairs (Spanish--Catalan and Spanish--Galician),
1258 and post-generation dictionaries for each of the languages (a
1259 post-generation dictionary is not a typical dictionary, with lemmas
1260 and morphological information, but is like a little dictionary of the
1261 orthographic transformations that may undergo words when they come
1262 together). The structure of the three dictionary types is specified
1263 by the same DTD (\emph{Document Type Definition}), which can be found
1264 in Appendix \ref{ss:dtd_dics}.
\textbf{Morphological dictionaries} are used for building both
morphological analysers (the translation system module that
obtains all the possible lexical forms for a given surface form in
the source language) and morphological generators
(the module that generates the surface form in the target language
from the lexical form of each word). These two modules are obtained
from a single morphological dictionary, depending on the direction in
which it is read by the system: read from left to right, we obtain the
analyser, and read from right to left, the generator.
1277 The block structure typical for these dictionaries is the following:
1279 \begin{itemize}
1280 \item \textit{An alphabet definition}. This definition is used
1281 exclusively for building the morphological analyser; specifically, it
1282 enables the morphological analyser to appropriately tokenize unknown
words and the ones in the conditional sections (see the description of
the element \texttt{<section>} on page~\pageref{ss:section}); the
1285 morphological generator does not need this definition.
1287 \item \textit{A definition of symbols}. It consists of a declaration
1288 of the grammatical symbols that will be used in dictionary entries
1289 (you can find in Appendix \ref{se:simbolosmorf} a list with the
1290 grammatical symbols used in this project).
1291 \item \textit{A definition of paradigms}. Paradigms need to be
1292 defined here in order to be used in the dictionary sections or in other
1293 paradigms.
1294 \item \textit{One or more dictionary sections with conditional
1295 tokenization}, type \texttt{standard}. To include most of the words
1296 of the dictionary.
1297 \item \textit{One or more dictionary sections with unconditional
tokenization}. To include certain words that follow a regular
pattern or that are tokenized regardless of the text directly after
them (see the description of the element \texttt{<section>} on
page~\pageref{ss:section}). In the Catalan morphological dictionaries,
1302 words requiring an unconditional tokenization are distributed in two
1303 sections: one for the forms that require the introduction of a blank
1304 immediately after (due to processing requirements of the lexical
1305 forms), like the apostrophized forms \emph{l'} or \emph{d'}, and
1306 another one for punctuation marks, numbers and other signs.
1308 \end{itemize}
1310 \textbf{Bilingual dictionaries} represent in the system the lexical
1311 transfer process, that is, the assignment of the TL lexical form that
1312 corresponds to each SL lexical form. Two \emph{products} are obtained
1313 from each bilingual dictionary, depending on the direction in which it
1314 is read by the system: when the dictionary is read from left to right,
1315 we obtain the lexical transfer module in one translation direction,
1316 and when it is read from right to left, in the other direction. For
1317 the bilingual dictionaries of our project, it has been established
1318 that Spanish will be put always on the left side of the entries, and
1319 the rest of the languages (Catalan and Galician), on the right
1320 side. Thus, for example, the bilingual Spanish--Galician dictionary
1321 will be read from left to right for the translation
1322 \texttt{es}--\texttt{gl} and from right to left for the translation
\texttt{gl}--\texttt{es}. In applications like the ones in this
project, these dictionaries do not have paradigms: they are built with
generic entries which almost always contain no more information than
the lemma and part of speech, and there is no inflection information.
1328 The block structure used in the bilingual dictionaries of this project
1329 is the following:
1331 \begin{itemize}
1332 \item \textit{A definition of symbols}. It consists of a declaration
1333 of the grammatical symbols that will be used in dictionary entries.
1334 \item \textit{A single dictionary section}. Where bilingual
1335 correspondences are specified.
1336 \end{itemize}
1338 Since 2007, bilingual dictionaries allow the specification of more
1339 than one TL translation, so that a lexical selection module (see
1340 Section \ref{se:seleccio_lex}) can choose the most suitable equivalent
according to the context. To that end, an attribute has been added to
bilingual dictionaries; it is described in Section
\ref{dic_lextor}.
1346 \textbf{Post-generation dictionaries} are used to perform some
1347 transformations (orthographic changes, contractions, apostrophation,
1348 etc.) required after surface forms in the target language have been
generated and come into contact with each other. Since this kind of
operation can be expressed as a translation of character strings, it
has been decided to use the same type of dictionary. It is
1352 implicitly assumed that the parts of the text whose processing has not
1353 been specified are copied just as they arrive. In these dictionaries,
1354 the definition of paradigms is useful to express systematic changes in
1355 the word contact phenomena. Unlike the other dictionary types, these
1356 do not include grammatical symbols, since they process surface forms.
1358 The block structure of post-generation dictionaries is the following:
1359 \begin{itemize}
1361 \item \textit{A definition of paradigms}. To use in entries.
1362 \item \textit{A dictionary section}. Where the patterns for
1363 post-generation operations are specified.
1364 \end{itemize}
1367 The following table contains an overview of the possible reading
1368 directions of dictionaries and their application to the Romance
1369 languages in this project:
1371 \begin{center}
1372 \begin{tabular}{|l|l|l|}
1373 \hline
1374 Dictionary & Reading direction & Function \\
1375 \hline
1376 Morphological & left--right & analysis for \texttt{es}, \texttt{ca} and \texttt{gl}\\
1377 & right--left & generation for \texttt{es}, \texttt{ca} and \texttt{gl}\\\hline
1378 Bilingual & left--right & translation for \texttt{es-ca} and \texttt{es-gl}\\
1379 & right--left & translation for \texttt{ca-es} and \texttt{gl-es}\\\hline
1380 Post-generation & left--right & post-generation for \texttt{ca}, \texttt{es} and \texttt{gl}\\\hline
1382 \end{tabular}
1383 \end{center}
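
The effect of the reading direction can be shown with a small Python
sketch in which a dictionary is reduced to a list of (left, right)
string pairs. This is a deliberate simplification (the real
dictionaries are compiled into letter transducers), and the function
name \texttt{compile\_direction} and the sample entries are
hypothetical.

\begin{small}
\begin{verbatim}
from collections import defaultdict

def compile_direction(pairs, direction="lr"):
    """Build a lookup table by reading each (left, right) pair
    from left to right ("lr") or from right to left ("rl")."""
    table = defaultdict(list)
    for left, right in pairs:
        if direction == "lr":
            source, target = left, right
        else:
            source, target = right, left
        table[source].append(target)
    return dict(table)

# Morphological entries: surface form on the left.
MORPH = [("gato", "gato<n><m><sg>"),
         ("gatos", "gato<n><m><pl>")]
analyser = compile_direction(MORPH, "lr")
generator = compile_direction(MORPH, "rl")

# Bilingual entry: Spanish on the left, Catalan on the right.
BIL = [("gato<n>", "gat<n>")]
es_ca = compile_direction(BIL, "lr")
ca_es = compile_direction(BIL, "rl")
\end{verbatim}
\end{small}

The same list of pairs thus yields two different modules, which is the
reason why a single dictionary file is enough for both directions.
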
1387 \subsubsection{Description of the dictionary format}
\label{formatodics} This section presents the main elements of the
format in which dictionaries are built. The formal definition (a DTD)
can be found in Appendix~\ref{ss:dtd_dics}. Section \ref{dic_lextor}
1391 describes the characteristics of a bilingual dictionary that works in
1392 an Apertium system with lexical selection module. Finally, from pages
1393 \pageref{ss:morfgen} to %\pageref{ss:bil} y
1394 \pageref{ss:postgen} there
1395 is a description of the different particularities of entries for the
1396 three dictionary types (morphological, bilingual and post-generation).
1400 \paragraph{Element for dictionary \texttt{<dictionary>}}
1402 This is the root element and includes the whole dictionary. It
1403 contains an alphabetic character definition, a definition of symbols
1404 (which are the morphological tags for the words), a definition of
1405 inflection paradigms and one or more dictionary sections, which
1406 contain the entries for the lexical forms (consisting of pairs made of
1407 surface form--lexical form). Figure \ref{fig:dictionary} shows the
1408 basic block structure of a generic dictionary.
1410 \begin{figure}
1411 \begin{small}
1412 \begin{alltt}
1413 <?\textbf{xml} \textsl{version}="1.0" \textsl{encoding}="iso-8859-15"?>
1414 <\textbf{dictionary}>
1415 <\textbf{alphabet}>abcdefghijk ... ABCDEFGH ... çñáéíóú</\textbf{alphabet}>
1416 <\textbf{sdefs}>
1417 <!-- ... -->
1418 </\textbf{sdefs}>
1419 <\textbf{pardefs}>
1420 <!-- ... -->
1421 </\textbf{pardefs}>
1422 <\textbf{section} ...>
1423 <!-- ... -->
1424 </\textbf{section}>
1425 <!-- ... -->
1426 </\textbf{dictionary}>
1427 \end{alltt}
1428 \end{small}
1429 \caption{Use of the elements \texttt{<\textbf{dictionary}>} and
1430 \texttt{<\textbf{alphabet}>}}
1431 \label{fig:dictionary}
1432 \end{figure}
1435 \paragraph{Element for alphabet \texttt{<alphabet>}}
It is used to specify a definition of alphabetic characters. The
purpose of this specification is to enable the modules that process the
input by means of letter transducers to tokenize it into individual
1440 words.\nota{Parlar dels mots desconeguts. Cita \ref{ss:section} -
1441 Mikel?}
In the present design, the definition of an alphabet only makes sense in
morphological dictionaries, since it is needed for the
analysis. Figure \ref{fig:dictionary} shows an example of the use of
this element.
1449 \paragraph{Element for symbol definition section \texttt{<sdefs>}} It
1450 groups all the symbol definitions in a dictionary
1451 (\texttt{<\textbf{sdef}>}). There is an example of its use in Figure
1452 \ref{fig:sdefs}.
1454 \paragraph{Element for symbol definition \texttt{<sdef>}}
It is an empty element (it does not delimit any content): its
attribute \texttt{\textsl{n}} specifies the name of a grammatical
symbol that is used in the dictionary to label lexical forms
morphologically. Figure \ref{fig:sdefs} shows an example of its
use. Appendix \ref{se:simbolosmorf} lists all the grammatical
symbols used in the dictionaries of this project.
1464 \begin{figure}
1465 \begin{small}
1466 \begin{alltt}
1467 <\textbf{sdefs}>
1468 <\textbf{sdef} \textsl{n}="n"/>
1469 <\textbf{sdef} \textsl{n}="det"/>
1470 <\textbf{sdef} \textsl{n}="sg"/>
1471 <\textbf{sdef} \textsl{n}="pl"/>
1472 <!-- ... -->
1473 </\textbf{sdefs}>
1474 \end{alltt}
1475 \end{small}
1476 \caption{Use of the element \texttt{<\textbf{sdefs}>}}
1477 \label{fig:sdefs}
1478 \end{figure}
1480 \paragraph{Element for dictionary section \texttt{<section>}}
1481 \label{ss:section}
It contains the words that will be recognised by the dictionary. The
reason for dividing a dictionary into sections is that some forms (for
example, the ones that come from the identification of certain regular
patterns, or forms that belong to a specific dialect) may need
different processing.
One of the problems that dividing a dictionary into sections helps to
solve is tokenization during morphological analysis. Most forms are
tokenized conditionally: a form is recognised only if it is followed by
a non-alphabetic character (that is, a character not defined in
\texttt{<\textbf{alphabet}>}). However, there are other forms, like
the Catalan apostrophized words \emph{l'} or \emph{d'}, that need
unconditional tokenization: there is no need to check what
comes after them, since, if it is an alphabetic character, it will
belong to the \textit{next} word. The forms that require unconditional
tokenization are placed in a specific section of the
dictionary. Other kinds of special processing can also be handled
through these divisions.

The attribute \texttt{\textsl{type}} specifies the kind of
tokenization applied in each dictionary section. Its possible values
are \texttt{standard}, for almost all the forms of the dictionary
(conditional mode); \texttt{postblank}, for the forms that require
unconditional tokenization and the placing of a blank; and
\texttt{inconditional}, for the remaining forms that require
unconditional tokenization.
1513 The attribute \texttt{\textsl{id}} is used to assign an identifier (a
1514 name) to the dictionary sections.
1516 \begin{figure}
1517 \begin{small}
1518 \begin{alltt}
1519 <\textbf{section} \textsl{id}="principal" \textsl{type}="standard">
1520 <!-- ... -->
1521 </\textbf{section}>
1522 <\textbf{section} \textsl{id}="patterns" \textsl{type}="inconditional">
1523 <!-- ... -->
1524 </\textbf{section}>
1525 \end{alltt}
1526 \end{small}
1527 \caption{Use of the element \texttt{<\textbf{section}>}}
1528 \label{fig:section}
1529 \end{figure}
1531 \paragraph{Element for entries \texttt{<e>}}
An entry is the basic unit of a dictionary or of a paradigm
definition. Entries consist of a concatenation, in any order, of string
pairs \texttt{<\textbf{p}>}, identity transductions
\texttt{<\textbf{i}>}, references to paradigms \texttt{<\textbf{par}>}
or regular expressions \texttt{<\textbf{re}>}. The structure and
meaning of these elements are explained later in this section (on
pages \pageref{ss:p}, \pageref{ss:i}, \pageref{ss:par} and
\pageref{ss:re}, respectively).

\label{restric}Entries have two optional attributes. The
first one is \texttt{\textsl{r}} (for \textit{restriction}), which
specifies whether the entry is to be considered only when reading the
dictionary from left to right (\texttt{LR}) or only when reading it
from right to left (\texttt{RL}). If nothing is specified, the entry
is considered in both directions.

In morphological dictionaries, the restriction \texttt{LR} means that
a LF is analysed but not generated (for example, when the LF belongs
to a dialectal variant that we wish to recognise but not to generate),
and the restriction \texttt{RL} means that a word is generated but not
analysed (needed, for example, for forms with a post-generator
activation mark; see page \pageref{ss:a} for more details).
In bilingual dictionaries, the restrictions \texttt{LR} and
\texttt{RL} mean that the translation is done in one direction only:
for example, in a bilingual \texttt{es}--\texttt{ca} dictionary,
\texttt{LR} indicates that the LF is only translated from Spanish to
Catalan, and \texttt{RL} only from Catalan to Spanish. For example,
the Spanish adverbs \emph{aún} and \emph{todavía} ("still") are both
translated into Catalan as the same word, \emph{encara}. The Catalan
adverb \emph{encara} can be translated into Spanish as only one of the
two words (there is no difference in meaning); we choose
\emph{todavía}. In this case, we have to
write two entries in the bilingual dictionary: the entry that matches
\emph{aún} with \emph{encara} needs the restriction
\texttt{LR} (translation only from \texttt{es} to \texttt{ca}), and the
one that matches \emph{todavía} with \emph{encara} needs no
restriction (translation in both directions).
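
The two entries just described could be sketched as follows. This is
only an illustration: it assumes that both adverbs carry the symbol
\texttt{adv}, and the actual entries in the \texttt{es}--\texttt{ca}
dictionary may differ.

\begin{small}
\begin{alltt}
<\textbf{e} \textsl{r}="LR">
  <\textbf{p}>
    <\textbf{l}>aún<\textbf{s} \textsl{n}="adv"/></\textbf{l}>
    <\textbf{r}>encara<\textbf{s} \textsl{n}="adv"/></\textbf{r}>
  </\textbf{p}>
</\textbf{e}>
<\textbf{e}>
  <\textbf{p}>
    <\textbf{l}>todavía<\textbf{s} \textsl{n}="adv"/></\textbf{l}>
    <\textbf{r}>encara<\textbf{s} \textsl{n}="adv"/></\textbf{r}>
  </\textbf{p}>
</\textbf{e}>
\end{alltt}
\end{small}

With these entries, both Spanish adverbs are translated into Catalan
as \emph{encara}, whereas \emph{encara} is translated into Spanish
only as \emph{todavía}.
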
Direction restrictions are also necessary in bilingual dictionaries
when we have words with gender to be determined ("GD") or number to be
determined ("ND") (see page~\pageref{ss:bil} for more
information).
The other optional attribute of entries is the lemma name
\texttt{\textsl{lm}}. Because paradigms are used to represent
the inflection regularities of lexical units, an entry in a
morphological dictionary contains only the part of the lemma that is
common to all the inflected forms, that is, the lemma cut
at the point where the paradigm regularity begins (for example, the
Spanish adjectives \emph{distinto}, \emph{absoluto} and \emph{marino}
appear in entries as \emph{distint}, \emph{absolut} and \emph{marin},
since the endings of the inflected forms are common to all of them and
specified in a paradigm). This can make the dictionary difficult
to read. The attribute \texttt{\textsl{lm}} therefore contains
the whole lemma of the lexical form, so that the dictionary is
easier to understand and linguists can solve problems quickly. In
bilingual dictionaries, which normally do not have references to
paradigms,\footnote{They could have references to paradigms, but we
did not judge it necessary for the languages involved \nota{attention:
ex--, vice--?}.} this attribute is not used.
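
As an illustration, the three adjectives mentioned above could be
entered as follows, using the elements \texttt{<\textbf{i}>} and
\texttt{<\textbf{par}>} explained later in this section. The paradigm
name \texttt{absolut\_o\_\_adj} is a hypothetical name chosen for this
sketch; it is not necessarily the one used in the actual dictionaries.

\begin{small}
\begin{alltt}
<!-- "absolut_o__adj" is a hypothetical paradigm name -->
<\textbf{e} \textsl{lm}="distinto">
  <\textbf{i}>distint</\textbf{i}>
  <\textbf{par} \textsl{n}="absolut_o__adj"/>
</\textbf{e}>
<\textbf{e} \textsl{lm}="absoluto">
  <\textbf{i}>absolut</\textbf{i}>
  <\textbf{par} \textsl{n}="absolut_o__adj"/>
</\textbf{e}>
<\textbf{e} \textsl{lm}="marino">
  <\textbf{i}>marin</\textbf{i}>
  <\textbf{par} \textsl{n}="absolut_o__adj"/>
</\textbf{e}>
\end{alltt}
\end{small}

In each entry only the truncated root appears as text; the value of
\texttt{\textsl{lm}} is what lets a reader find the full lemma at a
glance.
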
1596 \paragraph{Element for string pair \texttt{<p>}}
1597 \label{ss:p}
1599 This basic element of dictionaries is used in any kind of entry to
1600 indicate the correspondence between two strings; this
1601 correspondence specifies a lexical transformation that will be carried
1602 out by a state path in the resulting finite-state transducer
1603 \cite{garrido99j}.
1605 It is defined by a pair of internal elements: The left element
1606 (\texttt{<\textbf{l}>}) and the right element (\texttt{<\textbf{r}>}).
1607 Its structure is shown in Figure \ref{fig:p}.
1609 \begin{figure}
1610 \begin{small}
1611 \begin{alltt}
1612 <\textbf{p}>
1613 <\textbf{l}><!-- ... --></\textbf{l}>
1614 <\textbf{r}><!-- ... --></\textbf{r}>
1615 </\textbf{p}>
1616 \end{alltt}
1617 \end{small}
1618 \caption{Use of the element \texttt{<\textbf{p}>}}
1619 \label{fig:p}
1620 \end{figure}
A pair \texttt{<\textbf{p}>} must include both parts, although one
of them can be empty, which means deleting (or inserting) a string. The
elements \texttt{<\textbf{l}>} and \texttt{<\textbf{r}>} have the same
internal structure and the same requirements. They can contain text and
any number of references to grammatical symbols (which, for the
languages of the present project, inflected by suffixation, are
usually placed at the end). A string pair contains nothing outside
the tags \texttt{<\textbf{l}>} and \texttt{<\textbf{r}>}.
1632 \paragraph{Element for reference to symbol \texttt{<s>}}
References to symbols (or tags) specify the morphological
information of a LF. They can be used anywhere inside a string pair,
that is, inside the elements \texttt{<\textbf{l}>} and
\texttt{<\textbf{r}>}, as if they were individual characters; for the
languages of our project, however, they are placed at the end of the
pairs and always in the same order for the same word type. This order
is decided by the linguist according to how the LF is to be
characterised morphologically in the dictionaries, and it must be
the same in all the dictionaries of a system for the
lexical and structural transfer operations to work correctly. For
example, in the Romance language dictionaries of this project, a noun
has first the symbol for part of speech (\textit{n},
noun), then the one for gender (\textit{m}, masculine; \textit{f},
feminine; \textit{mf}, masculine--feminine), and finally the one for
number (\textit{sg}, singular; \textit{pl}, plural; \textit{sp},
singular--plural). The list in Appendix \ref{se:simbolosmorf}
contains all the grammatical symbols used in the dictionaries of this
project and shows the order established for each type
of word.
1654 In morphological dictionaries, references to symbols are used in
1655 paradigms as well as in entries which do not include any reference to
1656 a paradigm. In bilingual dictionaries, usually only the first symbol
1657 of each LF is specified, since the rest is automatically copied from
1658 the source language LF to the target language LF (in the case they are
1659 identical in both languages).
1661 To specify which symbol we are referring to, we use the (mandatory)
1662 attribute \texttt{\textsl{n}}. The symbol must be defined in the
1663 symbol definition section (\texttt{<\textbf{sdefs}>}).
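
The following minimal sketch shows how symbol references relate to
their definitions. The entry for the Spanish noun \emph{pan} is written
here without a paradigm purely for illustration; a real dictionary
would normally express its inflection through a paradigm.

\begin{small}
\begin{alltt}
<\textbf{sdefs}>
  <\textbf{sdef} \textsl{n}="n"/>
  <\textbf{sdef} \textsl{n}="m"/>
  <\textbf{sdef} \textsl{n}="sg"/>
</\textbf{sdefs}>
<!-- ... -->
<\textbf{e} \textsl{lm}="pan">
  <\textbf{p}>
    <\textbf{l}>pan</\textbf{l}>
    <\textbf{r}>pan<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="m"/><\textbf{s} \textsl{n}="sg"/></\textbf{r}>
  </\textbf{p}>
</\textbf{e}>
\end{alltt}
\end{small}

The symbols follow the order described above for nouns (part of
speech, gender, number), and each value of \texttt{\textsl{n}} used in
an \texttt{<\textbf{s}>} element matches an \texttt{<\textbf{sdef}>}.
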
1668 \paragraph{Element for identity transduction \texttt{<i>}}
1669 \label{ss:i}
1671 It is a way to write a string pair in which left side and right side
1672 are identical. For example, the two entries shown in Figure
1673 \ref{fig:i} are completely equivalent. The advantage of writing
1674 entries with this element is that the result is more compact and more
1675 readable.
1677 \begin{figure}
1678 \begin{small}
1679 \begin{alltt}
1682 <\textbf{e} \textsl{lm}="perro">
1683 <\textbf{p}>
1684 <\textbf{l}>perr</\textbf{l}><\textbf{r}>perr</\textbf{r}>
1685 </\textbf{p}>
1686 <\textbf{par} \textsl{n}="abuel_o__n"/>
1687 </\textbf{e}>
1691 <\textbf{e} \textsl{lm}="perro">
1692 <\textbf{i}>perr</\textbf{i}>
1693 <\textbf{par} \textsl{n}="abuel_o__n"/>
1694 </\textbf{e}>
1695 \end{alltt}
1696 \end{small}
\caption{Use of the element \texttt{<\textbf{i}>}: the two entries
shown are equivalent}
1699 \label{fig:i}
1700 \end{figure}
1704 \paragraph{Element for paradigm definition section \texttt{<pardefs>}}
1706 This element includes all the paradigm definitions of a dictionary,
1707 each definition in an element \texttt{<\textbf{pardef}>}, as shown in
1708 Figure \ref{fig:pardefs}.
1710 \begin{figure}
1711 \begin{small}
1712 \begin{alltt}
1713 <\textbf{pardefs}>
1714 <\textbf{pardef} \textsl{n}="abuel_o__n">
1715 <!-- ... -->
1716 </\textbf{pardef}>
1717 <!-- ... -->
1718 </\textbf{pardefs}>
1719 \end{alltt}
1720 \end{small}
1721 \caption{Use of the element \texttt{<\textbf{pardefs}>}}
1722 \label{fig:pardefs}
1723 \end{figure}
1727 \paragraph{Element for paradigm definition \texttt{<pardef>}}
1730 It defines an inflection paradigm in the dictionary. A paradigm can
1731 be understood as a small dictionary of alternative transformations
1732 that can be concatenated to parts of words (or to entries of another
1733 paradigm) to specify regularities in the lexical processing of the
1734 dictionary entries, such as inflection regularities. To specify these
1735 regularities, each paradigm is a list of entries \texttt{<\textbf{e}>}
1736 like the ones in the dictionary, that is, it has the same structure as
1737 a dictionary section \texttt{<\textbf{section}>}; therefore, paradigm
1738 entries consist of a pair (\texttt{<\textbf{p}>}) with left side
1739 (\texttt{<\textbf{l}>}) and right side (\texttt{<\textbf{r}>}). These
1740 elements can contain text or grammatical symbols
1741 \texttt{<\textbf{s}>}.
As in symbol definitions, paradigm definitions have an attribute
\texttt{\textsl{n}} which specifies the paradigm name, so that it can
be referred to inside dictionary entries. In a dictionary entry,
therefore, one only needs to give the name of the corresponding
paradigm to specify all its possible forms.

The paradigm definition outlined in Figure
\ref{fig:pardefs} is shown in full in Figure \ref{fig:pardef}. The
following table shows the information expressed by the paradigm:
1754 \begin{center}
1755 \begin{tabular}{|l|c|l|}
1756 \hline
1757 Root (SF and LF) & Ending (SF) & Analysis (LF) \\
1758 \hline
1759 \texttt{abuel} & \texttt{o} &\texttt{o<n><m><sg>}\\
1760 \texttt{abuel} & \texttt{a} &\texttt{o<n><f><sg>}\\
1761 \texttt{abuel} & \texttt{os} &\texttt{o<n><m><pl>}\\
1762 \texttt{abuel} & \texttt{as} &\texttt{o<n><f><pl>}\\
1763 \hline
1764 \end{tabular}
1765 \end{center}
1768 \begin{figure}
1769 \begin{small}
1770 \begin{alltt}
1771 <\textbf{pardef} \textsl{n}="abuel_o__n">
1772 <\textbf{e}>
1773 <\textbf{p}>
1774 <\textbf{l}>o</\textbf{l}>
1775 <\textbf{r}>o<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="m"/><\textbf{s} \textsl{n}="sg"/></\textbf{r}>
1776 </\textbf{p}>
1777 </\textbf{e}>
1778 <\textbf{e}>
1779 <\textbf{p}>
1780 <\textbf{l}>a</\textbf{l}>
1781 <\textbf{r}>o<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="f"/><\textbf{s} \textsl{n}="sg"/></\textbf{r}>
1782 </\textbf{p}>
1783 </\textbf{e}>
1784 <\textbf{e}>
1785 <\textbf{p}>
1786 <\textbf{l}>os</\textbf{l}>
1787 <\textbf{r}>o<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="m"/><\textbf{s} \textsl{n}="pl"/></\textbf{r}>
1788 </\textbf{p}>
1789 </\textbf{e}>
1790 <\textbf{e}>
1791 <\textbf{p}>
1792 <\textbf{l}>as</\textbf{l}>
1793 <\textbf{r}>o<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="f"/><\textbf{s} \textsl{n}="pl"/></\textbf{r}>
1794 </\textbf{p}>
1795 </\textbf{e}>
1796 </\textbf{pardef}>
1797 \end{alltt}
1798 \end{small}
1799 \caption{Use of the element \texttt{<\textbf{pardef}>} to define the
1800 inflective morphology of Spanish nouns with four endings, such as
1801 \emph{abuelo, -a, -os, -as} ("grandfather, grandmother") }
1802 \label{fig:pardef}
1803 \end{figure}
1805 This paradigm is assigned to all Spanish nouns (\texttt{n}) that
1806 inflect like \emph{abuelo}, such as \emph{alumno}, \emph{amigo} or
1807 \emph{gato}, and is designed to be used as a \textit{suffix} in
1808 dictionary entries. In general, paradigms can be applied to any
1809 position of a dictionary entry (if it makes sense, of course). We can
1810 think of paradigms as transducers that are inserted at the point where
they are specified. Figure \ref{fig:pardef2} shows an example of a
paradigm defined to be used as a prefix. It is the paradigm used to
analyse and generate Spanish words beginning with \emph{ex},
\emph{ex-}, etc.,
like \emph{ex-presidente}, \emph{exministro}, \emph{ex director},
etc., with all the orthographic variations (\emph{ex} with hyphen,
without hyphen and joined, without hyphen and with a blank
\texttt{<\textbf{b}/>}, see page~\pageref{s3:b}); the output lemma
simply adds \emph{ex}, with neither hyphen nor blank, to the
accompanying lemma. The
direction restrictions (\texttt{"LR"}) that appear in the example
determine which form the translator will generate. The empty
identity transduction (\texttt{<\textbf{i}/>}) is necessary in this
case to analyse and generate the word without the prefix \emph{ex}.
1824 \begin{figure}
1825 \begin{small}
1826 \begin{alltt}
1827 <\textbf{pardef} \textsl{n}="ex">
1828 <\textbf{e} \textsl{r}="LR"><\textbf{p}><\textbf{l}>ex<\textbf{b}/></\textbf{l}><\textbf{r}>ex</\textbf{r}></\textbf{p}></\textbf{e}>
1829 <\textbf{e}><\textbf{i}>ex</\textbf{i}></\textbf{e}>
1830 <\textbf{e} \textsl{r}="LR"><\textbf{p}><\textbf{l}>ex-</\textbf{l}><\textbf{r}>ex</\textbf{r}></\textbf{p}></\textbf{e}>
1831 <\textbf{e}><\textbf{i}/></\textbf{e}>
1832 </\textbf{pardef}>
1833 \end{alltt}
1834 \end{small}
1835 \caption{Use of the element \texttt{<\textbf{pardef}>} in the paradigm
1836 for the prefix \emph{ex}.}
1837 \label{fig:pardef2}
1838 \end{figure}
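
As a sketch of how such a prefix paradigm might be referenced, the
following hypothetical entry combines the paradigm of Figure
\ref{fig:pardef2} with the suffix paradigm of Figure
\ref{fig:pardef}. It assumes that \emph{ministro} inflects like
\emph{abuelo}; the actual Spanish dictionary may encode this word
differently.

\begin{small}
\begin{alltt}
<!-- hypothetical entry, for illustration only -->
<\textbf{e} \textsl{lm}="ministro">
  <\textbf{par} \textsl{n}="ex"/>
  <\textbf{i}>ministr</\textbf{i}>
  <\textbf{par} \textsl{n}="abuel_o__n"/>
</\textbf{e}>
\end{alltt}
\end{small}

Under these assumptions, the surface forms \emph{ministro},
\emph{exministro}, \emph{ex-ministro} and \emph{ex ministro} (and
their inflected forms) would all be analysed, whereas, because of the
\texttt{LR} restrictions, only the forms without the prefix and with
the joined prefix would be generated.
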
Entries in a paradigm can contain references to other paradigms,
provided that these have been defined earlier in the file. On the other
hand, for the moment a paradigm definition cannot refer to itself,
either directly or indirectly.
1846 Paradigms are used in morphological dictionaries for the analysis and
1847 generation of lexical forms. For the language pairs of this project,
1848 there is no need to define paradigms in bilingual dictionaries (see
1849 page~\pageref{ss:bil}).
1851 From Apertium 2 on, there is a new type of paradigm, called
1852 metaparadigm, that allows the definition of paradigms with variations
1853 according to the value of an attribute specified in each entry that
1854 refers to that paradigm. Section \ref{ss:metaparadigmas} describes the
1855 characteristics and use of metaparadigms.
1859 \paragraph{Element for reference to a paradigm \texttt{<par>}}
1860 \label{ss:par}
It is used inside an entry to indicate which inflection paradigm,
among the ones defined in \texttt{<\textbf{pardefs}>}, the entry
follows. Thanks to references to paradigms, there is no need to write
all the inflected forms of a lemma in a morphological dictionary
entry. The attribute \texttt{\textsl{n}} specifies the name
of the paradigm referred to.

The result of inserting a reference to a paradigm in an entry is the
creation of as many string pairs as there are cases in the
paradigm. For example, the entry in Figure \ref{fig:par}, with a
reference to the paradigm "\texttt{abuel\_o\_\_n}" (defined in Figure
\ref{fig:pardef}), is equivalent to a set of entries in which each
string pair of the paradigm is concatenated to the root \emph{perr}
(that is, one entry for every inflected form of the lemma), as shown
in Figure \ref{fig:lema_par}. In this figure, you can see that the
right string (\texttt{<\textbf{r}>}) always contains the lemma
(\emph{perro}) with the grammatical symbols that apply to the surface
form, since transfer operations are carried out on the lemma.
1883 The appropriate use of paradigms, besides enabling the creation of
1884 compact dictionaries, improves compilation speed and reduces memory
1885 requirements during this process, since in compilation it is possible
1886 to create a single data structure for each one of most paradigms
1887 \cite{ortiz05j}.
1889 \begin{figure}
1890 \begin{small}
1891 \begin{alltt}
1892 <\textbf{e} \textsl{lm}="perro">
1893 <\textbf{i}>perr</\textbf{i}>
1894 <\textbf{par} \textsl{n}="abuel_o__n"/>
1895 </\textbf{e}>
1896 \end{alltt}
1897 \end{small}
1898 \caption{Use of the element \texttt{<\textbf{par}>}}
1899 \label{fig:par}
1900 \end{figure}
1902 \begin{figure}
1903 \begin{small}
1904 \begin{alltt}
1905 <\textbf{e}>
1906 <\textbf{p}>
1907 <\textbf{l}>perro</\textbf{l}>
1908 <\textbf{r}>perro<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="m"/><\textbf{s} \textsl{n}="sg"/></\textbf{r}>
1909 </\textbf{p}>
1910 </\textbf{e}>
1911 <\textbf{e}>
1912 <\textbf{p}>
1913 <\textbf{l}>perra</\textbf{l}>
1914 <\textbf{r}>perro<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="f"/><\textbf{s} \textsl{n}="sg"/></\textbf{r}>
1915 </\textbf{p}>
1916 </\textbf{e}>
1917 <\textbf{e}>
1918 <\textbf{p}>
1919 <\textbf{l}>perros</\textbf{l}>
1920 <\textbf{r}>perro<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="m"/><\textbf{s} \textsl{n}="pl"/></\textbf{r}>
1921 </\textbf{p}>
1922 </\textbf{e}>
1923 <\textbf{e}>
1924 <\textbf{p}>
1925 <\textbf{l}>perras</\textbf{l}>
1926 <\textbf{r}>perro<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="f"/><\textbf{s} \textsl{n}="pl"/></\textbf{r}>
1927 </\textbf{p}>
1928 </\textbf{e}>
1929 \end{alltt}
1930 \end{small}
1931 \caption{Entry equivalent to the one in Figure \ref{fig:par}, that
1932 shows the result of inserting the reference to paradigm
1933 \texttt{<\textbf{par}>} with the paradigm defined in Figure
1934 \ref{fig:pardef}.}
1935 \label{fig:lema_par}
1936 \end{figure}
1940 \paragraph{Element for regular expression \texttt{<re>}}
1941 \label{ss:re}
Natural languages also contain patterns that can be recognized by
regular expressions: for example, punctuation marks, numbers (Arabic or
Roman), e-mail or web page addresses, or any kind of code identifiable
through these mechanisms.

For these cases we use the string contained in the element
\texttt{<\textbf{re}>}. The compiler reads the regular expression
definition and transforms it into a transducer that is inserted in the
rest of the dictionary and that translates all the strings that match
the expression into identical strings.

The present implementation accepts a subset of Unix regular
expressions, which includes the
operators \texttt{*}, \texttt{?}, \texttt{|} and \texttt{+}, as well
as grouping with parentheses and character sets with ranges, for
example \texttt{[a-zA-Zñú]}, or their negated versions, like
\verb![^a-z]!.

By analogy, they can be seen as \texttt{<\textbf{i}>} elements, with
the difference that the set of strings they identify may be infinite
(like numbers).
1966 \begin{figure}
1967 \begin{small}
1968 \begin{alltt}
1969 <\textbf{e}>
1970 <\textbf{re}>[0-9]+([.,][0-9]+)?(\%)?</\textbf{re}>
1971 <\textbf{p}><\textbf{l}/><\textbf{r}><\textbf{s} \textsl{n}="num"/></\textbf{r}></\textbf{p}>
1972 </\textbf{e}>
1973 \end{alltt}
1974 \end{small}
\caption{Use of the element \texttt{<\textbf{re}>} in an entry for the
detection of Arabic numbers.}
1977 \label{fig:e}
1978 \end{figure}
1980 Figure \ref{fig:e} shows the way to tag quantities expressed as Arabic
1981 numbers in the dictionary.
1984 \paragraph{Element for blank block \texttt{<b>}}
1985 \label{s3:b}
It is used to express the presence of blanks between the words of a
multiword (see page~\pageref{ss:multipalabras} for an explanation of
multiwords). It can be inserted in the \texttt{<\textbf{i}>},
\texttt{<\textbf{l}>} and \texttt{<\textbf{r}>} elements. Figure
\ref{fig:b} shows the entry for the Spanish multiword expression
\emph{hoy en día} ("nowadays"): the blanks between words are expressed
as \texttt{<\textbf{b}/>} elements inside the left and right strings.
1994 \begin{figure}
1995 \begin{small}
1996 \begin{alltt}
1997 <\textbf{e} \textsl{lm}="hoy en día">
1998 <\textbf{p}>
1999 <\textbf{l}>hoy<\textbf{b}/>en<\textbf{b}/>día</\textbf{l}>
2000 <\textbf{r}>hoy<\textbf{b}/>en<\textbf{b}/>día<\textbf{s} \textsl{n}="adv"/></\textbf{r}>
2001 </\textbf{p}>
2002 </\textbf{e}>
2003 \end{alltt}
2004 \end{small}
2005 \caption{Use of the element \texttt{<\textbf{b}>}}
2006 \label{fig:b}
2007 \end{figure}
2009 Blanks can consist of normal space characters or of document format
2010 information blocks encapsulated by the de-formatter
2011 (\textit{superblanks}, see Section \ref{ss:formato}).
2014 \paragraph{Element for post-generator activation \texttt{<a>}}
\label{ss:a} The element \texttt{<\textbf{a}>} for the activation of
the post-generator is used to indicate that a word in the target
language may undergo orthographic transformations due to contact with
other words; for example, being apostrophized, contracted, written
without intermediate spaces, etc. These transformations need to be
carried out after the generation of the target language surface forms,
since until then words are isolated and it is not possible to know
which words will come into contact. Therefore, these operations must
be carried out by the module that follows the generator, which is
called the post-generator. In order to signal which words are to be
processed by the post-generator, this element is used on the surface
form side of these entries in the morphological dictionary.
The example in Figure \ref{fig:a} shows its use, in a Catalan
morphological dictionary, for the preposition \textit{de}, which, when
it appears before a singular or plural masculine definite article
(\textit{el, els}), forms a contraction (\textit{del, dels}). The
presence of the tag \texttt{<\textbf{a}/>} activates
the post-generator, which checks whether the preposition is followed
by one of the words that cause it to contract and, if so, performs
the contraction (see page~\pageref{ss:postgen} for more details). The
restriction \texttt{RL} indicates that this is a generation-only
entry, since it makes no sense for analysis.
2039 \begin{figure}
2040 \begin{small}
2041 \begin{alltt}
2042 <\textbf{e} \textsl{r}="RL" \textsl{lm}="de">
2043 <\textbf{p}>
2044 <\textbf{l}><\textbf{a}/>de</\textbf{l}>
2045 <\textbf{r}>de<\textbf{s} \textsl{n}="pr"/></\textbf{r}>
2046 </\textbf{p}>
2047 </\textbf{e}>
2048 \end{alltt}
2049 \end{small}
2050 \caption{Use of the element \texttt{<\textbf{a}>} in a morphological
2051 dictionary}
2052 \label{fig:a}
2053 \end{figure}
2057 \paragraph{Element for group marking \texttt{<g>}}
2059 This element is used, inside the \texttt{<\textbf{l}>} and
2060 \texttt{<\textbf{r}>} elements, to define groups that require a
2061 special treatment beyond the normal word by word processing. It is
2062 used in inflective multiwords to signal the beginning and the end of
2063 the group of invariable lexical forms (one or more) that are adjacent to the
2064 inflected word and that, together with it, build an inseparable
2065 unit. In Section~\ref{ss:multipalabras} you will find a detailed
2066 explanation of the different multiword types, and in Figure
2067 \ref{fig:hacertilin} of that section you can see an example of its
2068 use.
2072 \paragraph{Element for joining of lexical forms \texttt{<j>}}
2073 \label{ss:j}
This element is used only on the right side of an entry
(\texttt{<\textbf{r}>}) to indicate that the words that form a
multiword are treated as individual lexical forms and, therefore, each
have their own grammatical symbols. This way, the multiword is
processed as a unit by the analyser and by the tagger until it reaches
the auxiliary module \texttt{pretransfer} (see Section
\ref{se:pretransfer}), which separates the lexical
forms it is made of so that they reach the transfer module as
independent forms. If the linguist wants these forms to reach the
generator joined together, forming a multiword again, a structural
transfer rule that groups them into a multiword must be defined
(see Section \ref{formatotransfer}). If, on the contrary, these joined
forms are to be used only for analysis, the entry must have the
restriction \texttt{LR}.
2090 In Section~\ref{ss:multipalabras} you will find a more detailed
2091 explanation of this element. An example of its use can be found in
2092 Figure \ref{fig:cont} of the mentioned section.
2094 \subsubsection{Modification of bilingual dictionaries for the new
2095 lexical selection module}
2096 \label{dic_lextor}
In 2007, a new module was added to the Apertium system: the
lexical selection module, which is described in Section
\ref{se:seleccio_lex}.

For bilingual dictionaries to work in a lexical selection system,
they must be slightly modified so that they allow the
specification of more than one translation in the target language. The
only change is the addition of two new attributes to the element
\texttt{<e>}. Although these new attributes can be used in all the
dictionaries of a system, they only make sense in a bilingual
dictionary entry.

Appendix~\ref{dixdtd} contains the part of the DTD \texttt{dix.dtd}
\nota{MG: shouldn't the two DTDs be merged into one?} where the
element \texttt{e} used for dictionary entries is defined. The new
attributes are:
2115 \begin{description}
2116 \item[slr (\emph{sense from left to right})] is used to specify the
2117 \emph{translation mark} when there is more than one translation from
2118 left to right for the lemma specified in the left side of an
2119 entry. The attribute can receive any value; however, the recommended
2120 action is to assign as value the lemma contained in the right part
2121 \texttt{<r>} (the translation of the lemma).
2122 \item[srl (\emph{sense from right to left})] is used to specify the
2123 \emph{translation mark} when there is more than one translation from
2124 right to left for the lemma specified in the right side of an entry.
2125 As before, the attribute can receive any value, but the recommended
2126 action is to assign as value the lemma contained in the left part
2127 \texttt{<l>} (the translation of the lemma).
2128 \end{description}
2130 Furthermore, in both cases the value of the attribute can end in a
2131 white space and the letter ``D'' to indicate that this is the default
2132 translation, that is, the translation that will be chosen when there
2133 is not enough information to make a decision. It is compulsory that,
2134 for entries that have more than one equivalent in target language, one
2135 of the equivalents, and only one, is marked with the letter ``D'' for
2136 \emph{default}.
2138 The following example shows how the new attributes are used. We take
2139 as example a bilingual English-Catalan dictionary, with the following
2140 entries having more than one translation in the target language:
2141 \begin{itemize}
2142 \item \emph{look}: can be translated into Catalan as \emph{mirar}
2143 (default) or as \emph{semblar} (according to the English senses
2144 \emph{view/seem}),
2145 \item \emph{floor}: can be translated into Catalan as \emph{pis}
2146 (default) or as \emph{terra} (according to the English senses
2147 \emph{level of building/ground}),
2148 \item \emph{pis}: can be translated into English as \emph{flat}
2149 (default) or as \emph{floor}.
2150 \end{itemize}
2152 This information is represented by means of the two attributes
2153 described:\label{entrades_lextor}
\begin{alltt}
\begin{small}
<e srl="flat D">
  <p>
    <l>flat<s n="n"/></l>
    <r>pis<s n="n"/><s n="m"/></r>
  </p>
</e>

<e slr="pis D" srl="floor">
  <p>
    <l>floor<s n="n"/></l>
    <r>pis<s n="n"/><s n="m"/></r>
  </p>
</e>

<e slr="terra">
  <p>
    <l>floor<s n="n"/></l>
    <r>terra<s n="n"/><s n="m"/></r>
  </p>
</e>

<e slr="mirar D">
  <p>
    <l>look<s n="vblex"/></l>
    <r>mirar<s n="vblex"/></r>
  </p>
</e>

<e slr="semblar">
  <p>
    <l>look<s n="vblex"/></l>
    <r>semblar<s n="vblex"/></r>
  </p>
</e>
\end{small}
\end{alltt}
2194 %\settocdepth{paragraph}
2196 \subsubsection{Particularities of the different dictionary types}
2197 \label{ss:morfgen}
2199 Dictionary entries have different characteristics depending on the
2200 dictionary type. Although some of these characteristics have been
2201 presented in the previous sections, we are going to describe them here
2202 more exhaustively.
2205 \paragraph{Morphological dictionaries}
In these dictionaries, which are used to generate the system's
morphological analysers and generators, the mark
\texttt{<\textbf{a}/>} must be added to those surface forms which,
once generated, may need orthographic transformations because of
contact with other words; these operations are carried out by the
post-generator. Since these marks are needed only in generation, the
entries containing them must be generation-only entries, which means
that they need to have the restriction
\texttt{\textsl{r}=}\verb!"RL"! (from right to left). Figure
\ref{fig:a} shows an entry containing this element.
2219 \paragraph{Bilingual dictionaries}
2220 \label{ss:bil}
As explained before, we have not used paradigms in the bilingual
dictionaries of our system; these dictionaries are built with generic
entries in which, almost always, only the part of speech is specified,
and which carry no inflection information. For example, in the
\texttt{es-ca} dictionary, the entry for the Spanish words
\textit{pan}, \textit{panes} ("bread"), translated into Catalan as
\textit{pa}, \textit{pans}, would be as shown in Figure \ref{fg:pan}.
2230 \begin{figure}
2231 \begin{small}
2232 \begin{alltt}
2233 <\textbf{e}>
2234 <\textbf{p}>
2235 <\textbf{l}>pan<\textbf{s} \textsl{n}="n"/></\textbf{l}>
2236 <\textbf{r}>pa<\textbf{s} \textsl{n}="n"/></\textbf{r}>
2237 </\textbf{p}>
2238 </\textbf{e}>
2239 \end{alltt}
2240 \end{small}
2241 \caption{Bilingual dictionary entry for the translation \emph{pan}
2242 (\texttt{es})--\emph{pa} (\texttt{ca})}
2243 \label{fg:pan}
2244 \end{figure}
2247 As you can see in the figure, only the first grammatical symbol
2248 \texttt{<\textbf{s} \textsl{n}="\ldots}\texttt{"}\texttt{/>} of each
2249 word is specified, since the unspecified symbols that come after the
2250 specified ones in the bilingual dictionary are copied from the source
2251 lexical form to the target lexical form. This entry, therefore, works
2252 both for \textit{pan} (singular) and for \textit{panes} (plural): the
2253 morphological analyser delivers the lemma (\emph{pan}) followed by the
2254 grammatical symbols that apply to the analysed surface form (\emph{n m
2255 sg} or \emph{n m pl} as applicable), and the symbols that are not
2256 specified in the bilingual entry (\emph{m sg} or \emph{m pl}) are
copied to the target language. This is valid for both translation
directions. The idea is to specify only the information that is
indispensable to differentiate the entries; the rest is
\textit{deduced} (copied). It is important to bear this in mind
because, when the grammatical symbols of a lexical form differ between
SL and TL, these differences must be specified in the bilingual
dictionary. For example, when there is a gender or number change
between the source word and its translation, the grammatical symbols
have to be specified in order (the order in which these symbols appear
in the morphological dictionaries)\footnote{To know which grammatical
  symbols have been used in the dictionaries and in which order, see
  Appendix \ref{se:simbolosmorf}.} up to and including the symbol that
changes between SL and TL.

For example, to translate the Spanish word \textit{cama}, a feminine
noun, into the Catalan word \textit{llit}, a masculine noun, the entry
in the bilingual dictionary must be as shown in Figure
\ref{fg:cama}. The gender must be specified (\emph{f}, \emph{m})
because, if not, the symbols for gender and number would be copied
from the SL lexical form into the TL lexical form. Therefore, when
translating from \texttt{es} to \texttt{ca}, we would obtain the
lexical form \emph{llit} with the symbols \texttt{n f sg} or \texttt{n
  f pl}. In both cases, the generator would receive as input a lexical
form that is impossible to generate, since the Catalan morphological
dictionary does not contain any entry with lemma \emph{llit} and
feminine gender.
2284 \begin{figure}
2285 \begin{small}
2286 \begin{alltt}
2287 <\textbf{e}>
2288 <\textbf{p}>
2289 <\textbf{l}>cama<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="f"/></\textbf{l}>
2290 <\textbf{r}>llit<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="m"/></\textbf{r}>
2291 </\textbf{p}>
2292 </\textbf{e}>
2293 \end{alltt}
2294 \end{small}
2295 \caption{Bilingual dictionary entry for the translation \emph{cama}
2296 (\texttt{es})--\emph{llit} (\texttt{ca})}
2297 \label{fg:cama}
2298 \end{figure}
In the example of Figure~\ref{fg:cama}, the number symbols are not
specified; therefore, the entry works for the correspondence
\textit{cama--llit} (singular) as well as for \textit{camas--llits}
(plural). However, when there is a number change, the gender must also
be specified if the order used throughout the dictionaries for
grammatical symbols is \emph{gender, number}.
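The copying of unspecified symbols can be traced with the entries of
Figures \ref{fg:pan} and \ref{fg:cama}. The following lines are only
an illustrative sketch in data stream format, for the direction
\texttt{es} to \texttt{ca}; the symbols \texttt{m}, \texttt{f},
\texttt{sg} and \texttt{pl} are the ones mentioned in the text, and
the last line shows the lexical form that would result if the gender
symbols were left out of the \emph{cama}--\emph{llit} entry:

\begin{alltt}
pan<n><m><sg>   -->  pa<n><m><sg>     (m sg copied)
pan<n><m><pl>   -->  pa<n><m><pl>     (m pl copied)
cama<n><f><sg>  -->  llit<n><m><sg>   (f replaced by m, sg copied)
cama<n><f><pl>  -->  llit<n><m><pl>   (f replaced by m, pl copied)
cama<n><f><sg>  -->  llit<n><f><sg>   (gender omitted: cannot be generated)
\end{alltt}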
By means of a direction restriction \texttt{r} we can indicate which
translations are to be done in one direction only (see the description
of the restrictions \texttt{LR} and \texttt{RL} on page
\pageref{restric}). This is necessary when the correspondence between
two lexical forms is not symmetrical; in such a case, two or more
entries have to be created in the bilingual dictionary and a direction
restriction must be applied, as in the example shown in
Figure~\ref{fg:postre}. In this example, when translating from Spanish
to Catalan (\texttt{LR}), we must generate only plural forms, since
the Catalan word \textit{postres} ("dessert") does not have a singular
form. In the opposite direction, we translate into Spanish only in the
plural (although in Spanish the word has singular and plural forms),
since it is not possible to determine, from the Catalan word, whether
the number should be singular or plural.
2326 \begin{figure}[htbp]
2327 \begin{small}
2328 \begin{alltt}
2329 <\textbf{e} \textsl{r}="LR">
2330 <\textbf{p}>
2331 <\textbf{l}>postre<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="m"/><\textbf{s} \textsl{n}="sg"/></\textbf{l}>
2332 <\textbf{r}>postres<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="m"/><\textbf{s} \textsl{n}="pl"/></\textbf{r}>
2333 </\textbf{p}>
2334 </\textbf{e}>
2336 <\textbf{e}>
2337 <\textbf{p}>
2338 <\textbf{l}>postre<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="m"/><\textbf{s} \textsl{n}="pl"/></\textbf{l}>
2339 <\textbf{r}>postres<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="m"/><\textbf{s} \textsl{n}="pl"/></\textbf{r}>
2340 </\textbf{p}>
2341 </\textbf{e}>
2342 \end{alltt}
2343 \end{small}
2344 \caption{Entries in the Spanish-Catalan bilingual dictionary for the
2345 correspondence \emph{postre}--\emph{postres} ("dessert")}
2346 \label{fg:postre}
2347 \end{figure}
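Read against Figure~\ref{fg:postre}, the two entries give the
following correspondences. This is a sketch in data stream format that
uses only the symbols appearing in the figure:

\begin{alltt}
es to ca:  postre<n><m><sg>   -->  postres<n><m><pl>   (first entry, LR only)
           postre<n><m><pl>   -->  postres<n><m><pl>   (second entry)
ca to es:  postres<n><m><pl>  -->  postre<n><m><pl>    (second entry)
\end{alltt}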
\label{pg:GD} Another problem caused by grammatical divergences
between two languages is resolved with the help of two special
symbols, \texttt{GD} (for \textit{gender to be determined}) and
\texttt{ND} (for \textit{number to be determined}), which have to be
defined in the symbol section of the bilingual dictionary. This
problem arises when the grammatical information of an SL lexical form
is not enough to determine the gender (masculine or feminine) or the
number (singular or plural) of the TL lexical form. For example, the
Spanish adjective \textit{común} ("common") has the same form for the
masculine and the feminine (and is therefore tagged as
masculine--feminine, \texttt{mf}), but in Catalan the adjective has
different forms for the masculine, \textit{comú}/\textit{comuns}, and
the feminine, \textit{comuna}/\textit{comunes}. In the bilingual
dictionary, the entry should be as shown in Figure~\ref{fg:comuna}: in
the \texttt{LR} direction (from Spanish to Catalan), the gender
information is neither \texttt{m}, \texttt{f} nor \texttt{mf}, but
\texttt{GD}; this \textit{gender to be determined} is resolved later
by the structural transfer module, by applying the suitable transfer
rules (usually, rules for the agreement between the lexical forms in a
pattern; see Section \ref{ss:transfer} for a detailed description of
transfer rules). An analogous mechanism exists for number, using the
symbol \texttt{ND} (for example, in Spanish \textit{análisis}
("analysis") is both singular and plural, whereas in Catalan the
singular form is \textit{anàlisi} and the plural form
\textit{anàlisis}).
2377 \begin{figure}[htbp]
2378 \begin{small}
2379 \begin{alltt}
2380 <\textbf{e} \textsl{r}="LR">
2381 <\textbf{p}>
2382 <\textbf{l}>común<\textbf{s} \textsl{n}="adj"/><\textbf{s} \textsl{n}="mf"/></\textbf{l}>
2383 <\textbf{r}>comú<\textbf{s} \textsl{n}="adj"/><\textbf{s} \textsl{n}="GD"/></\textbf{r}>
2384 </\textbf{p}>
2385 </\textbf{e}>
2387 <\textbf{e} \textsl{r}="RL">
2388 <\textbf{p}>
2389 <\textbf{l}>común<\textbf{s} \textsl{n}="adj"/><\textbf{s} \textsl{n}="mf"/></\textbf{l}>
2390 <\textbf{r}>comú<\textbf{s} \textsl{n}="adj"/><\textbf{s} \textsl{n}="m"/></\textbf{r}>
2391 </\textbf{p}>
2392 </\textbf{e}>
2394 <\textbf{e} \textsl{r}="RL">
2395 <\textbf{p}>
2396 <\textbf{l}>común<\textbf{s} \textsl{n}="adj"/><\textbf{s} \textsl{n}="mf"/></\textbf{l}>
2397 <\textbf{r}>comú<\textbf{s} \textsl{n}="adj"/><\textbf{s} \textsl{n}="f"/></\textbf{r}>
2398 </\textbf{p}>
2399 </\textbf{e}>
2400 \end{alltt}
2401 \end{small}
2402 \caption{Entries in the Spanish--Catalan bilingual dictionary for the
2403 correspondence \emph{común}--\emph{comú} ("common"), the first one
2404 for the translation from Spanish to Catalan and the two others for
2405 the translation from Catalan to Spanish}
2406 \label{fg:comuna}
2409 \end{figure}
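By analogy with Figure~\ref{fg:comuna}, the \texttt{ND} case of
\emph{análisis}--\emph{anàlisi} could be encoded as in the following
sketch. It is only an illustration: the symbol \texttt{sp}
(singular--plural) and the genders given to both nouns are assumptions
that should be checked against Appendix \ref{se:simbolosmorf} and the
actual dictionaries.

\begin{alltt}
<\textbf{e} r="LR">
  <\textbf{p}>
    <\textbf{l}>análisis<\textbf{s} n="n"/><\textbf{s} n="m"/><\textbf{s} n="sp"/></\textbf{l}>
    <\textbf{r}>anàlisi<\textbf{s} n="n"/><\textbf{s} n="f"/><\textbf{s} n="ND"/></\textbf{r}>
  </\textbf{p}>
</\textbf{e}>

<\textbf{e} r="RL">
  <\textbf{p}>
    <\textbf{l}>análisis<\textbf{s} n="n"/><\textbf{s} n="m"/><\textbf{s} n="sp"/></\textbf{l}>
    <\textbf{r}>anàlisi<\textbf{s} n="n"/><\textbf{s} n="f"/><\textbf{s} n="sg"/></\textbf{r}>
  </\textbf{p}>
</\textbf{e}>

<\textbf{e} r="RL">
  <\textbf{p}>
    <\textbf{l}>análisis<\textbf{s} n="n"/><\textbf{s} n="m"/><\textbf{s} n="sp"/></\textbf{l}>
    <\textbf{r}>anàlisi<\textbf{s} n="n"/><\textbf{s} n="f"/><\textbf{s} n="pl"/></\textbf{r}>
  </\textbf{p}>
</\textbf{e}>
\end{alltt}

As with \texttt{GD}, the first entry leaves the number open for the
structural transfer module, and the two \texttt{RL} entries map both
Catalan numbers onto the single Spanish form.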
2413 \paragraph{Post-generation dictionaries}
2414 \label{ss:postgen}
In the morphological dictionary, the lexical forms which, once
generated, may undergo contraction, apostrophation or other
transformations, depending on which words are in contact with them in
the output text, must have the post-generator activation mark
(\texttt{<\textbf{a}/>}, see page \pageref{ss:a}) in the generation
entry (\texttt{RL} direction). It is essential that the surface forms
marked with the post-generator activation mark are identical in the
morphological and the post-generation dictionaries of the same
translator. In the post-generation dictionary, all entries begin with
this activation mark.

Figure~\ref{fg:postgen} shows an extract of the Spanish
post-generator; the example shows how \textit{de} and \textit{el} are
contracted to form the word \textit{del}. The paradigm
\texttt{puntuación}, not defined in the example, contains the
non-alphabetic characters that can appear in a text. We can see in the
example that the entry for the preposition \emph{de} has the mark
\texttt{<\textbf{a}/>}. The paradigm assigned to this entry,
"\texttt{el}", is the one defined just above. According to this entry,
when the system receives as input the left string of the entry (the part
between \texttt{<\textbf{l}>} elements) concatenated to the left string of
the paradigm (that is, when the input is
\texttt{"}\texttt{<a/>\textbf{de}<b/>\textbf{el}<b/>"} or
\texttt{"}\texttt{<a/>\textbf{de}\\<b/>\textbf{el}[puntuación]}\texttt{"}),
the module delivers as output (the part between \texttt{<\textbf{r}>}
elements) the string \texttt{"}\textbf{del}\texttt{"} followed by the
blanks represented with \texttt{<b/>} or by the symbols represented
with \texttt{[puntu\-a\-ción]}. Note that, in the module output, all
the marks \texttt{<\textbf{a}/>} have been removed.
2453 \begin{figure}[htbp]
2454 \begin{small}
2455 \begin{alltt}
2456 <\textbf{dictionary}>
2457 <\textbf{pardefs}>
2458 ...
2459 <\textbf{pardef} \textsl{n}="el">
2460 <\textbf{e}>
2461 <\textbf{p}>
2462 <\textbf{l}>el<\textbf{b}/></\textbf{l}>
2463 <\textbf{r}>l<\textbf{b}/></\textbf{r}>
2464 </\textbf{p}>
2465 </\textbf{e}>
2466 <\textbf{e}>
2467 <\textbf{p}>
2468 <\textbf{l}>el</\textbf{l}>
2469 <\textbf{r}>l</\textbf{r}>
2470 </\textbf{p}>
2471 <\textbf{par} \textsl{n}="puntuación"/>
2472 </\textbf{e}>
2473 </\textbf{pardef}>
2474 ...
2475 </\textbf{pardefs}>
2476 <\textbf{section} \textsl{id}="main" \textsl{type}="standard">
2477 ...
2478 <\textbf{e}>
2479 <\textbf{p}>
2480 <\textbf{l}><\textbf{a}/>de<\textbf{b}/></\textbf{l}>
2481 <\textbf{r}>de</\textbf{r}>
2482 </\textbf{p}>
2483 <\textbf{par} \textsl{n}="el"/>
2484 </\textbf{e}>
2485 ...
  </\textbf{section}>
</\textbf{dictionary}>
2488 \end{alltt}
2489 \end{small}
2490 \caption{Post-generation dictionary data to perform the contraction
2491 for Spanish \emph{de} + \emph{el} = \emph{del} .}
2492 \label{fg:postgen}
\end{figure} \nota{In the example, "el" should not carry the
  activation mark, should it? I have removed it from the example; it
  should also be removed from the dictionaries (Mikel?)}
2498 %\settocdepth{subsubsection}
2501 \subsubsection{Multiword lexical units}
2502 \label{ss:multipalabras}
2505 The designed dictionary format allows the creation of
2506 \textit{multiword lexical units} ---in short, \textit{multiwords}---
2507 of different kinds, depending on the problem to be approached.
2509 In this project we have considered three basic types of multiwords:
2510 \begin{enumerate}
\item The simplest case is that of \textit{multiwords without
    inflection}, which consist of only one lexical form: the lemma is
  made of two or more invariable orthographic words but it is tagged
  as a unit. Figure \ref{fig:msf} shows an example of an invariable
  multiword (the Spanish expression \emph{hoy en día}, "nowadays"): it
  is made of three words separated by blanks
  (\texttt{<\textbf{b}/>}) and, although it actually consists of an
  adverb, a preposition and a noun, it is tagged as an adverb as a
  whole, since it acts as one.
2520 \begin{figure}
2521 \begin{small}
2522 \begin{alltt}
2523 <\textbf{e} \textsl{lm}="hoy en día">
2524 <\textbf{p}>
2525 <\textbf{l}>hoy<\textbf{b}/>en<\textbf{b}/>día</\textbf{l}>
2526 <\textbf{r}>hoy<\textbf{b}/>en<\textbf{b}/>día<\textbf{s} \textsl{n}="adv"/></\textbf{r}>
2527 </\textbf{p}>
2528 </\textbf{e}>
2529 \end{alltt}
2530 \end{small}
2531 \caption{Example of multiword without inflection in the morphological
2532 dictionary}
2533 \label{fig:msf}
2534 \end{figure}
\item A more complicated case is that of \textit{compound
    multiwords}, made of more than one lexical form, each one with its
  grammatical symbols. The words they are made of are not considered
  to form a semantic unit as in the previous case, but to appear
  together as a unit for reasons of contact (phonetic or orthographic
  reasons). In this category we include \textit{contractions} and
  \textit{enclitic pronouns} accompanying verbs. To mark this
  phenomenon we use the element \texttt{<\textbf{j}/>} described on
  page~\pageref{ss:j}. You can see an example in
  Figure~\ref{fig:cont}, in which the analysis of \emph{del} delivers
  a lexical multiform made of two lexical forms: \emph{de},
  preposition, and \emph{el}, singular masculine definite determiner,
  linked with the \texttt{<\textbf{j}/>} element. The analyser and the
  part-of-speech tagger handle these multiwords as a unit; however,
  before entering the transfer module, they are processed by an
  auxiliary module called \texttt{pretransfer} (see Section
  \ref{se:pretransfer}), which is responsible for separating the
  lexical forms they are made of. This way, they reach the transfer
  module as independent forms; the linguist has to decide whether they
  have to be joined again (which must be done in the structural
  transfer module) or whether they have to remain as independent forms
  through the next modules.

  In our system, the elements forming a contraction continue as
  independent forms, and the post-generator is responsible for making
  the contractions in the target language if necessary. On the other
  hand, enclitic pronouns are joined again to the verb by means of a
  structural transfer rule (see Section \ref{ss:transfer}), so the
  verb plus its enclitic pronouns enter the generation module as a
  single lexical multiform, its components joined with a
  \texttt{<\textbf{j}/>}. Therefore, entries containing enclitic
  pronouns must not have any direction restriction, as can be seen in
  the example in Figure \ref{fig:encl}, which shows a part of the
  paradigm for the Spanish verb \emph{dar} ("to give"), specifically
  the entry for the infinitive form joined to an enclitic pronoun.
2574 \begin{figure}
2575 \begin{small}
2576 \begin{alltt}
2577 <\textbf{e} \textsl{lm}="del" \textsl{r}="LR">
2578 <\textbf{p}>
2579 <\textbf{l}>del</\textbf{l}>
2580 <\textbf{r}>de<\textbf{s} \textsl{n}="pr"/><\textbf{j}/>
2581 el<\textbf{s} \textsl{n}="det"/><\textbf{s} \textsl{n}="def"/>
2582 <\textbf{s} \textsl{n}="m"/><\textbf{s} \textsl{n}="sg"/></\textbf{r}>
2583 </\textbf{p}>
2584 </\textbf{e}>
2585 \end{alltt}
2586 \end{small}
2587 \caption{Entry in the morphological dictionary for the analysis of a
2588 contraction (the Spanish contraction \emph{del})}
2589 \label{fig:cont}
2590 \end{figure}
2592 \begin{figure}
2593 \begin{small}
2594 \begin{alltt}
2595 <\textbf{e}>
2596 <\textbf{p}>
2597 <\textbf{l}>ar</\textbf{l}>
2598 <\textbf{r}>ar<\textbf{s} \textsl{n}="vblex"/><\textbf{s} \textsl{n}="inf"/><\textbf{j}/></\textbf{r}>
2599 </\textbf{p}>
2600 <\textbf{par} \textsl{n}="S__cantar"/>
2601 </\textbf{e}>
2602 \end{alltt}
2603 \end{small}
2604 \caption{A fragment of the inflection paradigm for the Spanish verb
2605 \emph{dar} ("to give"), which shows the entry for the infinitive form
2606 followed by an enclitic pronoun. Enclitic pronouns are contained in
2607 the paradigm \texttt{S\_\_cantar}. Note that, unlike in Figure
2608 \ref{fig:cont}, this entry is both for analysis and generation.}
2609 \label{fig:encl}
2610 \end{figure}
\item The most complicated case in our system is that of
  \textit{multiwords with inner inflection} inside the lemma (or
  "split lemma" forms), like the example shown in Figure
  \ref{fig:echardemenos}. The lemma of this kind of multiword has one
  part with inflection (the \emph{lemma head}) followed by one
  invariable part (the \emph{lemma tail}). The invariable part has to
  be put between \texttt{<\textbf{g}>} elements, so that it can be
  moved to the position immediately after the lemma head to obtain the
  whole lemma of the multiword. For example, the lemma of the Spanish
  multiwords \emph{echó de menos} ("he/she missed"), \emph{echándole
    de menos} ("missing him/her"), etc. has to be \emph{echar de menos}
  ("to miss"), since this is the form that will be looked up in the
  bilingual dictionary to find its translation. This means that the
  invariable lemma tail (\emph{de menos}) has to be moved to just
  after the uninflected lemma head (\emph{echar}). This movement is
  done by the auxiliary module \texttt{pretransfer} (see Section
  \ref{se:pretransfer}), which runs before the structural transfer
  module.

  To understand the example in Figure \ref{fig:echardemenos}, you have
  to be aware that the paradigm defining the verb \emph{echar}
  includes, besides the verb inflection, the enclitic pronouns that
  can appear at the end of the inflected forms of the verb; in the
  output lexical multiform, these enclitic pronouns are joined using
  the empty element \texttt{<\textbf{j}/>}.
2642 \begin{figure}
2643 \begin{small}
2644 \begin{alltt}
2645 <\textbf{e} \textsl{lm}="echar de menos">
2646 <\textbf{i}>ech</\textbf{i}>
  <\textbf{par} \textsl{n}="aspir/ar__vblex"/> <!-- it includes enclitic pronouns -->
2648 <\textbf{p}>
2649 <\textbf{l}><\textbf{b}/>de<\textbf{b}/>menos</\textbf{l}>
2650 <\textbf{r}><\textbf{g}><\textbf{b}/>de<\textbf{b}/>menos</\textbf{g}></\textbf{r}>
2651 </\textbf{p}>
2652 </\textbf{e}>
2653 \end{alltt}
2654 \end{small}
2655 \caption{A morphological dictionary entry containing a
2656 \texttt{<\textbf{g}>} group.}
2657 \label{fig:echardemenos}
2658 \label{fig:hacertilin}
2659 \end{figure}
  When the translation is also a \emph{split lemma} (for example, the
  translation of "to miss" in Catalan is \emph{trobar a faltar}, with
  forms like \emph{trobem a faltar}, \emph{trobar-lo a faltar}, etc.),
  it is necessary to put the lemma tail back in its original place,
  after the inflected form plus the enclitic pronouns (if any), and to
  indicate the correspondence of these invariable parts of the lemma
  (\emph{de menos}, \emph{a faltar}) on both sides of the
  translation. So, in the example of Figure~\ref{fig:echardemenos},
  the \texttt{<\textbf{g}>} element is used to mark the group
  ``\texttt{<b/>de<b/>menos}'' in the morphological dictionary,
  whereas in the bilingual dictionary (see
  Figure~\ref{fig:menosfaltar}), the \texttt{<\textbf{g}>} element is
  used to establish the correspondence between the groups
  ``\texttt{<b/>de<b/>menos}'' and ``\texttt{<b/>a<b/>faltar}''.
  \nota{And what about the case of ``dirección general'' -
    ``direcciones generales''?}

  If the translation is not a \emph{split lemma}, you do not need to
  insert any \texttt{<\textbf{g}>} element in the target language
  string.
2683 \end{enumerate}
2685 \begin{figure}
2686 \begin{small}
2687 \begin{alltt}
2688 <\textbf{e}>
2689 <\textbf{p}>
2690 <\textbf{l}>echar<\textbf{g}><\textbf{b}/>de<\textbf{b}/>menos</\textbf{g}><\textbf{s} \textsl{n}="vblex"/></\textbf{l}>
2691 <\textbf{r}>trobar<\textbf{g}><\textbf{b}/>a<\textbf{b}/>faltar</\textbf{g}><\textbf{s} \textsl{n}="vblex"/></\textbf{r}>
2692 </\textbf{p}>
2693 </\textbf{e}>
2694 \end{alltt}
2695 \end{small}
2696 \caption{A bilingual dictionary entry containing two corresponding
2697 \texttt{<\textbf{g}>} groups.}
2698 \label{fig:menosfaltar}
2699 \end{figure}
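The effect of \texttt{pretransfer} on compound multiwords and on split
lemmas can be sketched as follows. The lines are an illustration only:
we assume that, in the data stream, \texttt{<j/>} is written as
\texttt{+} and the \texttt{<g>} group is introduced by \texttt{\#};
the grammatical symbols are the ones shown in Figures \ref{fig:cont}
and \ref{fig:encl}.

\begin{alltt}
input to pretransfer           output of pretransfer
de<pr>+el<det><def><m><sg>     de<pr>   el<det><def><m><sg>
echar<vblex><inf># de menos    echar# de menos<vblex><inf>
\end{alltt}

In the first line the lexical multiform is split into two independent
lexical forms; in the second, the lemma tail is moved next to the
lemma head, so that the lemma \emph{echar de menos} can be looked up
in the bilingual dictionary.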
2701 \subsubsection{Metaparadigms}
2702 \label{ss:metaparadigmas}
\nota{Marco says: specify the DTD?}

When developing the dictionaries for the Occitan translator, we were
faced with a new need: we wanted to be able to specify paradigms for
verbs that had the same inflection pattern but whose root changed in
the different inflected forms. With the existing paradigm system, a
new paradigm had to be created for each of these verbs, since it was
only possible to specify an inflection regularity pattern for a group
of verbs with an invariable root. With metaparadigms, it is possible
to specify the inflection regularity as well as the verb root
variations.

At the same time, metaparadigms allow the specification, in a single
paradigm, of variations in the grammatical symbols of a lemma. That
is, several lemmas can refer to the same metaparadigm even if they
have different grammatical symbols. Whereas for Occitan metaparadigms
have made it possible to share a paradigm among entries with root
variations, for English they have made it possible to share a paradigm
among entries with variations in their grammatical symbols.

Related to this, we created the concept of \emph{metadictionary}: a
dictionary which contains metaparadigms as well as the normal
paradigms used so far. The name of a metadictionary is
\texttt{apertium-PAIR.}$L_1$\texttt{.metadix}
(for example, for the English monolingual dictionary in the
Apertium-en-ca system, \texttt{apertium-en-ca.en.metadix}). When
linguistic data are compiled, these dictionaries are pre-processed so
that they have the appropriate format for the dictionary compiler.
2735 \paragraph{Specification of metaparadigms}
Metaparadigms are defined in the \texttt{<\textbf{pardefs}>} section
of the monolingual dictionary, the same section where the rest of the
dictionary paradigms are defined. A metaparadigm, just like a
paradigm, has a name specified in the attribute \texttt{n}. This name
has the same characteristics as in the other paradigms, with the
difference that the variable part of the lemma root is written in
square brackets and in capital letters, as you can see in this example:
2745 \begin{alltt}
2746 <\textbf{pardef} n="m/é[T]er\_\_vblex">
2747 \end{alltt}
This is the definition of a verb paradigm in which the root contains a
variable part. The inflection paradigms referred to inside this
metaparadigm must present inflection only in the part to the right of
the brackets, like the one referred to by the following element:
2755 \begin{alltt}
2756 <\textbf{par} n="mét/er\_\_vblex"/>
2757 \end{alltt}
Putting this together, a complete example of a metaparadigm definition would be:
2763 \begin{alltt}
2764 <\textbf{pardef} n="m/é[T]er__vblex">
2765 <\textbf{e}>
2766 <\textbf{p}>
2767 <\textbf{l}>e</\textbf{l}>
2768 <\textbf{r}>é</\textbf{r}>
2769 </\textbf{p}>
2770 <\textbf{i}><prm/></\textbf{i}>
2771 <\textbf{par} n="sent/eria__vblex"/>
2772 </\textbf{e}>
2773 <\textbf{e}>
2774 <\textbf{i}>é<prm/></\textbf{i}>
2775 <\textbf{par} n="mét/er__vblex"/>
2776 </\textbf{e}>
2777 </\textbf{pardef}>
2779 \end{alltt}
The element \texttt{<\textbf{prm}/>} is the marker used to place the
variable text part (the root variation) in the paradigm definition.

Once a metaparadigm is defined, verb entries can use it. To do so, in
the verb entry (inside an \texttt{<\textbf{e}>} element) we must refer
to the suitable metaparadigm and, through the attribute \texttt{prm},
define the letters that replace the variable part specified in
brackets. For example:
2793 \begin{alltt}
2794 <\textbf{e} lm="acuélher">
2795 <\textbf{i}>acu</\textbf{i}>
2796 <\textbf{par} n="m/é[T]er__vblex" prm="lh"/>
2797 </\textbf{e}>
2799 \end{alltt}
2801 This entry defines the Occitan verb \emph{acuélher} ("to receive") and
2802 specifies that its inflection paradigm is the one defined by the
2803 metaparadigm \texttt{m/é[T]er\_\_vblex}, but replacing \texttt{T} with
2804 \texttt{lh}; that is, the letters following \emph{acu} will be
2805 \emph{élher} instead of \emph{éter}.
As mentioned before, metaparadigms can also be used for entries which
have some variation in their grammatical symbols. The way to specify
them is basically the same: the variable part must be specified in the
entry with the attribute \texttt{sa}, whereas in the paradigm the
element \texttt{<\textbf{sa}/>} has to be placed where the optional
grammatical symbol should appear.

For example, we have the following metaparadigm:
2818 \begin{alltt}
2819 <\textbf{pardef} n="house__n">
2820 <\textbf{e}>
2821 <\textbf{p}>
2822 <\textbf{l}/>
2823 <\textbf{r}><\textbf{s} n="n"/><sa/><\textbf{s} n="sg"/></\textbf{r}>
2824 </\textbf{p}>
2825 </\textbf{e}>
2826 <\textbf{e}>
2827 <\textbf{p}>
2828 <\textbf{l}>s</\textbf{l}>
      <\textbf{r}><\textbf{s} n="n"/><sa/><\textbf{s} n="pl"/></\textbf{r}>
2830 </\textbf{p}>
2831 </\textbf{e}>
2832 </\textbf{pardef}>
2834 \end{alltt}
2837 and the following entry:
2839 \begin{alltt}
2840 <\textbf{e} lm="time">
2841 <\textbf{i}>time</\textbf{i}>
2842 <\textbf{par} n="house__n" sa="unc"/>
2843 </\textbf{e}>
2844 \end{alltt}
2846 where \emph{unc} means that the noun is uncountable.
In the metaparadigm, the element \texttt{<\textbf{sa}/>} marks the
place where the grammatical symbol is inserted if an entry contains
the attribute \texttt{sa} with a value, as happens in the entry for
\emph{time}.
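Conversely, an entry that refers to the same metaparadigm without the
attribute \texttt{sa} would presumably receive no additional symbol.
The following entry is a hypothetical illustration (the lemma
\emph{house} is suggested only by the name of the metaparadigm):

\begin{alltt}
<\textbf{e} lm="house">
  <\textbf{i}>house</\textbf{i}>
  <\textbf{par} n="house__n"/>
</\textbf{e}>
\end{alltt}

Under that assumption, its analyses would be \texttt{house<n><sg>} and
\texttt{house<n><pl>}, with nothing inserted at the position of
\texttt{<\textbf{sa}/>}.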
A dictionary which contains entries like the ones described here is
called a metadictionary and must be pre-processed in order to generate
a dictionary that follows the DTD for Apertium 2, since the engine
does not allow the direct use of metaparadigms. The next section
describes how this pre-processing is carried out.
2863 \paragraph{Pre-processing of the metadictionary}
2866 A metadictionary is an XML file to which two XSLT style sheets are
2867 applied, in order to pre-process the metaparadigms and obtain a
2868 dictionary with all the paradigms derived from the metaparadigms. The
2869 first style sheet, \texttt{buscaPar.xsl}, produces the list of verbs
2870 that use metaparadigms and deletes the possible repetitions of
2871 metaparadigms to be expanded. This style sheet generates, in
2872 combination with the sheet \texttt{principal.xsl}, a second style
2873 sheet called \texttt{gen.xsl}, which processes the metadictionary with
2874 the list of metaparadigms to be expanded and generates a dictionary in
2875 Apertium 2 format. Basically, what this generated style sheet does is:
2877 \begin{enumerate}
2880 \item In verb entries, if a verb uses a metaparadigm, this
2881 metaparadigm is replaced by the corresponding expanded and
2882 deparametrized paradigm. Thus, the previous example entry:
2884 \begin{alltt}
2885 <\textbf{e} lm="acuélher">
2886 <\textbf{i}>acu</\textbf{i}>
2887 <\textbf{par} n="m/é[T]er__vblex" prm="lh"/>
2888 </\textbf{e}>
2889 \end{alltt}
2891 would be deparametrized and expanded into:
2893 \begin{alltt}
2894 <\textbf{e} lm="acuélher">
2895 <\textbf{i}>acu</\textbf{i}>
2896 <\textbf{par} n="m/élher__vblex"/>
2897 </\textbf{e}>
2898 \end{alltt}
2901 \item On the other hand, since from the first pass the system knows
2902 which paradigms have to be created from metaparadigms, these are
2903 created. In the previous example, from the metaparadigm:
2905 \begin{alltt}
2906 <\textbf{pardef} n="m/é[T]er__vblex">
2907 <\textbf{e}>
2908 <\textbf{p}>
2909 <\textbf{l}>e</\textbf{l}>
2910 <\textbf{r}>é</\textbf{r}>
2911 </\textbf{p}>
2912 <\textbf{i}><prm/></\textbf{i}>
2913 <\textbf{par} n="sent/eria__vblex"/>
2914 </\textbf{e}>
2915 <\textbf{e}>
2916 <\textbf{i}>é<prm/></\textbf{i}>
2917 <\textbf{par} n="mét/er__vblex"/>
2918 </\textbf{e}>
2919 </\textbf{pardef}>
2920 \end{alltt}
2922 the system would generate the paradigm
2923 \texttt{"m/élher\_\_vblex"} :
2925 \begin{alltt}
2926 <\textbf{pardef} n="m/élher__vblex">
2927 <\textbf{e}>
2928 <\textbf{p}>
2929 <\textbf{l}>e</\textbf{l}>
2930 <\textbf{r}>é</\textbf{r}>
2931 </\textbf{p}>
    <\textbf{i}>lh</\textbf{i}>
2933 <\textbf{par} n="sent/eria__vblex"/>
2934 </\textbf{e}>
2935 <\textbf{e}>
2936 <\textbf{i}>élh</\textbf{i}>
2937 <\textbf{par} n="mét/er__vblex"/>
2938 </\textbf{e}>
2939 </\textbf{pardef}>
2940 \end{alltt}
2942 \end{enumerate}
After the metadictionary has been processed according to these steps,
a \texttt{.dix} dictionary is generated which follows the DTD for
Apertium 2 and which is ready to be compiled.

In the case of our second example, where the variable part was the
sequence of grammatical symbols in the paradigm, the style sheets
would be applied and, from the value \emph{unc} specified in the
attribute \texttt{sa}, the following paradigm would be generated:
2954 \begin{alltt}
2955 <\textbf{pardef} n="house__n__unc">
2956 <\textbf{e}>
2957 <\textbf{p}>
2958 <\textbf{l}/>
2959 <\textbf{r}><\textbf{s} n="n"/><\textbf{s} n="unc"/><\textbf{s} n="sg"/></\textbf{r}>
2960 </\textbf{p}>
2961 </\textbf{e}>
2962 <\textbf{e}>
2963 <\textbf{p}>
2964 <\textbf{l}>s</\textbf{l}>
      <\textbf{r}><\textbf{s} n="n"/><\textbf{s} n="unc"/><\textbf{s} n="pl"/></\textbf{r}>
2966 </\textbf{p}>
2967 </\textbf{e}>
2968 </\textbf{pardef}>
2970 \end{alltt}
for nouns whose morphological analysis should be (in data stream
format):
2976 \begin{alltt}
2977 time<n><unc><sg>
2978 \end{alltt}
In this case, metaparadigms allow the same paradigm to be used for
entries with the same inflection but with a slightly different
morphological analysis.
2984 It is important to note that, when a dictionary uses metaparadigms
2985 and, accordingly, its name has the extension \texttt{.metadix}, this
2986 will be the file where dictionary changes have to be made (adding,
2987 changing or deleting entries or paradigms), since the file
2988 \texttt{.dix} is automatically generated from this one every time
2989 linguistic data are compiled and, therefore, any changes made in the
2990 latter will be overwritten during compilation.
2994 \subsection[Automatic generation of the modules]{Automatic generation
2995 of the lexical processing modules}
2996 \label{se:compiladoresdic}
2999 The four lexical processing modules (morphological analyser, lexical
3000 transfer, morphological generator and post-generator) are compiled
3001 from dictionaries by means of a single compiler based
3002 on letter transducers \cite{roche97}. This compiler is much faster
3003 than the ones used in the systems \textsf{interNOSTRUM}
3004 \cite{canals01b,garridoalenda01p,garrido99j} and \textsf{Traductor
3005 Universia} \cite{garrido03p, gilabert03j}, thanks to the use of new
3006 compiler building strategies and the minimization of partial
3007 transducers during the building process \cite{ortiz05j}.
The division of dictionary entries into lemma and paradigm enables the
efficient construction of minimal letter transducers. The compiler
exploits the factorization provided by paradigms to speed up the
construction: in most European languages, inflection affects the end
or the beginning of words, and this regularity is used to improve the
construction speed of the minimal transducer.

Paradigms are also minimized before being inserted into the complete
dictionary transducer, which reduces its size before the final
minimization. Since, before minimization, the paradigms of the
dictionaries for the languages we have dealt with usually have just a
few hundred states, minimizing them is a very fast process.
An entry may contain a reference to a paradigm at any point, and the
simplest strategy would be to copy, at that point, the transducer
computed for the paradigm definition. The method used in
\emph{Apertium} is based on the idea that copying is not always
necessary, because in certain cases a paradigm that has already been
copied can be reused. In particular, two or more entries that share a
paradigm as a suffix can reuse the same copy of this paradigm; the
same holds when the paradigm is shared as a prefix. In general,
however, paradigms located in intermediate positions of different
entries cannot be reused: new suffixes (or prefixes) may be added to
existing entries, the information inserted in the transducer would
then no longer be consistent with the dictionary, and the generated
transducer would be incorrect (it would add string pairs that are not
present in the formal language defined by the dictionaries).
Minimal letter transducers are built as follows. From a string
transduction $(s:t)$ it is possible to build a \textit{sequence of
letter transductions} $S(s:t)$ of length $N = \max(|s|,|t|)$, whose
elements are defined as follows for $1 \leq i \leq N$, where $\theta$
denotes the empty symbol:
3045 \begin{equation}
3046 \label{eq:transletras} S_i(s:t)=\left\{
3047 \begin{array}{ll}
3048 (s_i:\theta) & \textrm{if } i \leq |s| \wedge i > |t| \\
3049 (\theta:t_i) & \textrm{if } i \leq |t| \wedge i > |s| \\
3050 (s_i:t_i) & \textrm{in other cases}
3051 \end{array}\right.
3052 \label{e:montaje}
3053 \end{equation}
It should be emphasized that, by construction, no transduction $(s:t)$
may be equal to $(\epsilon:\epsilon)$, which is crucial for the
consistency of the building method.
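
The definition in Equation~\ref{e:montaje} can be sketched in a few
lines of Python. This is only an illustrative sketch, not the code of
the Apertium compiler: the function name
\texttt{letter\_transductions} and the use of \texttt{None} to stand
for the empty symbol $\theta$ are assumptions made for this example.

\begin{verbatim}
THETA = None  # assumed stand-in for the empty symbol theta


def letter_transductions(s, t):
    """Return S(s:t) as a list of (input, output) letter pairs.

    The shorter string is padded with THETA so that the result has
    length max(len(s), len(t)).
    """
    if not s and not t:
        # (epsilon:epsilon) is excluded by construction
        raise ValueError("(epsilon:epsilon) is not a valid transduction")
    length = max(len(s), len(t))
    pairs = []
    for i in range(length):
        left = s[i] if i < len(s) else THETA
        right = t[i] if i < len(t) else THETA
        pairs.append((left, right))
    return pairs
\end{verbatim}

For instance, \texttt{letter\_transductions("z", "ces")} pads the
shorter side with the empty symbol and yields the pairs $(z:c)$,
$(\theta:e)$ and $(\theta:s)$.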
The building method uses two procedures: the \textit{assembly}
procedure derived from Equation~\ref{e:montaje}, and the minimization
procedure, which is carried out by a conventional minimization
algorithm \cite{vandesnepscheut93b} for deterministic finite-state
automata. This algorithm consists of reversing the automaton,
determinizing it, reversing it again and determinizing it again; the
alphabet of the automaton to be minimized is the Cartesian product
$L \times L$, and the empty transition is
$\left(\theta:\theta\right)$.
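
The minimization procedure just described (two rounds of reversal
followed by determinization) can be sketched as follows. The sketch
rests on several assumptions that are not taken from the Apertium
sources: an automaton is represented as a triple (initial states,
final states, transitions), where the transitions are a dictionary
from (state, symbol) to a set of states; symbols may be any hashable
value, for example pairs of letters; and empty
$\left(\theta:\theta\right)$ transitions are not handled, so they
would have to be removed beforehand. All function names are
hypothetical.

\begin{verbatim}
from collections import defaultdict


def reverse(starts, finals, trans):
    """Swap initial and final states and flip every transition."""
    flipped = defaultdict(set)
    for (src, sym), dsts in trans.items():
        for dst in dsts:
            flipped[(dst, sym)].add(src)
    return set(finals), set(starts), dict(flipped)


def determinize(starts, finals, trans):
    """Subset construction restricted to the reachable subsets."""
    start = frozenset(starts)
    symbols = {sym for (_, sym) in trans}
    dtrans, dfinals = {}, set()
    seen, todo = {start}, [start]
    while todo:
        subset = todo.pop()
        if subset & finals:
            dfinals.add(subset)
        for sym in symbols:
            target = frozenset(
                d for q in subset for d in trans.get((q, sym), ()))
            if not target:
                continue
            dtrans[(subset, sym)] = {target}
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return {start}, dfinals, dtrans


def minimize(starts, finals, trans):
    """Reverse, determinize, reverse again and determinize again."""
    once = determinize(*reverse(starts, finals, trans))
    return determinize(*reverse(*once))


def states_of(starts, finals, trans):
    """Collect every state mentioned in the automaton."""
    states = set(starts) | set(finals)
    for (src, _), dsts in trans.items():
        states.add(src)
        states |= dsts
    return states
\end{verbatim}

As a small check, a five-state trie that accepts \emph{as} and
\emph{bs} through separate branches is reduced by \texttt{minimize}
to three states, because the two branches share the same suffix.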
3069 \begin{figure}
3070 \begin{center}
3071 \includegraphics[width=10cm]{fig1}
3072 \end{center}
3073 \caption{Building of the dictionary as prefix acceptor and link to
3074 paradigms through transitions $\left(\theta:\theta\right)$.}
3075 \label{fig:construccion}
3076 \end{figure}
3078 \begin{figure}
3079 \begin{center}
3080 \includegraphics[width=8cm]{fig2}
3081 \end{center}
3082 \caption{Minimized paradigm "-es \textbf{n m}" used in Figure
3083 \ref{fig:construccion}.}
3084 \label{fig:paradigmapan}
3085 \end{figure}
3087 \begin{figure}
3088 \begin{center}
3089 \includegraphics[width=8cm]{fig3}
3090 \end{center}
3091 \caption{Minimized paradigm "z/-ces \textbf{n m}" used in Figure
3092 \ref{fig:construccion}.}
3093 \label{fig:paradigmavez}
3094 \end{figure}
Figure \ref{fig:construccion} shows a simplified example of the
assembly process. Transductions, built as in
Equation~\ref{e:montaje}, are inserted one by one into a transducer
that has the form of a \textit{prefix acceptor} or \textit{trie}, that
is, in such a way that there is only one node for each common prefix
of the set of transductions that form the dictionary. New states are
created for the suffixes of the transductions that are not shared. At
the point where there is a reference to a paradigm, a replica of this
paradigm is created and linked to the dictionary entry being inserted
by means of a null transduction $\left(\theta:\theta\right)$.
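
The assembly step can be illustrated with the following sketch. It is
a simplification under stated assumptions, not the actual compiler
code: nodes are plain Python dictionaries, the names
\texttt{new\_node} and \texttt{insert\_entry} are hypothetical, the
null transduction is represented by the pair \texttt{(None, None)},
and the paradigm is always replicated (the reuse of existing copies
discussed earlier is left out).

\begin{verbatim}
import copy

EMPTY = (None, None)  # assumed encoding of the null transduction


def new_node():
    return {"final": False, "next": {}}


def insert_entry(root, pairs, paradigm=None):
    """Insert a sequence of letter pairs into the trie rooted at root.

    Common prefixes share nodes.  If the entry refers to a paradigm,
    a replica of the paradigm is linked through the EMPTY transition;
    otherwise the last node is marked as final.
    """
    node = root
    for pair in pairs:
        node = node["next"].setdefault(pair, new_node())
    if paradigm is None:
        node["final"] = True
    else:
        node["next"][EMPTY] = copy.deepcopy(paradigm)
    return root
\end{verbatim}

Inserting two entries that begin with the same letter pairs creates a
single path for the shared prefix and branches only where the entries
differ.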
Each paradigm can be seen as a small dictionary: it is built by this
same procedure and then minimized, which reduces the amount of
material to be handled when building the complete dictionary. Figures
\ref{fig:paradigmapan} and \ref{fig:paradigmavez} show the paradigms
used in Figure \ref{fig:construccion} after minimization.
3119 \section{Part-of-speech tagger}
3120 \label{ss:tagger}
3122 \subsection{Module description }
3123 \label{functagger}
3126 The part-of-speech tagger is based on first-order hidden Markov
3127 models~\cite{rabiner89}, that is, on statistical data. The states of
3128 the Markov model represent parts of speech, and the observable
3129 parameters are ambiguity classes~\cite{cutting92a}, formed by groups
3130 of parts of speech.
Although the tagger works with statistical information, its training
and behaviour improve when restrictions are applied that forbid
certain sequences of parts of speech (in first-order models, these
sequences can only include two parts of speech). For example, in
Spanish or Catalan a preposition can never be followed by a finite
verb form; this restriction is of great help when the word after a
preposition is ambiguous and one of its possible analyses is a finite
verb form (e.g., \emph{de trabajo}, \emph{en libertad}, etc.).
Restrictions are explicitly declared in the tagger definition file,
sometimes in the form of \emph{prohibitions} and sometimes in the form
of \emph{obligations}.
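
One possible way of imposing a prohibition on a first-order model is
sketched below: forbidden transitions receive probability zero and
each row of the transition table is renormalized. This is an
assumption about how such a restriction could be enforced, not a
description of the Apertium implementation; the function name
\texttt{apply\_forbid} and the representation of the model as nested
dictionaries are hypothetical.

\begin{verbatim}
def apply_forbid(transitions, forbidden):
    """Zero out forbidden (previous, next) pairs and renormalize.

    transitions maps a category to a dictionary of next-category
    probabilities; forbidden is a set of (previous, next) pairs.
    """
    constrained = {}
    for prev, row in transitions.items():
        kept = {nxt: (0.0 if (prev, nxt) in forbidden else prob)
                for nxt, prob in row.items()}
        total = sum(kept.values())
        if total > 0:
            kept = {nxt: prob / total for nxt, prob in kept.items()}
        constrained[prev] = kept
    return constrained
\end{verbatim}

With a prohibition on a preposition followed by a finite verb, the
probability mass that the unrestricted model assigned to that
transition is redistributed among the remaining categories.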
3144 The morphological tags which the tagger works with are not the same as
3145 the ones used in the morphological analyser. Usually, the information
3146 delivered by the analyser is too detailed for the purposes of the
3147 part-of-speech disambiguation (for example, for most purposes, it
3148 suffices to group in the same category all common nouns, regardless of
3149 their gender and number). The use of finer-grained tags does not improve the
3150 results, whereas it increases the number of parameters to be estimated
3151 and intensifies the problem of lack of linguistic resources such as
3152 manually disambiguated texts. For this reason, in the tagger file one
3153 has to specify how to group the \emph{fine-grained} tags delivered by the
3154 morphological analyser into more general \emph{coarse} tags ---which
3155 we will call \emph{categories}--- that will be used in the
part-of-speech disambiguation. Apart from coarse categories, one can
also define lexicalized tags. Two types of lexicalization are
described in the literature: one type adds new observables and the
other one, in addition, adds new states to the Markov
model~\cite{pla04}; the tagger in Apertium uses the latter
lexicalization type.
It is important to note that, although it works with \emph{coarse}
categories, the tagger outputs fine-grained tags like the ones
delivered by the morphological analyser. Sometimes the morphological
analyser delivers, for a certain word, two or more fine-grained tags
that are grouped under the same tagger category: e.g.\ in Spanish
\emph{cante} can be the first or the third person singular of the
present subjunctive of the verb \emph{cantar} ("to sing"); both
fine-grained tags, \texttt{\emph{<vblex><prs><p1><sg>}} and
\texttt{\emph{<vblex><prs><p3><sg>}}, are grouped under the tagger
category \texttt{VLEXSUBJ} (\emph{subjunctive verb}). In this case,
one of the two fine-grained tags is discarded; in the tagger
definition file it is possible to define which fine-grained tag, among
the ones that make up a coarse tag, will be delivered after
disambiguation.
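
The choice between fine-grained tags of the same category can be
sketched as follows. The function name \texttt{choose\_fine\_tag} is
hypothetical, tags are compared as exact dot-separated strings, and
the fallback to the first candidate when no preference applies is an
assumption of this sketch, not documented behaviour.

\begin{verbatim}
def choose_fine_tag(candidates, preferences):
    """Pick the fine-grained tag to output for one coarse category.

    candidates are the fine-grained tags delivered by the analyser
    for the word; preferences is the ordered list of preferred tags,
    where the first matching preference wins.
    """
    for preferred in preferences:
        if preferred in candidates:
            return preferred
    # Assumption: with no applicable preference, keep the first one.
    return candidates[0]
\end{verbatim}

For \emph{cante}, calling the function with the candidates
\texttt{vblex.prs.p1.sg} and \texttt{vblex.prs.p3.sg} and a preference
list containing \texttt{vblex.prs.p3.sg} returns the third-person tag.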
3180 \subsection{Data for the part-of-speech tagger}
3181 \label{datostagger}
3182 \subsubsection{Introduction}
\label{ss:introtagger} We describe next the format of the files that
specify how to group the \emph{fine-grained} tags delivered by the
morphological analyser into more general \emph{coarse} tags. In these
files one can also specify \emph{restrictions} that help in the
estimation of the statistical model underlying the process of lexical
disambiguation, as well as preference rules to be applied when two
fine-grained tags belong to the same category.
3192 The tagger assumes that, in the input stream, lexical forms will be
3193 appropriately delimited, as described in the format specification for
3194 the data stream between modules (Section \ref{se:flujodatos}). In
3195 brief, the format of the data delivered by the morphological analyser
3196 is the following:
\begin{equation}
\label{eq:formaanalizada}
\begin{array}{rcl}
\mbox{analysedform}&\to& \mbox{lexicalmultiform}\;
[\; \mbox{lexicalmultiform} \; ]^* \\
\mbox{lexicalmultiform}&\to& \mbox{lexicalform}\; [\;\mbox{lexicalform}\; ]^*\;\mbox{lemma-queue?} \\
\mbox{lexicalform}&\to&\mbox{lemma}\;\mbox{finetag}\\
\mbox{lemma-queue}&\to&\mbox{lemma}\\
\mbox{finetag}&\to&\mbox{morphsymbol}\;[\;\mbox{morphsymbol}\;]^* \\
\end{array}
\end{equation}
3209 \label{formaanalizada}
3211 where:
3214 \begin{itemize}
3215 \item \emph{analysedform} is all the information delivered for each
3216 surface form in the output of the morphological analyser
3217 \item \emph{lexicalmultiform} is a sequence of one or more lexical
3218 forms followed, optionally, by an invariable queue as happens in some
3219 multiwords (like the Spanish expression \emph{cántale las cuarenta}).
\item \emph{lexicalform}\footnote{Lexical forms are separated from
each other by a delimiter which corresponds to the \texttt{<j/>}
element (see page \pageref{ss:j}).} is a unit made of one lemma and
one or more grammatical symbols (which make up the fine-grained tag)
carrying the output information of the analyser
3225 \item \emph{lemma-queue} is made of one or more lemmas
3226 \footnote{Separated from each other by the \texttt{<b/>} element
3227 (see page~\pageref{s3:b}).} that are the invariable part of a
3228 multiword. The queue of a multiword is made of the lemma or lemmas
3229 with no inflection that follow the lemmas with inflection. For
3230 example, the Spanish multiword \emph{cantar las cuarenta} ("to
3231 lecture", "to reproach") can take the forms \emph{cántale las
3232 cuarenta}, \emph{(le) cantaré las cuarenta}, \emph{cantándole las
3233 cuarenta}, etc. In this case, the queue would be \emph{las cuarenta}
3234 (see page~\pageref{ss:multipalabras} for more information).
\item \emph{finetag} is made of one or more grammatical symbols
(\emph{morphsymbol}).
3238 \end{itemize}
For example, the entry for the ambiguous Spanish surface form
\emph{correos} would have two lexical multiforms; the first lexical
multiform would have one single lexical form, with lemma \emph{correo}
("post office") and a fine-grained tag made of the grammatical symbols
\emph{common noun}, \emph{masculine}, \emph{plural}; the second
lexical multiform would be a sequence of two lexical forms, one with
lemma \emph{correr} ("to run") and a fine-grained tag made of the
grammatical symbols \emph{lexical verb}, \emph{imperative},
\emph{second person}, \emph{plural}, and the other one with lemma
\emph{vosotros} ("you") and a fine-grained tag made of the grammatical
symbols \emph{pronoun}, \emph{enclitic}, \emph{second person},
\emph{masculine-feminine}, \emph{plural}.
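
The structure of an analysed form can be made concrete with a small
parser. The sketch assumes the data stream format of
Section~\ref{se:flujodatos}, in which a unit is delimited by a
circumflex and a dollar sign, the surface form and its analyses are
separated by slashes, lexical forms are joined by a plus sign and the
invariable queue follows a hash sign; escaped characters are ignored.
The function name \texttt{parse\_analysed\_form} and the returned data
layout are hypothetical.

\begin{verbatim}
import re


def parse_analysed_form(unit):
    """Split one stream unit into its surface form and multiforms."""
    body = unit.strip()
    if not (body.startswith("^") and body.endswith("$")):
        raise ValueError("expected a unit delimited by ^ and $")
    surface, *analyses = body[1:-1].split("/")
    multiforms = []
    for analysis in analyses:
        main, _, queue = analysis.partition("#")
        forms = []
        for lexical_form in main.split("+"):
            lemma = lexical_form.split("<", 1)[0]
            symbols = re.findall(r"<([^<>]+)>", lexical_form)
            forms.append((lemma, symbols))
        multiforms.append(
            {"forms": forms, "queue": queue.strip() or None})
    return surface, multiforms
\end{verbatim}

Applied to a unit for \emph{correos} with one analysis for
\emph{correo} and one for \emph{correr} plus \emph{vosotros}, the
function returns two lexical multiforms, the second of which contains
two lexical forms. The symbol names used in such a unit depend on the
dictionaries and are only illustrative here.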
3253 \notavisible{An explanation of how a word containing more than one
3254 lexical form is treated when no multilexical form is defined for it
3255 should be added}
3257 \subsubsection{Format specification}
3258 \label{formatotagger} The format of the file (encoded in XML) is
3259 specified by the DTD that can be found in
3260 Appendix~\ref{ss:DTD_desambiguador}.
3263 The meaning of the different tags is the following:
3264 \begin{description}
3265 \item[\texttt{tagger}]: is the root element; its mandatory attribute
3266 \texttt{name} is used to specify the name of the tagger generated from
3267 the file.
3268 \item[\texttt{tagset}]: defines the \emph{coarse} tagset or categories
3269 with which the tagger works. Categories are defined by the fine-grained tags
3270 output by the morphological analyser.
3271 \item[\texttt{def-label}]: defines a category or coarse tag (whose
3272 name is specified in the mandatory attribute \texttt{name}) by means
3273 of a list of fine tags defined with one or more \texttt{tags-item}
3274 elements; an optional attribute \texttt{closed} indicates whether
3275 this is a closed category; if this is the case, it is assumed that
3276 an unknown word can never belong to this category.\footnote{Closed
3277 categories are those that do not grow when new words are created:
3278 prepositions, determiners, conjunctions, etc.}
3280 The more specific categories \emph{must} be defined before the more
3281 general ones. When the definition of a general category implicitly
3282 includes that of a specific category defined before, it is
3283 understood that it refers to all cases \emph{except} the ones
3284 defined by the more specific category.
3286 \item[\texttt{tags-item}]: is used to define a fine-grained tag by means of a
3287 sequence of grammatical symbols. The sequence of grammatical symbols
3288 that make up the fine tag is specified in the mandatory attribute
3289 \texttt{tags}. In this sequence, symbols are separated by a dot, and
3290 the asterisk ``\texttt{*}'' is used to express that any sequence of
3291 symbols may appear in its place. It is also possible to define
3292 lexicalized categories, specifying the lemma of the word in the
3293 attribute \texttt{lemma}.
3295 \item[\texttt{def-mult}]: defines special categories
3296 (\emph{multicategories}) made of more than one category, in order to
3297 deal with entries with more than one lexical form, like in the example
3298 given in the previous section. Each category is defined as a set of
3299 valid sequences (\texttt{sequence}) of previously defined categories
3300 or of fine-grained tags. It is designed for contractions, verbs with enclitic
3301 pronouns, etc.
3303 \item[\texttt{sequence}]: defines a sequence of elements, which can be
3304 categories (\texttt{label-item}) or fine-grained tags
3305 (\texttt{tags-item}). Using fine-grained tags directly is useful if one wishes
3306 to use a sequence of grammatical symbols that is not part of any
previously defined fine tag \nota{MG: instead of 'fine tag', doesn't
this refer to 'category' here?} or that represents a greater
specialization of a defined fine tag \nota{idem: category}.
3311 \item[\texttt{label-item}]: is used to refer to a category or coarse
3312 tag previously defined, to be specified in the mandatory attribute
3313 \texttt{label}.
\item[\texttt{forbid}]: this (optional) section defines restrictions
as sequences of categories (\texttt{label-sequence}) that cannot occur
in the language involved. In the current version, because the tagger
is based on first-order hidden Markov models, sequences can only be
made of \emph{two} \texttt{label-item} elements.
3321 \item[\texttt{label-sequence}]: defines a sequence of categories
3322 (\texttt{label-item}).
3324 \item[\texttt{enforce-rules}]: this (optional) section allows defining
3325 restrictions in the form of obligations.
\item[\texttt{enforce-after}]: defines a restriction stating that a
certain category can only be followed by the categories belonging to
the set of categories defined in \texttt{label-set}. Note that this
kind of restriction is equivalent to defining several forbidden
(\texttt{forbid}) sequences (\texttt{label-sequence}), each made of
the category given in the mandatory attribute \texttt{label} followed
by one of the categories that do not belong to the set defined in
\texttt{label-set}. For this reason, this kind of restriction must be
used very cautiously.
3337 \item[\texttt{label-set}]: defines a set of categories
3338 (\texttt{label-items}).
3340 \item[\texttt{preferences}]: used to define priorities in terms of
3341 which fine-grained tag must be delivered in the tagger output when two or more
3342 fine tags are assigned to the same category.
3344 \item[\texttt{prefer}]: specifies that, in case of conflict between
3345 different fine-grained tags assigned to the same category, the tagger must
3346 output the tag specified in the mandatory attribute \texttt{tags}. If
3347 a category contains more than one of the fine tags included in these
3348 \texttt{prefer} elements, the tag defined in the first place will be
3349 the selected one.
3350 \end{description}
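
The equivalence between an \texttt{enforce-after} rule and a set of
forbidden sequences, stated in the description of that element, can be
spelled out with a short sketch. The function name
\texttt{expand\_enforce\_after} is hypothetical, and the list of all
categories of the tagset is assumed to be available.

\begin{verbatim}
def expand_enforce_after(label, allowed, categories):
    """List the forbidden pairs implied by one enforce-after rule.

    label is the category named in the rule, allowed is the set of
    categories in its label-set, and categories is the whole tagset.
    """
    return [(label, other) for other in categories
            if other not in allowed]
\end{verbatim}

With a tagset of only four categories, a rule that allows two of them
after \texttt{prnpro} already implies two forbidden sequences; with a
realistic tagset the number of implied prohibitions is much larger,
which is why such rules call for caution.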
3352 Figures~\ref{fg:exemple_desambiguador1}
3353 and~\ref{fg:exemple_desambiguador2} contain an example with the most
3354 significant parts of a tagger specification file defined by the DTD
3355 just described.
% DTD moved to the Appendix
3360 \begin{figure}[htbp]
3361 \begin{small}
3362 \begin{alltt}
3363 <?\textsl{xml} \textsl{version}="1.0" \textsl{encoding}="iso-8859-1"?>
3364 <!\textsl{DOCTYPE} \textbf{tagger} SYSTEM "tagger.dtd">
3365 <\textbf{tagger} \emph{name}="es-ca">
3366 <\textbf{tagset}>
3367 <\textbf{def-label} \textsl{name}="adv">
3368 <\textbf{tags-item} \textsl{tags}="adv"/>
3369 </\textbf{def-label}>
3370 <\textbf{def-label} \textsl{name}="detnt" \textsl{closed}="true">
3371 <\textbf{tags-item} \textsl{tags}="detnt"/>
3372 </\textbf{def-label}>
3373 <\textbf{def-label} \textsl{name}="detm" \textsl{closed}="true">
3374 <\textbf{tags-item} \textsl{tags}="det.*.m"/>
3375 </\textbf{def-label}>
3376 <\textbf{def-label} \textsl{name}="vlexpfci">
3377 <\textbf{tags-item} \textsl{tags}="vblex.pri"/>
3378 <\textbf{tags-item} \textsl{tags}="vblex.fti"/>
3379 <\textbf{tags-item} \textsl{tags}="vblex.cni"/>
3380 </\textbf{def-label}>
3381 <\textbf{def-mult} \textsl{name}="infserprnenc" \textsl{closed}="true">
3382 <\textbf{sequence}>
3383 <\textbf{label-item} \textsl{label}="vserinf"/>
3384 <\textbf{label-item} \textsl{label}="prnenc"/>
3385 </\textbf{sequence}>
3386 <\textbf{sequence}>
3387 <\textbf{label-item} \textsl{label}="vserinf"/>
3388 <\textbf{label-item} \textsl{label}="prnenc"/>
3389 <\textbf{label-item} \textsl{label}="prnenc"/>
3390 </\textbf{sequence}>
3391 </\textbf{def-mult}>
3392 <\textbf{def-mult} \textsl{name}="prepdet" \textsl{closed}="true">
3393 <\textbf{sequence}>
3394 <\textbf{label-item} \textsl{label}="prep"/>
3395 <\textbf{tags-item} \textsl{tags}="det.def.m.sg"/>
3396 </\textbf{sequence}>
3397 </\textbf{def-mult}>
3398 </\textbf{tagset}>
3399 <!-- ... -->
3400 \end{alltt}
3401 \end{small}
3402 \caption{Example of a tagger definition file (continues in
3403 Figure~\ref{fg:exemple_desambiguador2}).}
3404 \label{fg:exemple_desambiguador1}
3405 \end{figure}
3408 \begin{figure}[htbp]
3409 \begin{small}
3410 \begin{alltt}
<!-- ... -->
  <\textbf{forbid}>
    <\textbf{label-sequence}>
      <\textbf{label-item} \textsl{label}="prep"/>
      <\textbf{label-item} \textsl{label}="vlexpfci"/>
    </\textbf{label-sequence}>
    <!-- ... -->
  </\textbf{forbid}>
  <\textbf{enforce-rules}>
    <\textbf{enforce-after} \textsl{label}="prnpro">
      <\textbf{label-set}>
        <\textbf{label-item} \textsl{label}="prnpro"/>
        <\textbf{label-item} \textsl{label}="vlexpfci"/>
        <!-- ... -->
      </\textbf{label-set}>
    </\textbf{enforce-after}>
    <!-- ... -->
  </\textbf{enforce-rules}>
  <\textbf{preferences}>
    <\textbf{prefer} \textsl{tags}="vblex.pii.p3.sg"/>
    <\textbf{prefer} \textsl{tags}="vbser.pii.p3.sg"/>
    <!-- ... -->
  </\textbf{preferences}>
</\textbf{tagger}>
3435 \end{alltt}
3436 \end{small}
3437 \caption{Example of a tagger definition file (comes from
3438 Figure~\ref{fg:exemple_desambiguador1}).}
3439 \label{fg:exemple_desambiguador2}
3440 \end{figure}
\subsection{Some questions about the training of the part-of-speech
tagger} The training of the part-of-speech tagger can be carried out
either in a supervised manner, using manually disambiguated texts, or
in an unsupervised manner, using ambiguous texts.
When the training is carried out with ambiguous texts (unsupervised),
the required text format can be obtained automatically from a plain
text corpus in the chosen language using the system's morphological
analyser; in this case, the format of the text forms will be the one
defined in Equation~\ref{eq:formaanalizada2} (its description can be
found on page~\pageref{formaanalizada}). As the grammar shows, each
analysed surface form can have more than one analysis (an
\emph{analysedform} can consist of more than one
\emph{lexicalmultiform}).
3458 \begin{equation}
3459 \label{eq:formaanalizada2}
3460 \begin{array}{rcl} \mbox{analysedform}&\to&
3461 \mbox{lexicalmultiform}\; [\; \mbox{lexicalmultiform} \; ]^* \\
3462 \mbox{lexicalmultiform}&\to& \mbox{lexicalform}\;
3463 [\;\mbox{lexicalform}\; ]^*\;\mbox{lemma-queue?} \\
3464 \mbox{lexicalform}&\to&\mbox{lemma}\;\mbox{finetag}\\
3465 \mbox{lemma-queue}&\to&\mbox{lemma}\\
3466 \mbox{finetag}&\to&\mbox{morphsymbol}\;[\;\mbox{morphsymbol}\;]^* \\
3467 \end{array}
3468 \end{equation}
3469 \label{formaanalizada2}
For supervised training we need manually disambiguated text. The
format of the text forms in this case is like the format delivered by
the morphological analyser (see Section~\ref{se:flujodatos}) except
that, since the text is already disambiguated, a surface form can
never produce more than one lexical multiform, as shown in
Equation~\ref{eq:formadesambiguada} (a \emph{disambiguatedform} always
consists of a single \emph{lexicalmultiform}).
3479 \begin{equation}
3480 \label{eq:formadesambiguada}
3481 \begin{array}{rcl}
3482 \mbox{disambiguatedform}&\to&\mbox{lexicalmultiform}\\
3483 \mbox{lexicalmultiform}&\to&\mbox{lexicalform}\;[\;\mbox{lexicalform}\;]^*\;\mbox{lemma-queue?}\\
3484 \mbox{lexicalform}&\to&\mbox{lemma}\;\mbox{finetag}\\
3485 \mbox{lemma-queue}&\to&\mbox{lemma}\\
3486 \mbox{finetag}&\to&\mbox{morphsymbol}\;[\;\mbox{morphsymbol}\;]^* \\
3487 \end{array}
3488 \end{equation}
Finally, we also need the dictionary of the language involved to train
the tagger. This dictionary is used to determine, in combination with
the tagset specification, the different ambiguity classes with which
the tagger will work.
3496 Figure \ref{fig:dependencias} shows the dependency diagram for the
3497 training and the use of the tagger.
\nota{This diagram will change with the new tagger - Sergio}
3501 \begin{figure}
3502 \begin{center}
3503 \includegraphics[width=15cm]{diagram}
3504 \end{center}
3505 \caption{Dependency diagram for the part-of-speech tagger.}
3506 \label{fig:dependencias}
3507 \end{figure}
3510 \newpage
3512 \section[Transfer pre-processing]{Auxiliary module: transfer
3513 pre-processing module}
3514 \label{se:pretransfer}
\subsection{Justification} The transfer pre-processing module
\texttt{pretransfer} is in charge of separating compound multiwords
(see page~\pageref{ss:multipalabras}) and moving certain parts of
multiwords with inner inflection or \emph{split lemma} forms. This
module processes the tagger output and generates an input suitable for
the transfer module. The processing performed by this module is
necessary for two reasons:
3523 \begin{itemize}
3524 \item So that the transfer module can process these units separately
3525 in order to deal with, for example, the movement of clitic pronouns
3526 when changing from enclitic to proclitic and vice versa.
3527 \item So that the bilingual dictionary only has to store information
3528 about the lemmas to be translated. If the particles that make up a
3529 multiword are included jointly in the bilingual dictionary, the
3530 dictionary would have to store an entry for each of the different
3531 combinations. By separating compound multiwords and processing multiwords with
3532 inner inflection, we can avoid having
3533 entries including inflection variations in the bilingual dictionary.
3534 \end{itemize}
3536 \subsection{Behaviour and example}
The program replaces each \texttt{+} in the data stream (which
corresponds to a \texttt{<j/>} element in the dictionary) with an
end-of-word symbol, a blank and a beginning-of-word symbol. Moreover,
if the form is a multiword with split lemma, the queue is moved to the
position between the lemma of the first word of the multiword and its
first grammatical symbol.
The task of generating an output that restores the original order
accepted by the generator is left to the rules of the transfer
module, which are also responsible for creating the compound
multiwords that may be required in the target language. In general,
the generator works with the same multiwords as the morphological
analyser, and with the elements in the same order; that is the reason
why this task has to be done in the transfer module.
3552 We show below the result of applying this process to the compound
3553 multiword \textit{darlo} ("give it" in Spanish):
3555 \begin{small}
3556 \begin{alltt}
3557 \$ pretransfer
3558 ^dar<vblex><inf>+lo<prn><enc><p3><m><sg>\$ \(\longleftarrow\) \textrm{input}
3559 ^dar<vblex><inf>\$ ^lo<prn><enc><p3><m><sg>\$ \(\longleftarrow\) \textrm{output}
3560 \end{alltt}
3561 \end{small}
As can be seen, the module simply splits the lexical forms of a
compound multiword into individual lexical forms.
3566 When the input is a multiword with split lemma, the process is as
3567 shown in the following example for the Spanish multiword
3568 \textit{echarte de menos} ("to miss you"):
3570 \begin{small}
3571 \begin{alltt}
3572 \$ pretransfer
3573 ^echar<vblex><inf>+te<prn><enc><p2><m><sg># de menos\$
3574 ^echar# de menos<vblex><inf>\$ ^te<prn><enc><p2><m><sg>\$
3575 \end{alltt}
3576 \end{small}
Here, besides splitting the multiword into lexical forms, the module
moves the invariable lemma queue to the position mentioned above. As
you can see, semantic units are preserved after the movement of the
invariable queue, since \textit{echar de menos} can be considered a
verbal unit with its own meaning.
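
The behaviour shown in the two examples can be reproduced with the
following sketch. It is a simplified model written for this
documentation, not the source of the \texttt{pretransfer} program: it
handles a single unit delimited by a circumflex and a dollar sign,
ignores escaped characters and surrounding text, and the Python
function name \texttt{pretransfer} is merely borrowed from the module.

\begin{verbatim}
def pretransfer(unit):
    """Split a compound multiword and relocate its invariable queue."""
    if not (unit.startswith("^") and unit.endswith("$")):
        raise ValueError("expected a unit delimited by ^ and $")
    # Everything after '#' is the invariable queue, if there is one.
    main, has_queue, queue = unit[1:-1].partition("#")
    forms = main.split("+")
    if has_queue:
        # Move the queue between the first lemma and its first symbol.
        lemma, bracket, symbols = forms[0].partition("<")
        forms[0] = lemma + "#" + queue + bracket + symbols
    # Each lexical form becomes a unit of its own, separated by blanks.
    return " ".join("^" + form + "$" for form in forms)
\end{verbatim}

Given the inputs of the two examples above, the function returns the
corresponding outputs: the forms of \textit{darlo} are simply
separated, whereas for \textit{echarte de menos} the queue is also
moved next to the lemma \textit{echar}.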
3587 \section{Lexical selection module}
3588 \label{se:seleccio_lex}
3591 \subsection{Introduction}
When the Apertium system is used to translate between languages that
are less closely related than the ones dealt with in the first stages
of the engine, the question of lexical selection becomes significant,
because there are more cases, and more critical ones, in which a
source language word can have more than one translation in the target
language. For this reason we created a new module, the lexical
selection module, which deals with this problem.
Before going into its characteristics, we will see the two ways in
which the problems of \emph{multiple equivalence} (the existence of
more than one possible translation in the target language for a source
language lexical form) are tackled in Apertium.

On the one hand, we have the situation where there is no big
difference in meaning between the multiple equivalents in the target
language, and choosing one or the other cannot lead to any translation
error. We could say that between these equivalents there is a relation
of synonymy or quasi-synonymy. In such a case, the linguist chooses
one of the lemmas as the translation (generally the most frequent or
usual one), and adds a direction restriction to the other lemmas (with
the attributes \texttt{LR} or \texttt{RL}) so that they are translated
in the opposite direction but not in the direction where there are
multiple equivalents.
On the other hand, we have the case where there is a clear difference
in meaning between the multiple equivalents, which can lead to
translation errors if the inappropriate lemma is chosen. These are the
cases dealt with by the new lexical selection module. The linguist has
to encode entries with the attributes \texttt{slr} or \texttt{srl}
described in the next section, thus identifying the different
translation options; then, the lexical selection module, by means of
statistical methods, chooses the translation which is most suitable in
a given context.
Sometimes it is not easy to decide whether a multiple equivalence
situation should be solved in one way or the other. For example, if
there is a difference in meaning between two or more lemmas in the
target language, but we think that the lexical selection module will
not be capable of choosing the right translation from the context, we
will follow the first method: choose a fixed translation (the most
general one, suitable in the largest number of situations) and add a
direction restriction to the rest of the translations. In the other
cases, we will encode the entries so that the decision is left to the
lexical selection module.
When we use an Apertium system without a lexical selection module, the
only way to add entries with different possible translations is the
first one, that is, choosing a single translation and marking the
other equivalents with a direction restriction. If bilingual
dictionaries with multiple translations, encoded with the attributes
\texttt{slr} or \texttt{srl}, are used in a system that does not have
a lexical selection module, a style sheet will convert these entries
designed for a lexical selection module into entries with direction
restrictions \texttt{LR} or \texttt{RL}, so that one of the multiple
equivalents (the one chosen as default entry by the linguist) becomes
the fixed translation of the source language lemma.
3658 As examples of bilingual equivalencies that should have a direction
3659 restriction, we can give the translation pairs \texttt{ca-es}
3660 \emph{encara -- aún/todavía} ("still") and \emph{sobtat --
3661 súbito/repentino} ("sudden"), the first one of which could be encoded
3662 like this:
3663 \begin{alltt}
3664 \begin{small}
<e r="LR">
  <p>
    <l>aún<s n="adv"/></l>
    <r>encara<s n="adv"/></r>
  </p>
</e>
<e>
  <p>
    <l>todavía<s n="adv"/></l>
    <r>encara<s n="adv"/></r>
  </p>
</e>
3678 \end{small}
3679 \end{alltt}
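
To make the effect of the restriction concrete, the following table
sketches how the two entries above behave in each translation
direction. It assumes, as in the example, that the left side holds the
Spanish lemma and the right side the Catalan one, and that \texttt{LR}
restricts an entry to the left-to-right direction:

\begin{center}
\begin{tabular}{lll}
Direction & Input lemma & Output lemma \\ \hline
\texttt{es} $\rightarrow$ \texttt{ca} & \emph{aún} & \emph{encara} \\
\texttt{es} $\rightarrow$ \texttt{ca} & \emph{todavía} & \emph{encara} \\
\texttt{ca} $\rightarrow$ \texttt{es} & \emph{encara} & \emph{todavía} \\
\end{tabular}
\end{center}

In the \texttt{ca} $\rightarrow$ \texttt{es} direction the entry for
\emph{aún} is ignored, so \emph{encara} is left with a single
equivalent.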
As examples of the second case (multiple equivalents with a big
difference in meaning) we have the pairs \texttt{es-ca} \emph{hoja --
full/fulla} (``sheet/leaf'') and \emph{muñeca -- nina/canell}
(``doll/wrist''), as well as the \texttt{en-ca} examples shown on page
3685 \pageref{entrades_lextor}, where it is described how to specify these
3686 multiple equivalents in the bilingual dictionary.
3691 \begin{figure} {\footnotesize \setlength{\tabcolsep}{0.5mm}
3692 \begin{center}
3693 \begin{tabular}{ccccccccc} \\
3694 \parbox{0.95cm}{source language text} \\ $\downarrow$ \\
3695 \framebox{\parbox{1.0cm}{de-for\-matter}} $\rightarrow$ &
3696 \framebox{\parbox{0.6cm}{morph. anal.}} $\rightarrow$ &
3697 \framebox{\parbox{1.0cm}{POS tagger}} $\rightarrow$ &
3698 \framebox{\parbox{0.6cm}{lex. select.}} $\rightarrow$ &
3699 \framebox{\parbox{0.85cm}{struct. transf.}} $\rightarrow$ &
3700 \framebox{\parbox{0.6cm}{morph. gen.}} $\rightarrow$ &
3701 \framebox{\parbox{1.2cm}{post\-generator}} $\rightarrow$ &
3702 \framebox{\parbox{1.0cm}{re-for\-matter}} \\ & & & & $\updownarrow$ &
& & $\downarrow$ \\ & & & & \framebox{\parbox{0.8cm}{lex. transf.}} &
& &
\parbox{0.95cm}{target language text} \\
3706 \end{tabular}
3707 \end{center} }
\caption{The nine modules that make up the assembly line in
version 2 of the Apertium machine translation system.}
3710 \label{fig:moduls}
3711 \end{figure}
Figure~\ref{fig:moduls} shows the new assembly line of version 2
of Apertium.\footnote{This figure replaces
Figure~\ref{fg:modules} on page~\pageref{pg:modules}, which represents
version 1 of Apertium.} \nota{MG: should the figure on
page 6 be replaced with this one?} The module in charge of lexical
3718 selection (lexical selector) runs after the part-of-speech tagger and
3719 before the structural transfer module; therefore, this new module
3720 works only with source language information.
3723 Section~\ref{se:preprocessament} next describes the pre-processing
3724 that must be done on a bilingual dictionary containing more than
3725 one translation per entry (whether the system uses a
3726 lexical selector or not), and Section~\ref{se:lextor} describes
3727 how the lexical selector works and how it has to be trained.
3731 \subsection{Pre-processing of the bilingual dictionaries
3732 }\label{se:preprocessament}
3734 Bilingual dictionaries have been modified to allow the specification
3735 of more than one translation per entry (refer to Section
3736 \ref{dic_lextor} to learn how to write such dictionary entries); this
3737 fact makes it necessary to pre-process these dictionaries, since the
3738 Apertium engine works with compiled dictionaries in which there is
3739 only one possible translation for each word.
3741 The pre-processing of dictionaries is done automatically during
3742 compilation, therefore the final user does not need to perform any
3743 specific action.
3746 \subsubsection{Pre-processing without lexical selection module}
3748 When bilingual dictionaries with multiple equivalents are used in a
3749 system where there is no lexical selection module, the pre-processing
3750 is done by the application of the style sheet
3751 \texttt{translate-to\--de\-fault\--e\-qui\-va\-lent.xsl}. This style
3752 sheet turns dictionaries with multiple translations per entry into
3753 dictionaries with only one translation per entry; to do this, it
3754 chooses as translation the entry marked as default, and adds a
3755 direction restriction (\texttt{LR} or \texttt{RL} as applicable) to
3756 the other entries, so that they are only translated in the translation
3757 direction where there is no equivalent multiplicity. The style sheet
3758 is called from the \texttt{Makefile}.
For example, the result of applying the style sheet to the first
three entries shown on page~\pageref{entrades_lextor} is the
following:
3765 \begin{alltt}
3766 \begin{small}
<e>
  <p>
    <l>flat<s n="n"/></l>
    <r>pis<s n="n"/><s n="m"/></r>
  </p>
</e>

<e r="LR">
  <p>
    <l>floor<s n="n"/></l>
    <r>pis<s n="n"/><s n="m"/></r>
  </p>
</e>

<e r="RL">
  <p>
    <l>floor<s n="n"/></l>
    <r>terra<s n="n"/><s n="m"/></r>
  </p>
</e>
3787 \end{small}
3788 \end{alltt}
\subsubsection{Pre-processing with lexical selection module}
3792 If the Apertium system works with a lexical selection module, the
3793 bilingual dictionary must be pre-processed in order to obtain:
3794 \begin{itemize}
3795 \item a monolingual dictionary that, for each source language word
3796 (for example \emph{look}) delivers all the possible translation marks
3797 or equivalents (\texttt{look\_\_mirar D} and
3798 \texttt{look\_\_semblar}); this dictionary will be used by the lexical
3799 selection module; and
3801 \item a new bilingual dictionary that, given a word with the lexical
3802 selection already done (for example \texttt{look\_\_semblar}) delivers
3803 the translation (\emph{semblar}); this will be the bilingual
3804 dictionary to be used in the lexical transfer.
3806 \end{itemize}
3809 This pre-processing is automatically done by means of the following
3810 software during dictionary compilation:
3811 \begin{itemize}
\item \texttt{apertium-gen-lextormono}, which receives three
parameters:
3814 \begin{itemize}
3815 \item the translation direction for which you want to generate the
3816 monolingual dictionary used in the lexical selection; \texttt{lr}
3817 for the translation left to right, and \texttt{rl} for the
3818 translation right to left;
\item the bilingual dictionary to be pre-processed; and
3820 \item the file where the output monolingual dictionary has to be
3821 written.
3822 \end{itemize}
\item \texttt{apertium-gen-lextorbil}, which receives three parameters:
3825 \begin{itemize}
3826 \item the translation direction (\texttt{lr} or \texttt{rl}) for
3827 which you want to generate the bilingual dictionary to be used by
3828 the lexical transfer module;
3829 \item the bilingual dictionary to be pre-processed; and
3830 \item the file where the output bilingual dictionary has to be
3831 written.
3832 \end{itemize}
3833 \end{itemize}
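
As a rough sketch of how these two programs might be invoked by hand
(dictionary compilation normally does this automatically), the
commands below assume that the three parameters are given in the order
listed above and that the input is the bilingual dictionary, as stated
at the beginning of this section. The file names are hypothetical and
serve only as illustration:

\begin{small}
\begin{verbatim}
# left-to-right direction; all file names are illustrative
apertium-gen-lextormono lr en-ca.dix en-ca.lextormono.dix
apertium-gen-lextorbil  lr en-ca.dix en-ca.lextorbil.dix
\end{verbatim}
\end{small}

The manual page of each program remains the authoritative reference
for its exact syntax.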
3835 \subsection{Execution of the lexical selection
3836 module}\label{se:lextor}
3838 The module responsible for the lexical selection runs after the
3839 part-of-speech tagger and before the structural transfer (see
Figure~\ref{fig:moduls} on page~\pageref{fig:moduls}); therefore, it
3841 uses only information from the source language. However, during the
3842 training of the module, target language information is also used.
3845 \subsubsection{Training}\label{se:entrenament}
3847 To train the lexical selection module, a corpus in the source language
3848 and another one in the target language are required; they do not need
3849 to be related. Both corpora must be pre-processed before the
training. This pre-processing, which consists of analysing the corpora and
performing POS disambiguation, can be done with
3852 \texttt{apertium-prepro\-cess\--cor\-pus\--lex\-tor}.
3854 The training of the module that performs the lexical selection
consists of the following tasks:\footnote{The training of the models
used for lexical selection has been automated in all the packages
that use it. Furthermore, all the software mentioned has its own UNIX manual
page.}
3862 \begin{enumerate}
3863 \item Obtain the list of words that will be ignored when performing
lexical selection (\emph{stopwords}). This list can be compiled manually
3865 or using \texttt{apertium-gen-stopwords-lextor};
3866 \item Obtain the list of (source language) words that have more than
3867 one translation in the target language, using
3868 \texttt{apertium-gen-wlist-lextor};
3869 \item Translate to the target language all the words obtained in the
3870 previous step, using \texttt{apertium-gen-wlist-lextor-translation};
3871 \item Running \texttt{apertium-lextor --trainwrd} and using the target
3872 language pre-processed corpus, train a word co-occurrence model for
3873 the words obtained in the previous step;
3874 \item Running \texttt{apertium-lextor --trainlch} and using the source
3875 language pre-processed corpus, the dictionaries generated by the
3876 programs mentioned in Section~\ref{se:preprocessament} and the word
3877 co-occurrence models calculated in the previous step, train a
3878 co-occurrence model for each of the translation marks of those words
3879 that can have more than one translation in the target language.
3880 \end{enumerate}
3882 \subsubsection{Use}\label{se:us}
3884 The word co-occurrence models
3885 calculated for each translation mark as described in the previous
3886 section provide the information required to perform lexical selection
3887 with information from the context.
3889 Lexical selection is done by \texttt{apertium-lextor --lextor}; the
3890 formats used to communicate with the rest of the modules of the
3891 translation engine are:
3893 \begin{description}
3894 \item [Input:] text in the same format as the input for the structural
3895 transfer module, that is, text analysed and disambiguated, with
3896 invariable queues of multiwords moved before morphological tags.
3897 \item [Output:] text in the same format, but with the translation mark
3898 to be used when executing lexical transfer.
3899 \end{description}
3902 The following example illustrates the input/output formats used by the
3903 lexical selector (we have assumed in the example that only the English
3904 verb \emph{get} has more than one translation equivalent in the
3905 dictionaries):
3906 \begin{itemize}
3907 \item Source language text (English): \emph{To get to the city centre}
3908 \item Lexical selector input: \verb!^To<pr>$!
3909 \verb!^get<vblex><inf>$! \verb!^to<pr>$! \verb!^the<det><def><sp>$!
3910 \verb!^city<n><sg>$! \verb!^centre<n><sg>$!
3911 \item Translation marks in the en-ca bilingual dictionary for the verb
3912 \emph{get}: \texttt{rebre}, \texttt{agafar}, \texttt{arribar},
3913 \texttt{aconseguir D}
3914 \item Lexical selector output: \verb!^To<pr>$!
3915 \verb!^get__arribar<vblex><inf>$! \verb!^to<pr>$!
3916 \verb!^the<det><def><sp>$! \verb!^city<n><sg>$!
3917 \verb!^centre<n><sg>$!
3918 \end{itemize}
3921 \newpage
3922 \section{Structural transfer module}
3923 \label{ss:transfer}
\nota{Work to do (mlf):
\begin{itemize}
\item There is considerable inconsistency in the terminology used to
refer to concepts and in the names used for the programs.
\item I have tried in each case to replace the expression \emph{per
defecte} (``by default'') with a more suitable one; but we will need
to determine which situation applies in each case.
\end{itemize}}
3935 \subsection{Introduction}
In 2007, Apertium incorporated a more advanced structural transfer system than
the one used until then; it became necessary when we started developing
machine translators, such as the \emph{English}--\emph{Catalan}
translator, for language pairs less closely related than
those dealt with before.

This enhanced transfer system consists of three modules, the first one
3944 of which can be used in isolation in order to run a
3945 \textbf{shallow-transfer} system (which is the transfer system used so
3946 far for related language pairs such as \emph{Spanish}--\emph{Catalan} or
3947 \emph{Spanish}--\emph{Galician}). When the system is used for less
3948 related language pairs and, therefore, an
3949 \textbf{advanced transfer} becomes necessary, the three transfer modules will be executed.
3951 The two transfer systems differ in the number of passes over the input
3952 text. The shallow-transfer system makes structural transformations
3953 with a single pass of the rules, which detect sequences or
3954 \emph{patterns} of lexical forms and perform on them the required
verifications and changes. On the other hand, the advanced transfer
system works with a new architecture that makes it possible to detect
\emph{patterns of patterns} of lexical forms in three passes, carried out
by its three modules.
3960 We describe next the characteristics of the structural transfer system. Section
3961 \ref{functransfer} describes the shallow-transfer system and Section
3962 \ref{apertium2}, the advanced transfer system. The description of the
3963 shallow-transfer system is also applicable to the first module of the
3964 advanced transfer system, with the differences mentioned in that
3965 section. Section \ref{formatotransfer} describes the format used to
3966 create rules in both systems. In Section \ref{noutransfer} there is a
3967 detailed description of how the three modules of the advanced transfer
3968 system work, and finally, Section \ref{ss:preproceso_transfer}
3969 describes the pre-processing required by the modules.
3972 \subsection{Shallow-transfer}
3973 \label{functransfer}
3976 In this system, only the first of the three modules that compose the
3977 advanced transfer system is used. This module is called
3978 \emph{chunker}.
3980 The design of the language and the compiler used to generate the
3981 structural transfer module is largely based upon the MorphTrans
3982 language described in \cite{garridoalenda01p} and used by the MT
3983 systems \textsf{interNOSTRUM}
3984 \cite{canals01b,garridoalenda01p,garrido99j} (Spanish--Catalan) and
3985 \textsf{Traductor Universia} \cite{garrido03p, gilabert03j}
3986 (Spanish--Portuguese), developed by the Transducens group at the
3987 Universitat d'Alacant.
3990 The transfer process is organized around patterns representing
3991 fixed-length sequences of source language lexical forms (SLLFs) (see
3992 page~\pageref{pg:FSFL} for a description of lexical form (LF)); a
3993 sequence follows a certain pattern if it contains the sequence of lexical forms
3994 of the pattern. Patterns do not need to be constituents or
3995 phrases in the syntactic sense: they are mere concatenations of
lexical forms that may need joint processing beyond
simple word-for-word translation, due to the grammatical divergences
between SL and TL (gender and number changes, reorderings,
prepositional changes, etc.). The catalogue of patterns defined for a
4000 certain language is selected with a view to covering the most common structural
4001 transformations. When source language and target language
4002 are syntactically similar, as is the case between Spanish, Catalan and
4003 Galician, simple rules based on sequences of lexical categories
4004 achieve a reasonable translation quality.
4006 The transfer module detects, in the SL, sequences of lexical forms
4007 that match one of the patterns previously defined in the pattern
4008 catalogue, and processes them applying the corresponding structural
4009 transfer rule, doing at the same time the lexical transfer by reading
4010 the bilingual dictionary.
4012 The \emph{pattern detection} phase occurs as follows: if the transfer
4013 module starts to process the $i$-th SLLF of the text, $l_i$, it tries
4014 to match the sequence of SLLFs $l_i, l_{i+1}, \ldots$ with all of the
4015 patterns in its pattern catalogue: the longest matching pattern is
4016 chosen, the matching sequence is processed (see below), and processing
4017 continues at SLLF $l_{i+k}$, where $k$ is the length of the pattern
4018 just processed. If no pattern matches the sequence starting at SLLF
$l_i$, it is translated as an isolated word and processing restarts at
SLLF $l_{i+1}$ (when no patterns are applicable, the system resorts to
word-for-word translation). Note that each SLLF is processed only
once: patterns do not overlap; hence, processing occurs left to right
and in distinct ``chunks''.
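
The detection procedure can be traced on a small, purely illustrative
case. Assume a catalogue with only two patterns, determiner--noun and
determiner--noun--adjective, and an input of six SLLFs whose lexical
categories are determiner, noun, adjective, verb, determiner and noun
(both the catalogue and the input are hypothetical):

\begin{enumerate}
\item At $l_1$, both patterns match; the longest one
(determiner--noun--adjective, $k=3$) is chosen, $l_1 l_2 l_3$ are
processed together, and processing continues at $l_4$.
\item At $l_4$ (the verb), no pattern matches; $l_4$ is translated as
an isolated word and processing restarts at $l_5$.
\item At $l_5$, only determiner--noun matches, because just two SLLFs
remain; $l_5 l_6$ are processed together ($k=2$) and the input is
exhausted.
\end{enumerate}

Each SLLF has been consumed exactly once, which gives the segmentation
$[l_1\, l_2\, l_3]\;[l_4]\;[l_5\, l_6]$.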
In the \emph{pattern processing} phase, the system takes the detected
4027 sequence of SLLFs and builds (using a program to consult the bilingual
4028 dictionary) a sequence of TL lexical forms (TLLFs) obtained after the
4029 application of the operations described in the rule associated to the
detected pattern (reordering, addition, replacement or deletion of
4031 words, inflection changes, etc.). The information that does not change
4032 is automatically copied from SL to TL. The resulting data, that is,
4033 the lemmas with their associated morphological tags, are sent to the
4034 generator, which creates the inflected forms.
4038 For instance, the Spanish sequence \emph{una señal inequívoca} ("an
4039 unmistakable signal"), that would go from the tagger to the transfer
4040 module in the following format~\footnote{The example has been
4041 presented in a way that it does not contain superblanks with format
4042 information, so that the linguistic side of the transformation is
4043 clearer. See Chapter \ref{se:flujodatos}.}:\\
4045 \begin{alltt}
4046 \begin{small}
4047 \textasciicircum\textbf{uno}<det><ind><f><sg>\$
4048 \textasciicircum\textbf{señal}<n><f><sg>\$
4049 \textasciicircum\textbf{inequívoco}<adj><f><sg>\$
4050 \end{small}
4051 \end{alltt}
4054 \noindent{would be detected as a pattern by a rule for
4055 determiner--noun--adjective.} The transfer module would consult the
4056 bilingual dictionary to get the Catalan equivalents and, as it would
4057 detect a gender change in the word \emph{señal} (its Catalan
4058 translation \emph{senyal} is masculine), it would propagate this
4059 change to the determiner and the adjective to deliver the output
4060 sequence:\\
4062 \begin{alltt}
4063 \begin{small}
4064 \textasciicircum\textbf{un}<det><ind><m><sg>\$
4065 \textasciicircum\textbf{senyal}<n><m><sg>\$
4066 \textasciicircum\textbf{inequívoc}<adj><m><sg>\$
4067 \end{small}
4068 \end{alltt}
4070 \noindent{which the generation module would turn into the Catalan
4071 inflected sequence: \emph{un senyal inequívoc}.}
4073 The task of most rules is to ensure gender and number agreement in
4074 simple noun phrases (determi\-ner--noun, determiner--noun--adjective,
4075 determiner--adjective--noun, determiner--adjective, etc.), provided
4076 that there is agreement between the SLLFs of the detected
4077 pattern. These rules are required either because the noun changes its
4078 gender or number between SL and TL (as in the previous example) or
4079 because gender or number in the TL have to be determined due to the
4080 fact that it was ambiguous in SL for some of the words (for example,
4081 the Catalan determiner \emph{cap} can be translated into Spanish as
4082 \emph{ningún} (masc.) or \emph{ninguna} (fem.) depending on the
4083 accompanying noun: \emph{cap cotxe} (\texttt{ca}) $\rightarrow$
4084 \emph{ningún coche} (\texttt{es}) and \emph{cap casa} (\texttt{ca})
$\rightarrow$ \emph{ninguna casa} (\texttt{es})). Furthermore, there
are other rules defined to solve frequent transfer problems between
Spanish, Catalan and Galician, such as, among others:
4089 \begin{itemize}
\item rules to change prepositions in certain constructions: \emph{en
Barcelona} (\texttt{es}) $\rightarrow$ \emph{a Barcelona}
4094 (\texttt{ca}); \emph{consiste en hacer} (\texttt{es}) $\rightarrow$
4095 \emph{consisteix a fer} (\texttt{ca});
4097 \item rules to add/remove the preposition \emph{a} in certain Galician
4098 modal constructions with the verbs \emph{ir} and \emph{vir}: \emph{vai
4099 comprar} (\texttt{gl}) $\rightarrow$ \emph{va a comprar}
4100 (\texttt{es});
4102 \item rules for articles before proper nouns: \emph{ve la Marta}
4103 (\texttt{ca}) $\rightarrow$ \emph{viene Marta} (\texttt{es});
4105 \item lexical rules, for instance, to decide the correct translation
4106 of the adverb \emph{molt} (\texttt{ca}) into Spanish (\emph{muy,
4107 mucho}) or of the adjective \emph{primeiro} (\texttt{gl}) or
4108 \emph{primer} (\texttt{ca}) into Spanish (\emph{primer, primero});
4110 \item rules to displace atonic or clitic pronouns, whose position in
4111 Galician is different to that in Spanish (proclitic in Galician and
4112 enclitic in Spanish or vice versa): \emph{envioume} (\texttt{gl})
4113 $\rightarrow$ \emph{me envió} (\texttt{es}); \emph{para nos dicir}
4114 (\texttt{gl}) $\rightarrow$ \emph{para decirnos} (\texttt{es}).
4116 \end{itemize}
\emph{Multiwords} (the different types are described on
page~\pageref{ss:multipalabras}) are processed in a special way in
this module:
4124 \begin{itemize}
4125 \item \emph{Multiwords without inflection}, made of only one lexical
4126 form, do not need any special processing, since they are treated like
4127 other LFs.
4128 \item In the case of \emph{compound multiwords}, that is, multiwords
4129 formed by more than one \emph{lexical form}, each one with its own
4130 grammatical symbols and joined to each other with the element
4131 \texttt{<j>} in the dictionary entry (which corresponds to the symbol
4132 '+' in the data stream), the auxiliary module \texttt{pretransfer}
4133 (see \ref{se:pretransfer}), located before this module, separates the
4134 different lexical forms so that they reach the transfer module as
4135 independent LFs. If we want to join them again so that they reach the
4136 generator as multiwords (as is the case of enclitic pronouns in our
4137 system), it has to be done by means of a transfer rule, using the
4138 \texttt{<\textbf{mlu}>} element (described later, in section
\ref{ss:mlu}). On page~\pageref{regla_verbo2} you can find an example
4140 of a rule for joining enclitic pronouns to the verb.
\item As for \emph{multiwords with inner inflection}, the
\texttt{pre\-trans\-fer} module moves the lemma queue (the invariable
part) to place it after the lemma head (the inflected form), thus
making it possible to find the multiword in the bilingual
dictionary. This kind of multiword must be processed by a structural
transfer rule which puts the lemma queue back in its proper
position. This is done by using, in the output of the rule, the attributes
\texttt{lemh} (``lemma head'') and \texttt{lemq} (``lemma queue'') of the
\texttt{<\textbf{clip}>} element. See page~\pageref{ss:lu} for a more
detailed description of the use of this element, and
page~\pageref{regla_verbo1} to see two rules where these attributes are
used.
4153 \end{itemize}
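
The effect of \texttt{pretransfer} on the last two kinds of multiword
can be sketched as follows. The lexical forms and tags are
illustrative, and the example assumes the usual conventions of the
Apertium data stream, in which \verb!+! joins the lexical forms of a
compound multiword and \verb!#! introduces the lemma queue:

\begin{small}
\begin{verbatim}
compound multiword
  before:  ^dar<vblex><inf>+lo<prn><enc><p3><m><sg>$
  after:   ^dar<vblex><inf>$ ^lo<prn><enc><p3><m><sg>$

multiword with inner inflection
  before:  ^echar<vblex><inf># de menos$
  after:   ^echar# de menos<vblex><inf>$
\end{verbatim}
\end{small}

In the second case the lemma seen by the transfer module is
\emph{echar de menos} as a whole, which is the form that can be looked
up in the bilingual dictionary.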
4156 \subsection{Advanced transfer}
4157 \label{apertium2}
4159 The shallow-transfer architecture described in the previous section is
based, as we have seen, on the automatic handling of word
4161 co-occurrence patterns by means of rules defined by the user. This
4162 model considers two levels from the point of view of the nature of
4163 data: a basic level we call \textit{lexical level}, which handles
words and the tasks of consulting and changing their characteristics
4165 (lemma and tags), besides translating individual lemmas by asking the
4166 bilingual dictionary; and another level we call \textit{word pattern
4167 level}, which is in charge of doing, when applicable, reorderings of
4168 the words that build these patterns, as well as changes in the
4169 properties of words that depend on the specific pattern that has been
4170 detected. All this process of detection and manipulation of words and
4171 patterns is carried out in a single pass.
4173 In contrast, the new advanced transfer architecture is defined as a
4174 transfer system in three levels and three passes. The first two
levels, the lexical and pattern levels, are the same as those of the
4176 shallow-transfer system. The new added level is a level of
4177 \emph{patterns of patterns} of words. The aim of this new processing
4178 level is to allow the handling and interaction of patterns of words in
4179 a similar way as words are handled in the patterns of the shallow
4180 system. With this new structure we intend to achieve a more
4181 appropriate handling of all transformations that may be required when
4182 translating from one language to another. We want to emphasize that
4183 the definition of word patterns in the shallow-transfer system does
4184 not need to be the same as the definition of word patterns in the
advanced system: we intend that, in the latter, patterns have a
\textit{spirit} of phrases that does not exist in the former
system. Therefore we will use the term \textit{chunk} to refer to word
4188 sequences in the advanced transfer system.
4190 The advanced transfer system is organized in three passes. According
4191 to the Apertium processing mode, these three passes are carried out by
4192 three different modules (programs):
4194 \begin{itemize}
4195 \item \texttt{chunker}: identifies chunks, translates word for word,
4196 and carries out required reorderings and morphosyntactic data
4197 propagation inside the chunk (for example, to maintain
4198 agreement). Besides, it creates the chunks that will be processed by
4199 the next module. The \texttt{chunker} has the option of running as a
4200 single module in a shallow-transfer system. This is controlled by an
4201 attribute in the \texttt{<transfer>} element.
4204 \item \texttt{interchunk}: this module receives the chunks generated
4205 by the \texttt{chunker} and is able to reorder them, modify the
4206 ``syntactic information'' associated to each chunk and, finally,
4207 output the chunks in the new order and with the new properties,
4208 creating new chunks if needed.
4209 \item \texttt{postchunk}: it receives the chunks modified by the
4210 interchunk and carries out final tasks concerning modification of the
4211 words contained in each chunk and printing of the text contained in
4212 chunks in the format accepted by the generator.
4213 \end{itemize}
4216 In the following lines we specify the format of the chunks that
4217 circulate between the modules of the transfer system (Section
4218 \ref{sec:format}) and the letter case handling in chunks (Section
4219 \ref{ss:majuscules}), which is different from case handling of
4220 individual lexical forms in a shallow-transfer system.
The following section, \ref{formatotransfer}, describes the format of
transfer rules, which is the same for the three modules and the two
transfer modes, with minor differences. Finally, after this
description, Section~\ref{noutransfer} gives a more detailed
explanation of the three modules that make up an advanced transfer
system.
4233 \subsubsection{Chunk format}
4234 \label{sec:format}
4237 Communication between \texttt{chunker} and \texttt{interchunk}, as
4238 well as between \texttt{interchunk} and \texttt{postchunk}, is
performed through sequences of chunks. We define $C$ as a
\emph{sequence of chunks}, which has the form:
\[
C=b_{0}c_{1}b_{1}c_{2}b_{2} \ldots b_{k-1}c_{k}b_{k}
\]
where each $b_i$ is a \textit{superblank}, and each $c_i$ is a
\emph{chunk}. A chunk $c$ is defined as a string
4247 \verb!^!$F$\verb!{!$W$\verb!}$! that contains the following
4248 information:
4250 \begin{itemize}
4251 \item $F$ is the \emph{lexical pseudoform}\nota{help: pseudoforma
4252 lèxica = lexical pseudoform or pseudolexical form}; it is a string
4253 that has the form $fE$, where $f$ is the \textit{pseudolemma} of the
4254 chunk, and $E=e_{1}e_{2} \ldots$ is a sequence of grammatical symbols
called \emph{chunk symbols}. Changing these symbols changes
the morphological information of the words in the chunk, if
this information is linked to these symbols.
4258 \item $W=b_{0}w_{1}b_{1}w_{2}b_{2} \ldots w_{k}b_{k}$ is the sequence
4259 of words $w_i$ sent by the chunker with the intermediate
4260 \textit{superblanks} $b_i$. These words have the same format in both
4261 transfer systems, that is, an individual word
4262 $w_i=$\verb!^!$l_{i}E_{i}$\verb!$! contains lemma $l_i$ and
4263 grammatical symbols $E_i$, some of which can be \emph{references or links
4264 to the symbols} of the chunk and are identified with natural numbers
4265 \texttt{<1>}, \texttt{<2>}, \texttt{<3>}, etc. These references to
4266 symbols correspond, in the specified order, to the symbols of $E$.
4267 \end{itemize}
The following is an example of the described format, with the text
\emph{el gat} (``the cat''):
4272 \begin{small}
4273 \begin{alltt}
4274 \verb!^!det_nom<SN><m><pl>\verb!{^!el<det><def><2><3>$[
4275 <a href="http://www.ua.es">]^gat<n><2><3>$\verb!}$![</a>]
4276 \end{alltt}
4277 \end{small}
4279 The characters \verb!{! and \verb!}!, if present in the original text,
4280 must be escaped with a backslash \verb!\!.
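
As a worked illustration of how the references are resolved, take the
chunk of the example above without its superblanks. With
$E=$~\texttt{<SN><m><pl>}, the reference \texttt{<2>} stands for
\texttt{<m>} and \texttt{<3>} for \texttt{<pl>}. This is only a sketch
of the linking mechanism described above, not the literal output of
any module:

\begin{small}
\begin{verbatim}
chunk:   ^det_nom<SN><m><pl>{^el<det><def><2><3>$ ^gat<n><2><3>$}$
words:   ^el<det><def><m><pl>$ ^gat<n><m><pl>$
\end{verbatim}
\end{small}

If the third chunk symbol were changed from \texttt{<pl>} to
\texttt{<sg>}, both words would become singular without being edited
individually:

\begin{small}
\begin{verbatim}
chunk:   ^det_nom<SN><m><sg>{^el<det><def><2><3>$ ^gat<n><2><3>$}$
words:   ^el<det><def><m><sg>$ ^gat<n><m><sg>$
\end{verbatim}
\end{small}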
4282 \subsubsection{Letter case handling}
4283 \label{ss:majuscules}
4285 For each chunk, the case of words is determined by the case of the
4286 pseudolemma of the chunk, taking into account the following rules:
4288 \begin{itemize}
4290 \item When all the letters of the pseudolemma are in lower case: the
4291 case state of words is not modified.
4292 \item When the first letter of the pseudolemma is in upper case and
4293 the rest are in lower case: in the module \texttt{postchunk}, when
4294 words are printed, the letter that is the first of the chunk after all
4295 the possible word reorderings will be put in upper case \nota{MG: and
4296 the rest will be put in lower case except proper nouns? is this
4297 correct?}.
4298 \item When all the letters of the pseudolemma are in upper case: all
4299 the words will remain upper case.
4300 \end{itemize}
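
The three cases can be illustrated with a chunk that contains the
words \emph{el} and \emph{gat}, both in lower case. The pseudolemmas
are hypothetical, and the right-hand column shows the text that would
be printed under the rules above:

\begin{center}
\begin{tabular}{ll}
Pseudolemma & Printed text \\ \hline
\texttt{det\_nom} & el gat \\
\texttt{Det\_nom} & El gat \\
\texttt{DET\_NOM} & EL GAT \\
\end{tabular}
\end{center}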
4303 It is required that the words in the chunk are not capitalized unless
4304 they are proper nouns, so as to avoid the postchunk module having to
4305 look for the word that has to lose capitalization, if this is the
4306 case\nota{MG: I am not sure I understand this}. This task belongs to
4307 the \texttt{chunker} module and is done with a macro or similar
4308 mechanism.
4311 %\settocdepth{subsection}
4312 \subsection{Format specification for structural transfer rules}
4313 \label{formatotransfer}
4316 This section describes the format in which structural transfer rules
4317 are written. In the Appendix, in sections~\ref{ss:dtdtransfer},
4318 \ref{ss:dtdinterchunk} and \ref{ss:dtdpostchunk}, there is the formal
4319 definition (DTD).
4321 Structural transfer rules files have two well-differentiated parts:
4322 one for the declaration of the elements to be used in rules, and
4323 another one for the rules themselves.\\
4326 In the \textbf{declaration} part we find:
4328 \begin{itemize}
4330 \item A series of declarations of \emph{lexical categories}, which
4331 specify those lexical forms that will be treated as a particular
category and will be detected by patterns. The linguist may include any data about the lexical form
to define a category; categories can be very generic (e.g.\ all the
nouns) or very specific (e.g.\ only those determiners that are
demonstrative feminine plural).
4336 \item A series of declarations of the \emph{attributes} we want to
4337 detect in lexical forms (like \emph{gender}, \emph{number},
4338 \emph{person} or \emph{tense}), to perform with them the required
4339 transformation operations and send the resulting data in the output of
4340 the rules. The declaration of an attribute contains the name of the
4341 attribute and the possible values it can take in a lexical form (in
4342 general they correspond to the morphological attributes that
4343 characterize the form): for example, the attribute \emph{number} can
4344 take the values \emph{singular}, \emph{plural}, \emph{singular-plural}
4345 (for invariable lexical forms, like \emph{crisis} in Spanish) and
\emph{number to be determined} (for TL lexical forms with different
forms for \emph{singular}--\emph{plural}, but whose number cannot be
determined in the translation because the SL lexical form
is invariable in number; see the explanation on page \pageref{pg:GD}). If
4350 inside the rule, outside of the pattern, one wishes to refer to any of
4351 the lexical categories defined in the previous point (to perform tests
4352 or actions on them), it will be also necessary to define attributes
4353 for them.
4355 \item A series of declarations of \emph{global variables}, which are
4356 used to transfer values of active attributes inside a rule, or from
4357 one rule to the ones applied subsequently.
4359 \item A section for the \textit{definition of string lists}, generally
4360 lists of lemmas, which will be used to make searches on them for a certain value
4361 to perform a specific transformation.
4363 \item A series of declarations of \emph{macro-instructions};
4364 macro-instructions contain sequences of frequently used instructions,
4365 and can be included in different rules (for example, a
4366 macro-instruction to ensure gender and number agreement between two
4367 lexical forms of a pattern).
4369 \end{itemize}
4371 In the \textbf{structural transfer rules} we find:
4373 \begin{itemize}
4374 \item The definition of the pattern that will be detected, specified
4375 as a sequence of lexical categories as they have been defined in the
declaration part. Note that, if a sequence of lexical
forms matches two different rules, the rule with the longest pattern
is chosen; among rules whose patterns have the same length, the one
defined first is chosen.
4381 \item The process part of the rules, where actions to be performed on
4382 SLLF are specified, and the TL pattern is built.
\end{itemize} \nota{Let us make sure that all acronyms are
defined}
4387 In the following pages we describe in detail the characteristics of
4388 all the elements used in rules.
4391 \subsubsection{Element \texttt{<transfer>}}
4393 (\textit{Only in the chunker module})
4395 This is the root element of the \texttt{chunker} module and contains
4396 all the rest of the elements of the structural transfer rules file of
4397 this module.
4399 Its attribute \texttt{default} can take two values:
4400 \begin{itemize}
\item \texttt{lu}: it means that it will run in shallow mode, that is,
as the only transfer module in a shallow-transfer system and, therefore,
no special action will be taken on words not detected by any pattern.
4406 \item \texttt{chunk}: it means that it will run in advanced mode and,
4407 therefore, when a word is not recognized by any rule, a chunk will be
4408 created to encapsulate it, so that it can be processed by the next
4409 transfer modules of an advanced transfer system.
4411 \end{itemize}
4413 The default value is \texttt{lu}.
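
As a minimal illustration of the above (a sketch, not taken from any
particular language pair), the root element of a rules file intended
for an advanced transfer system could open as follows; the comment
stands for the declaration sections and the rules described in the
rest of this section:

\begin{small}
\begin{alltt}
<\textbf{transfer} \textsl{default}="chunk">
  <!-- declaration sections and rules section -->
</\textbf{transfer}>
\end{alltt}
\end{small}
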
4415 \subsubsection{Element \texttt{<interchunk>}}
4417 (\textit{Only in interchunk})
4419 This is the root element of the \texttt{interchunk} module and
4420 contains all the rest of the elements of the structural transfer rules
4421 file of this module.
4424 \subsubsection{Element \texttt{<postchunk>}}
4427 (\textit{Only in postchunk})
4429 This is the root element of the \texttt{postchunk} module and contains
4430 all the rest of the elements of the structural transfer rules file of
4431 this module.
4435 \subsubsection{Element for category definition section
\\\texttt{<section-def-cats>}} \nota{Beware of the polysemous use of the word
\emph{category} in the document}
4439 This section contains the definition of the lexical categories that
4440 will be used to create the patterns used in rules. Each definition is
4441 made with a \texttt{<\textbf{def-cat}>}.
4445 \subsubsection{Element for category definition \texttt{<def-cat>}}
4447 Each category definition has a mandatory name \texttt{n}
4448 (e.g. \texttt{det}, \texttt{adv}, \texttt{prep}, etc.) and a list of
4449 categories (\texttt{<\textbf{cat-item}>}) that define it. The name of
the category cannot contain accented characters.
4453 \subsubsection{Element for category \texttt{<cat-item>}}
4456 This element has two well-differentiated uses depending on the module
4457 it is used in.
4459 \paragraph{Use in chunker (shallow transfer and advanced transfer)}
4462 This element defines the lexical categories that will be used in
4463 patterns, that is, that the linguist wishes to detect in the source
text. These categories are defined by a subsequence of the fine tags
(see the definition on page~\pageref{ss:introtagger}) delivered by both
the morphological analyser and the tagger\footnote{Please note that
4467 throughout the different linguistic modules, different lexical
4468 categorizations are used: in morphological dictionaries, lemmas are
4469 accompanied by a fine tag (for instance, \texttt{\emph{<n><m><pl>}}
4470 for plural masculine nouns); the POS tagger groups these fine tags in
4471 more general tags (for instance, the category \texttt{NOUN} for all
4472 the nouns), although its output is again the whole fine tag of each
4473 LF; finally, in the transfer module, the fine tags of LFs are grouped
4474 again in more general categories (although it is also possible to
4475 define particularized categories) depending on the type of lexical
4476 forms that one wants to detect in patterns.}.
4478 Each \texttt{<\textbf{cat-item}>} element has a mandatory attribute
4479 \texttt{tags} whose value is a sequence of grammatical symbols
4480 separated by a dot; this sequence is a subsequence of the fine tag,
4481 that is, of the sequence of grammatical symbols that defines every
4482 possible lexical form delivered by the tagger. According to this, a
4483 category represents a certain set of lexical forms. We must define as
4484 many different categories as kinds of lexical forms we want to detect
4485 in patterns. Thus, if we want to detect all the nouns to perform
4486 certain actions on them, we will create a category defined with the
4487 grammatical symbol \texttt{n}. On the other hand, if we want to detect
4488 all the plural feminine nouns, we will have to define a category using
the symbols \texttt{n}, \texttt{f} and \texttt{pl}.
4493 When, for the set of lemmas we want to include in a category, a
4494 grammatical symbol used to define the category is followed by other
4495 grammatical symbols, the character \texttt{"*"} is used. For example,
4496 \texttt{tags}=\texttt{"n.*"} covers all the lexical forms that contain
4497 this symbol, such as the Spanish nouns \texttt{casa<n><f><pl>} or
\texttt{coche<n><m><sg>}. On the other hand, when no other symbol
can follow the one used, the asterisk is not
included: for example, \texttt{tags}=\texttt{"}\texttt{adv"} will
cover all adverbs, since in our system they are characterized with
only one grammatical symbol. The asterisk can also be used to signal
the existence of preceding symbols: \texttt{tags}=\texttt{"*.f.*"}
includes all feminine lexical forms, whatever their category.
Furthermore, an optional attribute, \texttt{lemma}, can be used
to define lexical forms on the basis of their lemma (see Figure
\ref{fig:cat-item}).
4511 \begin{figure}
4512 \begin{small}
4513 \begin{alltt}
<\textbf{def-cat} \textsl{n}="nom">
  <\textbf{cat-item} \textsl{tags}="n.*"/>
</\textbf{def-cat}>

<\textbf{def-cat} \textsl{n}="que">
  <\textbf{cat-item} \textsl{lemma}="que" \textsl{tags}="cnjsub"/>
  <\textbf{cat-item} \textsl{lemma}="que" \textsl{tags}="rel.an.mf.sp"/>
</\textbf{def-cat}>
4522 \end{alltt}
4523 \end{small}
4524 \caption{Use of the \texttt{<\textbf{cat-item}>} element to define two
4525 categories, one for nouns without lemma specification (\emph{nom}),
4526 which includes all lexical forms whose first grammatical symbol is
4527 \emph{n}, and another one with associated lemma (\emph{que}), which
4528 has two subsequences of fine tags, to include the \emph{que}
4529 conjunction and the \emph{que} relative pronoun.}
4530 \label{fig:cat-item}
4531 \end{figure}
4534 \paragraph{Use in interchunk}
It is used as in the \texttt{chunker} module, but here, instead of
4538 being defined with the grammatical symbols of lexical forms, it is
4539 defined with the symbols of the chunks delivered by the
4540 \texttt{chunker}. For example, in the case that we want to define a
4541 category to detect all the determined noun phrases, we will define it
4542 with the symbols \texttt{NP} and \texttt{DET} if this is how we tagged
4543 these chunks by means of the \texttt{<tag>} instructions contained in
4544 the \texttt{<chunk>} element (see \ref{ss:chunker}). You can also use
4545 the optional attribute \texttt{lemma} to refer to the
4546 \emph{pseudolemma} of the chunk. So, its formal characteristics are
4547 the same in the modules \texttt{chunker} and \texttt{interchunk}, with
4548 the difference that in the former they are used to detect lexical
4549 forms, and in the latter, to detect chunks.
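
For instance, assuming that the \texttt{chunker} tags determined noun
phrases with the symbols \texttt{NP} and \texttt{DET}, such a category
could be sketched as shown below. The category name \texttt{np-det} is
hypothetical and chosen only for this illustration; by analogy with
the use in the \texttt{chunker}, a trailing asterisk
(\texttt{"NP.DET.*"}) would be needed if the chunks carry further
symbols after \texttt{DET}.

\begin{small}
\begin{alltt}
<\textbf{def-cat} \textsl{n}="np-det">
  <\textbf{cat-item} \textsl{tags}="NP.DET"/>
</\textbf{def-cat}>
\end{alltt}
\end{small}
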
4552 \paragraph{Use in postchunk}
4554 In this module, this element only has the mandatory attribute
4555 \texttt{name}, which refers to the name of the chunk,
\nota{MG: it used to say `to the name of the rule'; comment by mlf: of the rule
or of the pattern?} without tags, since in the \texttt{postchunk} module
4559 only the pseudolemma (name of the chunk) is used for detection. Case
4560 is ignored in detection, because the pseudolemma is used to convey
4561 information about the case of the chunk. (See Figure
4562 \ref{fig:cat-item-postchunk}).
4564 \begin{figure}
4565 \begin{small}
4566 \begin{alltt}
<\textbf{def-cat} \textsl{n}="det-nom">
  <\textbf{cat-item} \textsl{name}="det-nom"/>
</\textbf{def-cat}>
4570 \end{alltt}
4571 \end{small}
4572 \caption{Use of the \texttt{<\textbf{cat-item}>} element in the
4573 postchunk to detect chunks of determiner-noun.}
4574 \label{fig:cat-item-postchunk}
4575 \end{figure}
4579 \subsubsection{Element for category attribute definition section
4580 \\\texttt{<section-def-attrs>}}
This section describes the attributes that will be extracted
4584 from the categories detected by the pattern and that will be used in
4585 the action part of the rules. Each attribute is defined by a
4586 \texttt{<\textbf{def-attr}>} tag.
\nota{Sometimes tags appear in the text in boldface and
sometimes without it. Let us settle on one typography and use it
throughout the document.}
4593 \subsubsection{Element for category attribute definition
4594 \\\texttt{<def-attr>}}
4596 Each \texttt{<\textbf{def-attr}>} defines an attribute regarding
4597 morphological information (both inflection information --gender,
number, person, etc.--, and categorial --verb, adjective, etc.--) by
4599 specifying a list of category attribute
4600 (\texttt{<\textbf{attr-item}>}) elements, and has a mandatory unique
4601 name \texttt{n}. Therefore, an attribute is defined on the basis of
4602 the grammatical symbols that can be found in a given lexical
4603 form. Each attribute extracts, from the lexical forms of the pattern,
4604 the symbols that these contain among the set of possible values
4605 defined.
4607 \subsubsection{Element for category attribute \texttt{<attr-item>}}
4609 Each category attribute element represents one of the possible values
4610 the attribute can take. For example, the attribute for number
4611 \texttt{nbr} can take the values singular \texttt{sg}, plural
4612 \texttt{pl}, singular--plural \texttt{sp} and number to be determined
4613 \texttt{ND}. These values are a subsequence of the morphological tags
4614 that characterize each lexical form, and are specified in the
4615 \texttt{tags} attribute of the element, separated by a dot if there is
4616 more than one. In Figure \ref{fig:attr-item} you can find an example
for the attributes for \emph{number} and \emph{noun}. \nota{Perhaps
we should explain why the name \emph{a\_nom} was chosen in the
figure}
4621 Compare the definition of the attribute for number in this figure
4622 (with all possible values and without asterisks) with the definition
4623 of the category for noun in Figure \ref{fig:cat-item}.
4627 \begin{figure}
4628 \begin{small}
4629 \begin{alltt}
<\textbf{def-attr} \textsl{n}="nbr">
  <\textbf{attr-item} \textsl{tags}="sg"/>
  <\textbf{attr-item} \textsl{tags}="pl"/>
  <\textbf{attr-item} \textsl{tags}="sp"/>
  <\textbf{attr-item} \textsl{tags}="ND"/>
</\textbf{def-attr}>

<\textbf{def-attr} \textsl{n}="a_nom">
  <\textbf{attr-item} \textsl{tags}="n"/>
  <\textbf{attr-item} \textsl{tags}="n.acr"/>
</\textbf{def-attr}>
4642 \end{alltt}
4643 \end{small}
4644 \caption{Definition of the category attribute \texttt{nbr} for
4645 \emph{number}, which can take the values \emph{singular},
4646 \emph{plural}, \emph{singular-plural} or
4647 \emph{number to be determined}, and the category attribute
4648 \texttt{a\_nom} for \emph{noun}, which can take the values of the
4649 symbols \emph{n} or \emph{n acr}.}
4650 \label{fig:attr-item}
4651 \end{figure}
4654 \subsubsection{Element for variable definition section
4655 \\\texttt{<section-def-vars>}}
In this section, \texttt{<\textbf{def-var}>} tags are used to define
global string variables, which will be used to transfer information
inside a rule and from one rule to another (for example, to
transmit information on gender or number between two patterns).

\nota{Let it be clear that this transfer from one rule to another only
happens from one application of a rule to a later application of
another rule, or from left to right}
4667 \subsubsection{Element for variable definition \texttt{<def-var>}}
4668 \label{ss:defvar} The definition of a global string variable has a
4669 mandatory unique name \texttt{n} that will be used to refer to it
4670 inside a rule. Variables contain strings that describe state
4671 information, such as the existence of agreement between two elements,
4672 the detection of a question mark in SL that should be deleted in TL,
4673 etc.
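
The following sketch shows how such variables might be declared. The
variable names \texttt{number} and \texttt{question} are hypothetical
and only meant to suggest the kinds of state information mentioned
above (number agreement and the detection of a question mark).

\begin{small}
\begin{alltt}
<\textbf{section-def-vars}>
  <\textbf{def-var} \textsl{n}="number"/>
  <\textbf{def-var} \textsl{n}="question"/>
</\textbf{section-def-vars}>
\end{alltt}
\end{small}
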
4676 \subsubsection{Element for string lists definition section
4677 \\\texttt{<section-def-lists>}} In this section, lists are defined
4678 (with \texttt{<\textbf{def-list}>} tags) that will be used to do
4679 string searches. These lists can be used to group word lemmas that
have a common feature (e.g.\ verbs expressing movement, adjectives
4681 expressing emotions, etc.). This section is optional.
4683 \subsubsection{Element for string lists definition
4684 \texttt{<def-list>}} This element is used to name the string list,
4685 with the attribute \texttt{n}, and to encapsulate the list defined by
4686 one or more \texttt{<\textbf{list-item}>} elements. An example of its
4687 use can be found in Figure \ref{fig:deflist}.
4689 \subsubsection{Element for string list item \texttt{<list-item>}} It
4690 defines, with the value of the attribute \texttt{v}, the specific
4691 string that is included in the definition of the list. An example of
4692 its use can be found in Figure \ref{fig:deflist}.
4697 \begin{figure}
4698 \begin{small}
4699 \begin{alltt}
4700 <\textbf{def-list} n="verbos_est">
4701 <\textbf{list-item} v="actuar"/>
4702 <\textbf{list-item} v="buscar"/>
4703 <\textbf{list-item} v="estudiar"/>
4704 <\textbf{list-item} v="existir"/>
4705 <\textbf{list-item} v="ingressar"/>
4706 <\textbf{list-item} v="introduir"/>
4707 <\textbf{list-item} v="penetrar"/>
4708 <\textbf{list-item} v="publicar"/>
4709 <\textbf{list-item} v="treballar"/>
4710 <\textbf{list-item} v="viure"/>
4711 <\textbf{/def-list}>
4712 \end{alltt}
4713 \end{small}
4714 \caption{Definition of a list of Catalan lemmas. These lemmas are used
4715 in the rule in Figure \ref{fig:in}.}
4716 \label{fig:deflist}
4717 \end{figure}
4720 \subsubsection{Element for macro-instruction definition section
4721 \\\texttt{<section-def-macros>}}
4723 This section is for the definition of macro-instructions that contain
4724 pieces of code used frequently in the action part of the rules.
4726 \subsubsection{Element for macro-instruction definition
4727 \texttt{<def-macro>}}
4729 Each macro-instruction definition has a mandatory name (the value of
4730 the attribute \texttt{n}), the number of arguments it receives
4731 (attribute \texttt{npar}) and a body with instructions.
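
As an illustration only, a macro-instruction with two arguments could
look like the sketch below. The name \texttt{copy\_nbr} and the body
are hypothetical and do not reproduce the macro-instructions of any
actual language pair; the sketch also assumes that, inside the body,
the \texttt{pos} values refer to the arguments in the order in which
they are given in the call, and that an attribute \texttt{nbr} has
been declared as in Figure \ref{fig:attr-item}. When the number of the
second argument is still to be determined, it is copied from the first
argument.

\begin{small}
\begin{alltt}
<\textbf{def-macro} \textsl{n}="copy_nbr" \textsl{npar}="2">
  <\textbf{choose}>
    <\textbf{when}>
      <\textbf{test}>
        <\textbf{equal}>
          <\textbf{clip} \textsl{pos}="2" \textsl{side}="tl" \textsl{part}="nbr"/>
          <\textbf{lit-tag} \textsl{v}="ND"/>
        </\textbf{equal}>
      </\textbf{test}>
      <\textbf{let}>
        <\textbf{clip} \textsl{pos}="2" \textsl{side}="tl" \textsl{part}="nbr"/>
        <\textbf{clip} \textsl{pos}="1" \textsl{side}="tl" \textsl{part}="nbr"/>
      </\textbf{let}>
    </\textbf{when}>
  </\textbf{choose}>
</\textbf{def-macro}>
\end{alltt}
\end{small}
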
4734 \subsubsection{Element for rules section \texttt{<section-rules>}}
4736 This section contains the structural transfer rules, each one in a
4737 \texttt{<\textbf{rule}>} element.
4739 \subsubsection{Element for rule \texttt{<rule>}}
4741 Each rule has a pattern (\texttt{<\textbf{pattern}>}) and the
4742 associated action (\texttt{<\textbf{action}>}) performed when the
4743 pattern is matched.
The rule can have an optional attribute \texttt{comment} containing a
comment, which usually describes the function of the rule.
4748 \subsubsection{Element for pattern \texttt{<pattern>}}
4750 A pattern is specified using pattern items
4751 (\texttt{<\textbf{pattern-\\item}>}), each one of which corresponds to
4752 a lexical form in the matched pattern, in order of appearance.
4754 \subsubsection{Element for pattern constituent
4755 \texttt{<pattern-item>}}
4757 Each pattern item specifies, in the attribute with mandatory name
4758 \texttt{n}, which kind of lexical form is to be matched. To do that,
4759 one has to use the categories defined in
4760 \texttt{<\textbf{section-def-cats}>} (see in Figure \ref{fig:regla}
the definition of a pattern for determiner--noun).
4764 \subsubsection{Element for action \texttt{<action>}}
4766 This element contains the ``instructions'' that have to be executed to
4767 process as desired each matched pattern.
4769 The processing part for matched patterns is a block of zero or more
4770 instructions of the kind: \texttt{<\textbf{choose}>} (conditional
4771 processing), \texttt{<\textbf{let}>} (value assignment),
4772 \texttt{<\textbf{out}>} (print TL lexical forms),
4773 \texttt{<\textbf{modify-case}>} (modify case state of a lexical form),
4774 \texttt{<\textbf{call-macro}>} (call a macro-instruction) and
4775 \texttt{<\textbf{append}>} (concatenate strings).
During the processing step, depending on whether a series of
conditional options are met or not, different operations are carried
out, such as creating agreement between pattern components, necessary
when these undergo gender or number changes in the lexical transfer
process. To do this, although the operations work on TLLF, the SL
information is also taken into account, since, for example, if pattern
components do not agree in SL, maybe they do not have to agree in TL
either. As a consequence of the application of the different
4786 operations in a pattern, values are assigned to pattern attributes
4787 and, if applicable, to global or state variables, and the information
4788 on the resulting TL pattern is sent to the next module (the
4789 morphological generator in a shallow-transfer system, or the next
4790 transfer module in an advanced transfer system).
4793 \subsubsection{Element for macro-instruction call
4794 \texttt{<call-macro>}}
4796 In a rule it is possible to call any of the macro-instructions defined
4797 in \texttt{<\textbf{section-def-macros}>}. To do this, one has to
4798 specify the name of the macro-instruction in the \texttt{n} attribute,
4799 and one or more arguments in the parameter element
4800 \texttt{<\textbf{with-param}>} (see next).
4802 \subsubsection{Element for parameters \texttt{<with-param>}}
4804 This element is used inside a macro-instruction call
4805 \texttt{<\textbf{call-macro}>}. The \texttt{pos} attribute of an
4806 argument is used to refer to a lexical form of the rule from where the
4807 macro-instruction is called. For example, if a macro-instruction with
4808 2 parameters has been defined, to make agreement operations between
4809 noun--adjective, it can be used with arguments 1 and 2 in a rule for
4810 noun--adjective, with arguments 2 and 3 in a rule for
4811 determiner--noun--adjective, with arguments 1 and 3 in a rule for
4812 noun--adverb--adjective and with arguments 2 and 1 in a rule for
4813 adjective--noun. You can see an example of macro-instruction call in
4814 Figure \ref{fig:macro}.
4816 \begin{figure}
4817 \begin{small}
4818 \begin{alltt}
4819 <\textbf{call-macro} n="f_concord2">
4820 <\textbf{with-param} pos="3"/>
4821 <\textbf{with-param} pos="1"/>
4822 <\textbf{/call-macro}>
4823 \end{alltt}
4824 \end{small}
\caption{Call of the macro-instruction \texttt{f\_concord2} designed to
4826 create agreement between two elements in a pattern such as
4827 determiner--adverb--noun. Propagation of gender and number is done
4828 from one of the components, in this case, from the noun which is the
4829 third element of the pattern (3). Therefore, the position of the noun
4830 is the first parameter given, and the other parameters come
4831 next. Since the adverb (in position 2) does not need agreement
4832 information, only the position of the determiner is specified (1).}
4833 \label{fig:macro}
4834 \end{figure}
4838 \subsubsection{Element for selection \texttt{<choose>}}
4839 \label{choose}
4841 The selection instruction consists of one or more conditional options
4842 (\texttt{<\textbf{when}>}) and an alternative option
4843 \texttt{<\textbf{otherwise}>}, which is optional.
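
The overall shape of a selection instruction can be sketched as
follows. The comments are placeholders for real conditions and
instructions, and the sketch assumes that the options are tried in the
order in which they are written, as in comparable constructs of other
XML languages.

\begin{small}
\begin{alltt}
<\textbf{choose}>
  <\textbf{when}>
    <\textbf{test}> <!-- first condition --> </\textbf{test}>
    <!-- instructions executed if the first condition is met -->
  </\textbf{when}>
  <\textbf{when}>
    <\textbf{test}> <!-- second condition --> </\textbf{test}>
    <!-- instructions executed if the second condition is met -->
  </\textbf{when}>
  <\textbf{otherwise}>
    <!-- instructions executed if no condition is met -->
  </\textbf{otherwise}>
</\textbf{choose}>
\end{alltt}
\end{small}
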
4846 \subsubsection{Element for condition \texttt{<when>}}
4848 This element describes a conditional option (see Section
4849 \ref{choose}). It contains the condition to be tested
4850 \texttt{<\textbf{test}>} and one block of zero or more instructions of
4851 the kind \texttt{<\textbf{choose}>}, \texttt{<\textbf{let}>},
4852 \texttt{<\textbf{out}>}, \texttt{<\textbf{modify-case}>},
4853 \texttt{<\textbf{call-macro}>} or \texttt{<\textbf{append}>}, \nota{OK
4854 append?} which will be executed if the above condition is met.
4856 \subsubsection{Element for alternative option \texttt{<otherwise>}}
4858 The element \texttt{<\textbf{otherwise}>} contains one block of one or
4859 more instructions (of the kind \texttt{<\textbf{choose}>},
4860 \texttt{<\textbf{let}>}, \texttt{<\textbf{out}>},
4861 \texttt{<\textbf{modify-case}>}, \texttt{<\textbf{call-macro}>} and
4862 \texttt{<\textbf{append}>}) that must be executed if none of the
4863 conditions described in the \texttt{<\textbf{when}>} elements of a
4864 \texttt{<\textbf{choose}>} is met.
4866 \subsubsection{Element for evaluation \texttt{<test>}}
4868 The test element \texttt{<\textbf{test}>} in a condition element
4869 \texttt{<\textbf{when}>} can contain a conjunction
4870 (\texttt{<\textbf{and}>}), a disjunction (\texttt{<\textbf{or}>}) or a
4871 negation (\texttt{<\textbf{not}>}) of conditions to be tested, as well
4872 as a simple condition of string equality (\texttt{<\textbf{equal}>}),
4873 string beginning (\texttt{<\textbf{begins-with}>}), string end
4874 (\texttt{<\textbf{ends-with}>}), substring
4875 (\texttt{<\textbf{contains-substring}>}) or inclusion in a set
4876 (\texttt{<\textbf{in}>}).
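
As a hedged sketch of one of the string conditions listed above, the
following test would check whether the lemma of the first SL lexical
form begins with a given prefix. The prefix \texttt{des} is an
arbitrary example, and the order of the two arguments (the string to
examine first, then the prefix) is an assumption made by analogy with
the other comparison elements.

\begin{small}
\begin{alltt}
<\textbf{test}>
  <\textbf{begins-with}>
    <\textbf{clip} \textsl{pos}="1" \textsl{side}="sl" \textsl{part}="lem"/>
    <\textbf{lit} \textsl{v}="des"/>
  </\textbf{begins-with}>
</\textbf{test}>
\end{alltt}
\end{small}
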
\nota{The wording of the last paragraph can surely be improved; it was
changed by mlf so that all the simple boolean conditions are
included.}
4882 \subsubsection{Elements for conditional or boolean operators:
4883 \texttt{<equal>}, \texttt{<and>}, \texttt{<or>}, \texttt{<not>},
4884 \texttt{<in>}}
4886 \nota{To be completed: add \texttt{contains-substring},
4887 \texttt{ends-with}, \texttt{begins-with}, etc.}
4889 \begin{itemize}
4891 \item The conjunction element \texttt{<\textbf{and}>} represents a
4892 condition, consisting of two or more conditions, that is met when all
4893 included conditions are true. An example of its use can be found in
4894 Figure \ref{fig:regla}.
4896 \item The disjunction element \texttt{<\textbf{or}>} represents a
4897 condition, consisting of two or more conditions, that is met when at
4898 least one of the included conditions is true. Figure \ref{fig:ornot}
4899 displays an example of this condition type used when testing gender
4900 agreement in a SL pattern.
4902 \item The negation element \texttt{<\textbf{not}>} represents a
4903 condition that is met when the included condition is not met, and vice
4904 versa. An example of negation of an equality can be found in Figure
4905 \ref{fig:ornot}.
4907 \item The conditional equality operator \texttt{<\textbf{equal}>} is
4908 an instruction that evaluates if two arguments (two strings) are
4909 identical or not. See examples of its use in Figures \ref{fig:clip}
4910 and \ref{fig:lit-tag}. In addition, this operator can have the
4911 attribute \texttt{caseless}, which, when its value is \texttt{yes},
4912 causes the comparison of strings to be made ignoring case. \nota{All
4913 string conditional tests have the attribute \texttt{caseless}; also
4914 \texttt{in} described below}
\item The ``search in lists'' operator \texttt{<\textbf{in}>} is used to
search for any value (specified as the first parameter of the condition)
in a list referred to by the \texttt{n} attribute of the
\texttt{<\textbf{list}>} element; this list must be defined in the
appropriate section (\texttt{<\textbf{section-def-lists}>}). The
4921 search result is true if the value is found in the list. This
4922 comparison can also use the attribute \texttt{caseless}: if its value
4923 is \texttt{yes}, the search is done ignoring case. Figure \ref{fig:in}
4924 shows an example of its use.
4926 \end{itemize}
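
The operators above can be nested. The following sketch (an
illustration, not part of any actual rule) is met when the lemma of
the first SL lexical form is \emph{que}, ignoring case, and the number
of the second TL lexical form is not \texttt{ND}; it assumes that an
attribute \texttt{nbr} has been declared as in Figure
\ref{fig:attr-item}.

\begin{small}
\begin{alltt}
<\textbf{test}>
  <\textbf{and}>
    <\textbf{equal} \textsl{caseless}="yes">
      <\textbf{clip} \textsl{pos}="1" \textsl{side}="sl" \textsl{part}="lem"/>
      <\textbf{lit} \textsl{v}="que"/>
    </\textbf{equal}>
    <\textbf{not}>
      <\textbf{equal}>
        <\textbf{clip} \textsl{pos}="2" \textsl{side}="tl" \textsl{part}="nbr"/>
        <\textbf{lit-tag} \textsl{v}="ND"/>
      </\textbf{equal}>
    </\textbf{not}>
  </\textbf{and}>
</\textbf{test}>
\end{alltt}
\end{small}
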
\nota{The whole preceding discussion should be unified, factoring out what is common.}
\nota{The remaining conditional elements, which are still missing, need to be described.}
4933 \subsubsection{Element \texttt{<clip>}}
4934 \label{ss:clip}
4937 The \texttt{<\textbf{clip}>} element represents a substring of a SL or
4938 TL lexical form, defined by the value of its different attributes (see an
4939 example in Figure \ref{fig:clip}):
4941 \begin{itemize}
4942 \item \texttt{pos} is an index (1, 2, 3, etc.) used to select a
4943 lexical form inside a rule: it refers to the place the lexical form
4944 occupies in the pattern. In the \textit{postchunk} module there is
4945 also the index ``0'', which refers to the pseudolemma of the chunk
4946 \nota{MG: is it not "lexical pseudoform"?}, which is treated as a word
4947 by itself in order to be able to consult its information and make
4948 decisions from this.
4950 \item \texttt{side} \textit{(only in the \texttt{chunker} module)}
4951 specifies if the selected \emph{clip} is from the source language
4952 (\texttt{sl}) or from the target language (\texttt{tl}).
4954 \item \texttt{part} indicates which part of the lexical form is
4955 processed; generally its value is one of the attributes defined in
4956 \texttt{<\textbf{section-def-\\attrs}>} (\texttt{gen}, \texttt{nbr},
4957 etc.), although it can also take four predefined values: \texttt{lem}
4958 (refers to the lemma of the lexical form), \texttt{lemh} (the first
4959 part of a split lemma), \texttt{lemq} (the queue of a split lemma),
4960 and \texttt{whole} (the whole lexical form, including lemma and all
4961 grammatical symbols, which may have been modified in the preceding
4962 part of the rule).
4964 \item \texttt{link-to} \textit{(only in the \texttt{chunker} module in
4965 advanced mode)} replaces the value that would result from consulting
4966 the rest of the attributes of the clip, by the value specified in
  this attribute, which must be a natural number ($>0$). \nota{MG:
    explain the new characteristics - Sergio?} This number indicates
  the \texttt{<\textbf{tag}>} of the \texttt{<\textbf{chunk}>} to which
  the clip content is linked, the number being the position this tag
  occupies inside the element \texttt{<\textbf{tags}>}. The other
4972 attributes of the clip remain only for informational purposes, since
4973 they are overwritten by the value of the linked tag. An example of
4974 its use can be found in Figure \ref{fig:chunkintrachunk}.
4976 \end{itemize}
4979 \begin{figure}
4980 \begin{small}
4981 \begin{alltt}
4982 <\textbf{test}>
4983 <\textbf{not}>
4984 <\textbf{equal}>
4985 <\textbf{clip} \textsl{pos}="2" \textsl{side}="tl" \textsl{part}="gen"/>
4986 <\textbf{clip} \textsl{pos}="2" \textsl{side}="sl" \textsl{part}="gen"/>
4987 <\textbf{/equal}>
4988 <\textbf{/not}>
4989 <\textbf{/test}>
4990 \end{alltt}
4991 \end{small}
4992 \caption{Extract from a rule where it is tested whether the TL
4993 (\texttt{tl}) gender (\texttt{gen}) of the second lexical unit
4994 identified in a pattern is different from the gender of the same
lexical unit in the SL (\texttt{sl}).}
4996 \label{fig:clip}
4997 \end{figure}
5001 \subsubsection{Element for literal string \texttt{<lit>}} This element
5002 is used to specify the value of a literal string by means of the
5003 attribute \texttt{v}. For example, \texttt{<\textbf{lit}
5004 v=\texttt{"}andar\texttt{"}/>} represents the string \emph{andar}.
5007 \subsubsection{Element for tag value \texttt{<lit-tag>}} It is similar
5008 to the \texttt{<\textbf{lit}>} element, with the difference that it
5009 does not specify the value of a literal string but the value of a
5010 grammatical symbol or tag, by means of the attribute \texttt{v}. An
5011 example of its use can be found in Figure \ref{fig:lit-tag}.
5014 \begin{figure}
5015 \begin{small}
5016 \begin{alltt}
5017 <\textbf{equal}>
5018 <\textbf{clip} \textsl{pos}="2" \textsl{side}="tl" \textsl{part}="nbr"/>
5019 <\textbf{lit-tag} \textsl{v}="ND"/>
5020 <\textbf{/equal}>
5021 \end{alltt}
5022 \end{small}
5023 \caption{Use of the element \texttt{<\textbf{lit-tag}>}: it is tested
5024 whether the number (\texttt{nbr}) symbol of the second
5025 lexical unit in the TL (\texttt{tl}) is \texttt{ND} (number to be
determined).}
5027 \label{fig:lit-tag}
5028 \end{figure}
5030 \begin{figure}
5031 \begin{small}
5032 \begin{alltt}
5033 <\textbf{test}>
5034 <\textbf{or}>
5035 <\textbf{not}>
5036 <\textbf{equal}>
5037 <\textbf{clip} \textsl{pos}="1" \textsl{side}="sl" \textsl{part}="gen"/>
5038 <\textbf{clip} \textsl{pos}="3" \textsl{side}="sl" \textsl{part}="gen"/>
5039 <\textbf{/equal}>
5040 <\textbf{/not}>
5041 <\textbf{not}>
5042 <\textbf{equal}>
5043 <\textbf{clip} \textsl{pos}="2" \textsl{side}="sl" \textsl{part}="gen"/>
5044 <\textbf{clip} \textsl{pos}="3" \textsl{side}="sl" \textsl{part}="gen"/>
5045 <\textbf{/equal}>
5046 <\textbf{/not}>
5047 <\textbf{/or}>
5048 <\textbf{/test}>
5049 \end{alltt}
5050 \end{small}
5051 \caption{Extract from a rule where it is tested whether the SL gender
5052 of the first or the second lexical unit matched in a pattern (it
5053 could be, for example, determiner--adjective--noun) is different
5054 from the gender of the third lexical unit also in the SL.}
5055 \label{fig:ornot}
5056 \end{figure}
5060 \subsubsection{Element for variable \texttt{<var>}}
5063 Each \texttt{<\textbf{var}>} is a variable identifier: the mandatory
5064 attribute \texttt{n} specifies its name as has been defined in
5065 \texttt{<\textbf{section-def-vars}>}. When it appears in an
5066 \texttt{<\textbf{out}>}, a \texttt{<\textbf{test}>}, or the right part
5067 of a \texttt{<\textbf{let}>}, it represents the value of the variable;
5068 when it appears on the left side of a \texttt{<\textbf{let}>}, in an
5069 \texttt{<\textbf{append}>} or in a \texttt{<\textbf{modify-case}>}, it
5070 represents the reference of the variable and its value can be changed.
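
The two roles of \texttt{<\textbf{var}>} can be sketched as follows.
The variable name \texttt{number} is hypothetical and would have to be
declared in \texttt{<\textbf{section-def-vars}>}; the attribute
\texttt{nbr} is assumed to be declared as in Figure
\ref{fig:attr-item}. In the first fragment the variable is on the left
side of a \texttt{<\textbf{let}>} and receives a value; in the second
one, which could appear later in the same rule or in another rule, its
value is read inside a test.

\begin{small}
\begin{alltt}
<\textbf{let}>
  <\textbf{var} \textsl{n}="number"/>
  <\textbf{clip} \textsl{pos}="2" \textsl{side}="tl" \textsl{part}="nbr"/>
</\textbf{let}>

<\textbf{test}>
  <\textbf{equal}>
    <\textbf{var} \textsl{n}="number"/>
    <\textbf{lit-tag} \textsl{v}="pl"/>
  </\textbf{equal}>
</\textbf{test}>
\end{alltt}
\end{small}
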
5072 \subsubsection{Element for reference to string list \texttt{<list>}}
5074 This element is only used as the second parameter of a
5075 \texttt{<\textbf{in}>} search. The \texttt{n} attribute refers to the
5076 specific list defined in the string lists definition section
5077 \texttt{<\textbf{section-def-lists}>}. An example of its use can be found in
5078 Figure \ref{fig:in}.
5081 \begin{figure}
5082 \begin{small}
5083 \begin{alltt}
5084 <\textbf{rule}>
5085 <\textbf{pattern}>
5086 <\textbf{pattern-item} \textsl{n}="verb"/>
5087 <\textbf{pattern-item} \textsl{n}="a"/>
5088 <\textbf{/pattern}>
5089 <\textbf{action}>
5090 <\textbf{choose}>
5091 <\textbf{when}>
5092 <\textbf{test}>
<\textbf{in} \textsl{caseless}="yes">
5094 <\textbf{clip} \textsl{pos}="1" \textsl{side}="sl" \textsl{part}="lem"/>
5095 <\textbf{list} \textsl{n}="verbos_est"/>
5096 <\textbf{/in}>
5097 <\textbf{/test}>
5098 <\textbf{let}>
5099 <\textbf{clip} \textsl{pos}="2" \textsl{side}="tl" \textsl{part}="lem"/>
5100 <\textbf{lit} \textsl{v}="en"/>
5101 <\textbf{/let}>
5102 <\textbf{/when}>
5103 <!-- ... -->
5104 \end{alltt}
5105 \end{small}
\caption{Extract of a rule that detects a pattern made of a verb and
  the preposition \emph{a}, and then tests whether the lemma
  (\texttt{lem}) of the verb in the source language (\texttt{sl}) is
  one of the lemmas included in the list of state verbs (defined in
  Figure \ref{fig:deflist}). If so, the lemma of the second word in
  the target language (\texttt{tl}) is changed to \emph{en}.}
5113 \label{fig:in}
5114 \end{figure}
5117 \subsubsection{Element for case application \texttt{<get-case-from>}}
5119 The \texttt{<\textbf{get-case-from}>} element represents the string
5120 obtained after applying the letter case state of the lemma of a SL
5121 lexical unit to a string (\emph{clip}, \emph{lit} or \emph{var}). To
5122 refer to the lexical unit from where the information is taken, the
5123 attribute \texttt{pos} is used, which indicates the position of that
5124 unit in the SL. This element is useful when the lexical units in a
5125 pattern are reordered, or when a lexical unit is added or deleted. You
5126 can see an example of its use in Figure \ref{fig:case}, which displays
5127 a rule to transform the simple perfect preterite tense in Spanish
(\emph{dije}, ``I said'') into the compound form in Catalan (\emph{vaig
  dir}). In this rule, an LF with lemma \emph{anar} and grammatical
symbol \emph{vaux} (``auxiliary verb'') is added; it has to take the
case information from the Spanish verb (which has position ``1'' in the
pattern), so that the system translates \emph{Dije} as \emph{Vaig
  dir}, \emph{dije} as \emph{vaig dir} and \emph{DIJE} as \emph{VAIG
  DIR}.
5137 \subsubsection{Element for case pattern query \texttt{<case-of>}}
It is used to get the case pattern of a string, that is, one of the
values ``\texttt{aa}'', ``\texttt{Aa}'' or ``\texttt{AA}''. It works like the
5141 \texttt{<\textbf{clip}>} element, since it has the same attributes:
5142 \texttt{pos}, the position of the word in the matched pattern;
5143 \texttt{part}, the specific attribute that we refer to (normally the
5144 lemma), which has the predefined attributes described in Section
5145 \ref{ss:clip}, and finally, only in the \texttt{chunker} module, the
5146 attribute \texttt{side}, referring to the translation side,
5147 \texttt{sl} or \texttt{tl}. In Figure \ref{fig:case} you can see this
5148 element in use, and you can find a more detailed description of this
5149 example in the following Section (description of
5150 \texttt{<\textbf{modify-case}>}).
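
As a minimal sketch of a typical use, the case pattern returned by
\texttt{<\textbf{case-of}>} can be stored in a variable so that it can be
consulted later in the rule. The variable name \texttt{caseFirstWord} is
borrowed from Figure \ref{fig:chunkintrachunk}; its declaration in
\texttt{<\textbf{section-def-vars}>} is assumed here.

\begin{alltt}
<!-- store "aa", "Aa" or "AA", according to the TL lemma in position 1 -->
<let>
  <var n="caseFirstWord"/>
  <case-of pos="1" side="tl" part="lem"/>
</let>
\end{alltt}
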
5153 \subsubsection{Element for case modification \texttt{<modify-case>}}
This instruction is used to modify the case of the first parameter
(usually a lemma) by means of the second parameter (a literal or a
variable). The first parameter can be a \texttt{<\textbf{var}>}, a
\texttt{<\textbf{clip}>} or a \texttt{<\textbf{case-of}>}, whereas the
second one can be anything that delivers a value, but in principle it
will be a \texttt{<\textbf{var}>} or a \texttt{<\textbf{lit}>}. The
values that the second parameter can take are usually ``\texttt{Aa}'',
to express that the first parameter (the ``left part'' of this case
modification) must have the first letter in upper case and the rest in
lower case, ``\texttt{aa}'' to put everything in lower case, and
``\texttt{AA}'' to put everything in upper case.

Figure \ref{fig:case} shows a rule where this element is used. In this
rule it modifies the case of the TL lemma in position ``1'', which
corresponds to \emph{dir}: although in the rule output this
verb is the second lexical form (\emph{vaig dir}), it is actually the
translation of the LF which has position 1 in the SL and, therefore,
it retains the same assigned position in the TL. This lemma is
assigned the value ``\texttt{aa}'' when the SL lemma has
the state ``\texttt{Aa}''. Nothing needs to be specified for the other
cases, since the case state of the LF in position 1 will be the
same in the SL and in the TL and, therefore, will be automatically
transferred (see page~\pageref{mayusc} for more information
on letter case handling in dictionaries).
5181 \begin{figure}
5182 \begin{small}
5183 \begin{alltt}
5184 <\textbf{rule}>
5185 <\textbf{pattern}>
5186 <\textbf{pattern-item} n="pretind"/>
5187 <\textbf{/pattern}>
5188 <\textbf{action}>
5189 <\textbf{out}>
5190 <\textbf{lu}>
<\textbf{get-case-from} pos="1">
5192 <\textbf{lit} v="anar"/>
5193 <\textbf{/get-case-from}>
5194 <\textbf{lit-tag} v="vaux"/>
5195 <\textbf{clip} pos="1" side="sl" part="persona"/>
5196 <\textbf{clip} pos="1" side="sl" part="nbr"/>
5197 <\textbf{/lu}>
<\textbf{b}/>
5199 <\textbf{/out}>
5200 <\textbf{choose}>
5201 <\textbf{when}>
5202 <\textbf{test}>
5203 <\textbf{equal}>
5204 <\textbf{case-of} pos="1" side="sl" part="lemh"/>
5205 <\textbf{lit} v="Aa"/>
5206 <\textbf{/equal}>
5207 <\textbf{/test}>
5208 <\textbf{modify-case}>
5209 <\textbf{case-of} pos="1" side="tl" part="lemh"/>
5210 <\textbf{lit} v="aa"/>
5211 <\textbf{/modify-case}>
5212 <\textbf{/when}>
5213 <\textbf{/choose}>
5214 <\textbf{out}>
5215 <\textbf{lu}>
5216 <\textbf{clip} pos="1" side="tl" part="lemh"/>
5217 <\textbf{clip} pos="1" side="tl" part="a_verb"/>
5218 <\textbf{lit-tag} v="inf"/>
5219 <\textbf{clip} pos="1" side="tl" part="lemq"/>
5220 <\textbf{/lu}>
5221 <\textbf{/out}>
5222 <\textbf{/action}>
5223 <\textbf{/rule}>
5224 \end{alltt}
5225 \end{small}
\caption{Rule for the translation from Spanish into Catalan, which
  turns verbs in the simple perfect preterite tense (\emph{dije}) into
  the compound perfect preterite tense usual in Catalan (\emph{vaig dir}),
  and at the same time assigns the appropriate case state
  to the two resulting words.}
5232 \label{fig:case}
5233 \end{figure}
5237 \subsubsection{Element for assignment \texttt{<let>}}
5239 The assignment instruction \texttt{<\textbf{let}>} assigns the value
5240 of the right part of the assignment (a literal string, a
5241 \texttt{clip}, a variable, etc.) to the left part (a \texttt{clip}, a
5242 variable, etc.). An example of its use can be found in Figure
5243 \ref{fig:regla}.
5247 \begin{figure}
5248 \begin{small}
5249 \begin{alltt}
5250 <\textbf{rule}>
5251 <\textbf{pattern}>
5252 <\textbf{pattern-item} n="det"/>
5253 <\textbf{pattern-item} n="nom"/>
5254 <\textbf{/pattern}>
5255 <\textbf{action}>
5256 <\textbf{choose}>
5257 <\textbf{when}>
5258 <\textbf{test}>
5259 <\textbf{and}>
5260 <\textbf{not}>
5261 <\textbf{equal}>
5262 <\textbf{clip} pos="2" side="tl" part="gen"/>
5263 <\textbf{clip} pos="2" side="sl" part="gen"/>
5264 <\textbf{/equal}>
5265 <\textbf{/not}>
5266 <\textbf{not}>
5267 <\textbf{equal}>
5268 <\textbf{clip} pos="2" side="tl" part="gen"/>
5269 <\textbf{lit-tag} v="mf"/>
<\textbf{/equal}>
5271 <\textbf{/not}>
5272 <\textbf{not}>
5273 <\textbf{equal}>
5274 <\textbf{clip} pos="2" side="tl" part="gen"/>
5275 <\textbf{lit-tag} v="GD"/>
5276 <\textbf{/equal}>
5277 <\textbf{/not}>
5278 <\textbf{/and}>
5279 <\textbf{/test}>
5280 <\textbf{let}>
5281 <\textbf{clip} pos="1" side="tl" part="gen"/>
5282 <\textbf{clip} pos="2" side="tl" part="gen"/>
5283 <\textbf{/let}>
5284 <\textbf{/when}>
5285 <\textbf{/choose}>
5286 <!-- Other gender and number agreement actions -->
5287 \end{alltt}
5288 \end{small}
5289 \caption{Extract from a rule for the pattern \texttt{determiner--noun}
5290 (continues in Fig. \ref{fig:regla2}): in this part of the rule, the
5291 gender of the noun is assigned to the determiner in the case that
5292 the gender of the noun changes from the SL (\texttt{sl}) to the TL
5293 (\texttt{tl}) during the lexical transfer process between both
5294 languages.}
5295 \label{fig:regla}
5296 \end{figure}
5298 \subsubsection{Element for string concatenation \texttt{<concat>}}
5300 This element is used to concatenate strings in order to assign them to
5301 a variable. It is used in combination with \texttt{<\textbf{let}>},
5302 and the previous value of the variable is lost with the assignment of
5303 \texttt{<\textbf{concat}>}.
5305 It does not have any attribute. It can contain any instruction that
5306 delivers a string, such as \texttt{<\textbf{lit}>},
5307 \texttt{<\textbf{lit-tag}>} or \texttt{<\textbf{clip}>}.
5309 Figure \ref{fig:concat} shows an example of its use.
5312 \begin{figure}
5313 \begin{small}
5314 \begin{alltt}
5315 <\textbf{let}>
5316 <\textbf{var} n="palabra"/>
5317 <\textbf{concat}>
5318 <\textbf{clip} pos="3" side="tl" part="lem"/>
5319 <\textbf{lit-tag} v="adj"/>
5320 <\textbf{/concat}>
5321 <\textbf{/let}>
5322 \end{alltt}
5323 \end{small}
5324 \caption{In this example, the variable \texttt{palabra} is assigned
5325 the value of the concatenation of a \texttt{clip} (the lemma in
5326 position 3) and the \emph{adj} tag.}
5327 \label{fig:concat}
5328 \end{figure}
5333 \subsubsection{Element for string concatenation \texttt{<append>}}
5335 The \texttt{<\textbf{append}>} instruction can be used to save the
5336 output of an action before printing it in the corresponding
5337 \texttt{<\textbf{out}>}, if required by the designer of the transfer
5338 rules.
5340 The mandatory attribute \texttt{n} specifies the name of the variable
5341 used. After applying the instruction, the previous content of the
5342 referred variable will be the prefix of the new content, that is, the
5343 new content inserted in the \texttt{<\textbf{append}>} will be
5344 concatenated to the pre-existing content of the variable specified in
5345 \texttt{n}.
5347 The content of this instruction can be one or more of the following
5348 tags: \texttt{<\textbf{b}>}, \texttt{<\textbf{clip}>},
5349 \texttt{<\textbf{lit}>}, \texttt{<\textbf{lit-tag}>},
5350 \texttt{<\textbf{var}>}, \texttt{<\textbf{get-case-from}>},
5351 \texttt{<\textbf{case-of}>} or \texttt{<\textbf{concat}>}. There is an
5352 example of its use in Figure \ref{fig:append}.
5354 \begin{figure}
5355 \begin{small}
5356 \begin{alltt}
5357 <\textbf{append} n="temporal">
<\textbf{clip} pos="3" part="gen" side="tl"/>
5359 <\textbf{/append}>
5360 \end{alltt}
5361 \end{small}
\caption{In this example, the gender, in the TL, of the third word
  matched by the rule is appended to the current content of the
  variable \texttt{temporal}.}
5365 \label{fig:append}
5366 \end{figure}
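
The accumulating behaviour described above can be sketched with two
successive \texttt{<\textbf{append}>} instructions. Emptying the variable
beforehand with a \texttt{<\textbf{let}>} and an empty literal is an
assumption made for this illustration, so that the result does not depend on
an earlier value of \texttt{temporal}; the attribute names \texttt{gen} and
\texttt{nbr} are those used in the other figures of this section.

\begin{alltt}
<!-- start from an empty variable (illustrative assumption) -->
<let>
  <var n="temporal"/>
  <lit v=""/>
</let>
<append n="temporal">
  <clip pos="3" side="tl" part="gen"/>
</append>
<append n="temporal">
  <clip pos="3" side="tl" part="nbr"/>
</append>
<!-- temporal now holds the gender tag followed by the number tag -->
\end{alltt}
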
5371 \subsubsection{Element for output \texttt{<out>}}
\label{ss:out} The output instruction is used to specify the lexical
forms that are sent to the output of the module once the required
structural transfer operations have been applied. Its use differs
according to the module. On the one hand, its use in the
\texttt{chunker} module when it runs as the only module (shallow-transfer)
and its use in the \texttt{postchunk} module are similar, since in
both cases the output must be the input for the generator. On the
other hand, the \texttt{chunker} in Apertium 2 and the
\texttt{interchunk} module use it differently: the former to create
the chunks, and the latter to modify the chunks without modifying
their internal part.
5384 \begin{enumerate}
5386 \item \textbf{Use in \texttt{chunker} in shallow-transfer mode, and in
5387 \texttt{postchunk}}
  The instruction sends each lexical form inside a
  \texttt{<\textbf{lu}>} element, which in turn can be contained inside a
  \texttt{<\textbf{mlu}>} element when the output is a multiword made
  of two or more LFs. The blanks or superblanks
  (\texttt{<\textbf{b}>}) between consecutive LFs are also sent. You can find
  examples of its use in Figures \ref{fig:case} and \ref{fig:regla2}.
5396 \begin{figure}
5397 \begin{small}
5398 \begin{alltt}
5399 <!-- ... -->
5400 <\textbf{out}>
5401 <\textbf{lu}>
5402 <\textbf{clip} pos="1" side="tl" part="whole"/>
5403 <\textbf{/lu}>
5404 <\textbf{lu}>
5405 <\textbf{clip} pos="2" side="tl" part="whole"/>
5406 <\textbf{/lu}>
5407 <\textbf{/out}>
5408 <\textbf{/process}>
5409 <\textbf{/action}>
5410 <\textbf{/rule}>
5411 \end{alltt}
5412 \end{small}
\caption{Extract from a rule (continued from Fig. \ref{fig:regla}). At the
  end of the rule, and after different actions, the resulting data are
  sent by means of the attribute \texttt{whole}, which contains the
  lemma and the grammatical symbols of each LF (positions 1 and 2 in
  the pattern).}
5418 \label{fig:regla2}
5419 \end{figure}
5422 \item \textbf{Use in \texttt{chunker} in advanced mode}
5424 The output of this module is expected to be a sequence of one or
5425 more chunks (sent inside a \texttt{<\textbf{chunk}>} element)
5426 separated by blanks \texttt{<\textbf{b}>}. Lexical forms and
  multiforms, as well as the blanks between them, are sent inside
  chunks. Figure \ref{fig:chunkintrachunk} shows an example of its
  use.
5432 \begin{figure}
5433 \begin{small}
5434 \begin{alltt}
5435 <\textbf{out}>
5436 <\textbf{chunk} name="pr" case="caseFirstWord">
5437 <\textbf{tags}>
5438 <\textbf{tag}><\textbf{lit-tag} v="PREP"/><\textbf{/tag}>
5439 <\textbf{/tags}>
5440 <\textbf{lu}>
5441 <\textbf{clip} pos="1" side="tl" part="whole"/>
5442 <\textbf{/lu}>
5443 <\textbf{/chunk}>
5444 <\textbf{b} pos="1"/>
5445 <\textbf{chunk} name="probj" case="caseOtherWord">
5446 <\textbf{tags}>
5447 <\textbf{tag}><\textbf{lit-tag} v="NP"/><\textbf{/tag}>
5448 <\textbf{tag}><\textbf{lit-tag} v="tn"/><\textbf{/tag}>
5449 <\textbf{tag}><\textbf{clip} pos="2" side="tl" part="pers"/><\textbf{/tag}>
5450 <\textbf{tag}><\textbf{clip} pos="2" side="tl" part="gen"/><\textbf{/tag}>
5451 <\textbf{tag}><\textbf{clip} pos="2" side="tl" part="nbr"/><\textbf{/tag}>
5452 <\textbf{/tags}>
5453 <\textbf{lu}>
5454 <\textbf{clip} pos="2" side="tl" part="lem"/>
5455 <\textbf{lit-tag} v="prn"/>
5456 <\textbf{lit-tag} v="2"/>
5457 <\textbf{clip} pos="2" side="tl" part="pers"/>
5458 <\textbf{clip} pos="2" side="tl" part="gen" link-to="4"/>
5459 <\textbf{clip} pos="2" side="tl" part="nbr" link-to="5"/>
5460 <\textbf{/lu}>
5461 <\textbf{/chunk}>
5462 <\textbf{/out}>
5463 \end{alltt}
5464 \end{small}
5465 \caption{Output instruction that sends two chunks separated by a
5466 blank. The printed sequence is a preposition followed by a noun
5467 phrase ("NP"). The tags that are linked from the second chunk to the outside are
5468 pronoun type ("tn"), gender and number of the noun phrase
5469 (pronoun). The \texttt{<\textbf{tag}>} elements are used to specify
5470 the tags of the chunk, and the value of the attributes \texttt{name}
5471 and \texttt{case} is used to specify the pseudolemma of the chunk.}
5472 \label{fig:chunkintrachunk}
5473 \end{figure}
5476 \item \textbf{Use in \texttt{interchunk}}
5478 In this module, lexical forms (words) are inaccessible, since it is
5479 only possible to operate with chunks and, therefore, inside an
5480 \texttt{<\textbf{out}>} element you can only put
5481 \texttt{<\textbf{chunk}>} elements or blanks \texttt{<\textbf{b}>}.
5482 The information on lemma and tags specified here in a \texttt{<\textbf{chunk}>}
5483 element refers exclusively to the lemma (pseudolemma) and the tags of
5484 the chunk.
5486 An example of its use can be found in Figure
5487 \ref{fig:chunkinterchunk}.
5489 \begin{figure}
5490 \begin{small}
5491 \begin{alltt}
5492 <\textbf{out}>
5493 <\textbf{b} pos="1"/>
5494 <\textbf{chunk}>
5495 <\textbf{clip} pos="2" part="lem"/>
5496 <\textbf{clip} pos="2" part="tags"/>
5497 <\textbf{clip} pos="2" part="chcontent"/>
5498 <\textbf{/chunk}>
5499 <\textbf{/out}>
5500 \end{alltt}
5502 \end{small}
\caption{The aim of this rule output is to discard the first chunk of
  the matched pattern (pronoun drop). The three
  \texttt{<\textbf{clip}>} elements have been included here for
  illustrative purposes, since they could have been replaced by a
  single \texttt{<\textbf{clip}>} with \texttt{part="whole"}.}
5509 \label{fig:chunkinterchunk}
5510 \end{figure}
5514 \end{enumerate}
5519 \subsubsection{Element for lexical unit \texttt{<lu>}}
5521 \label{ss:lu} This is the element by means of which each TLLF is sent out at the
5522 end of a rule, inside an \texttt{<\textbf{out}>} element.
5523 With this element, one can send the whole lexical form, using the
5524 attribute \texttt{whole} of a \texttt{<\textbf{clip}>}, or, if
5525 required, specify its parts separately (lemma plus tags, indicated by
5526 means of \texttt{<\textbf{clip}>} strings, literal strings
\texttt{<\textbf{lit}>}, tags \texttt{<\textbf{lit-tag}>},
variables \texttt{<\textbf{var}>}, besides case information
5529 [\texttt{<\textbf{get-case-from}>}, \texttt{<\textbf{case-of}>}]).
Please note that, as has been explained before, in the case of
multiwords with \emph{split lemma} it is necessary to put the
lemma queue back \emph{after} the grammatical symbols of the inflected word
(or lemma head), because the \texttt{pretransfer} module has moved the
queue to put it before the grammatical symbols of the head. This
is done here, inside the \texttt{<\textbf{lu}>} element,
using the values \texttt{lemh} and \texttt{lemq} of the attribute
\texttt{part} in a \texttt{<\textbf{clip}>}. The value \texttt{lemh}
refers to the lemma head, and \texttt{lemq} to the lemma
queue. As can be seen in Figure \ref{fig:case}, the \texttt{lemq}
part of a \texttt{<\textbf{clip}>} is placed after the lemma head and
all the grammatical symbols that follow it. This rule would be
suitable, for example, for the Spanish form \emph{eché de menos} (``I
missed''), which has to be translated into Catalan as \emph{vaig trobar
  a faltar}. The attribute \texttt{a\_verb} which comes after
\texttt{lemh} contains the grammatical symbol that describes the verb
category (\emph{vblex}, \emph{vbser}, \emph{vbhaver} or \emph{vbmod}
as applicable). Therefore, the last lexical form sent by this rule, in
the case of \emph{vaig trobar a faltar}, would be, in the data stream:
5552 \begin{alltt} ^trobar<vblex><inf># a faltar\$ \end{alltt}
5554 \noindent The number sign \texttt{\#} in the data stream corresponds
5555 to the \texttt{<\textbf{g}>} element in dictionaries, used to signal
5556 the position of the invariable part in a split lemma multiword.
It is important to note that the attributes included in
\texttt{<\textbf{lu}>} may be empty. So, a verb matched by the rule in
Fig. \ref{fig:case} which is not a split lemma multiword will be sent
with an empty \texttt{lemq} attribute, since the verb does not have a
lemma queue. This way it is not necessary to define different rules
for lexical forms with and without queue. You can find another example
of this on page \pageref{regla_verbo1}, where the rule for verbs sends
in a \texttt{<\textbf{lu}>} the attributes \texttt{gen}
(\emph{gender}) and \texttt{nbr} (\emph{number}). This way, it
covers both participles (with gender and number) and the rest of verb
forms (which will have these attributes empty).
On the same page you can see a rule for a verb followed by an enclitic
5571 pronoun. Here, the lemma queue is placed after the enclitic pronoun;
5572 so, for a split lemma multiword joined to an enclitic pronoun, such as
5573 \emph{echándote de menos}, the output in the data stream would be,
5574 when translating into Catalan:
5576 \begin{alltt} ^trobar<vblex><ger>+et<prn><enc><p2><mf><sg># a faltar\$
5577 \end{alltt}
Of course, this rule also works for verbs which do not belong to this
multiword type; so, the form \emph{explicándote} (``explaining to you'')
would be output as follows when translating from Spanish to Catalan:
5584 \begin{alltt} ^explicar<vblex><ger>+et<prn><enc><p2><mf><sg>\$
5585 \end{alltt}
As for the attribute \texttt{whole} of a \texttt{<\textbf{clip}>}, it
must be taken into account that it can be used to send the whole
lexical form only if the word being sent cannot be a
multiword, that is, cannot contain a split lemma. Compare Figures
\ref{fig:case} and \ref{fig:regla2}. The \texttt{whole} attribute can
be used in the second example because it contains the lemma
\texttt{lem} plus all the morphological tags of the lexical forms in
positions 1 and 2 (determiner and noun). \nota{but nouns can also be mw now!}By contrast, in the first
example, the lexical form in \texttt{<\textbf{lu}>} is sent in parts,
with a \texttt{lemh} (lemma head) and a \texttt{lemq} (lemma queue),
since the verb matched in the pattern may be a multiword
with split lemma. In practice, in our system this means that the
\texttt{whole} attribute can be used to send any kind of lexical form
except verbs and nouns, because we defined multiwords with inner
inflection only for verbs and nouns.
5603 \subsubsection{Element for lexical unit \texttt{<mlu>}}
5604 \label{ss:mlu}
5606 Its name derives from \emph{multilexical unit}; it is used inside the
5607 \texttt{<\textbf{out}>} element to output multiwords consisting of
5608 more than one lexical form. Each lexical form in a
5609 \texttt{<\textbf{mlu}>} is sent inside a \texttt{<\textbf{lu}>}
5610 element. On the output of the module, lexical forms contained in this
5611 element will be joined to each other by the symbol '+' in the data
5612 stream. This means that they will become a multiword made of different
5613 lexical forms, which will be treated as a single unit by the
5614 subsequent modules; therefore, the generation dictionary will have to
5615 contain an entry for this multiword in order for it to be generated.
5617 In our system, this element is used to join enclitic pronouns to
5618 conjugated verbs.
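
The following fragment is a sketch of how such an output might be written
for a pattern made of a verb (position 1) followed by an enclitic pronoun
(position 2). It is not copied from an actual rule file: the attribute
\texttt{temps} (verb tense) is a hypothetical name assumed to be defined in
the attribute definition section, while \texttt{a\_verb}, \texttt{lemh} and
\texttt{lemq} are used as in Figure \ref{fig:case}. The lemma queue of the
verb is placed after the pronoun, as described in Section \ref{ss:lu}.

\begin{alltt}
<out>
  <mlu>
    <lu>
      <clip pos="1" side="tl" part="lemh"/>
      <clip pos="1" side="tl" part="a_verb"/>
      <clip pos="1" side="tl" part="temps"/>
    </lu>
    <lu>
      <clip pos="2" side="tl" part="whole"/>
      <!-- empty unless the verb is a split lemma multiword -->
      <clip pos="1" side="tl" part="lemq"/>
    </lu>
  </mlu>
</out>
\end{alltt}

In the data stream, the two lexical forms sent this way are joined by the
symbol '+', as in the examples given for \emph{echándote de menos} and
\emph{explicándote} in Section \ref{ss:lu}.
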
5620 \subsubsection{Element for chunk encapsulation \texttt{<chunk>}}
5622 This is the element in which chunks are sent, in an
5623 \texttt{<\textbf{out}>} element, on the output of the module. It is
5624 only used in the \texttt{chunker} module in advanced mode, and in the
5625 \texttt{interchunk} module. It is not used in the \texttt{postchunk}
module because its output does not contain any chunk. Nor is it
used in the \texttt{chunker} module in shallow-transfer mode, because
5628 its output does not contain chunks but individual lexical units and
5629 blanks.
5631 \begin{enumerate}
5633 \item \textbf{Use in \texttt{chunker} in advanced mode}
In this mode, the \texttt{<\textbf{chunk}>} element must have an
attribute \texttt{name}, which is the lemma of the chunk, or an
attribute \texttt{namefrom}, which refers to a previously defined
variable whose value will be used as the lemma of the chunk. In addition,
it can include the attribute \texttt{case} to specify the variable
from which the case policy is taken (for example, a value obtained with the
instruction \texttt{<\textbf{case-of}>}).
5644 An example of its use can be found in Figure
5645 \ref{fig:chunkintrachunk}.
5648 \item \textbf{Use in \texttt{interchunk}}
In this module, the \texttt{<\textbf{chunk}>} element does not
specify any attribute; it is used just as the \texttt{<\textbf{lu}>}
element is used in shallow-transfer mode or in the \texttt{postchunk}
module to delimit the lexical forms. The elements it sends are (generally in
a \texttt{<\textbf{clip}>} instruction): the lemma of the chunk
(\texttt{lem}), its tags (\texttt{tags}) and the chunk content
(\texttt{chcontent}, which contains LFs plus blanks), which is an invariable
part since it cannot be accessed from the \texttt{interchunk} module.
The invariable part of the chunk is sent at the end. You can also use
the \texttt{whole} attribute to send the whole chunk (lemma, tags and
invariable content).
5662 An example of its use can be found in Figure
5663 \ref{fig:chunkinterchunk}.
5665 \end{enumerate}
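
As a brief illustration of the \texttt{whole} attribute in the
\texttt{interchunk} module, the output of Figure \ref{fig:chunkinterchunk}
could be written more compactly as sketched below. The single
\texttt{<clip>} stands for the lemma, the tags and the invariable content of
the chunk in position 2.

\begin{alltt}
<out>
  <b pos="1"/>
  <chunk>
    <clip pos="2" part="whole"/>
  </chunk>
</out>
\end{alltt}
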
5667 \subsubsection{Element for tag links section \texttt{<tags>}}
5669 \textit{Only in chunker in advanced mode}.
5671 This element is used to specify a list of tags, or
5672 \texttt{<\textbf{tag}>} elements, which will become the pseudotags of
the chunk. It does not have attributes, and must be included as the first
item inside the \texttt{<\textbf{chunk}>} element. See Figure
5675 \ref{fig:chunkintrachunk}.
5678 \subsubsection{Element for tag link \texttt{<tag>}}
5680 \textit{Only in chunker in advanced mode}.
5682 The \texttt{<\textbf{tag}>} element must contain a morphological tag,
5683 which can be specified by means of a \texttt{<\textbf{clip}>}
5684 instruction or a literal tag \texttt{<\textbf{lit-tag}>}. It does not
5685 have attributes.
5687 The tag or tags specified this way in a chunk will become the
5688 grammatical symbols of the chunk; the next module,
5689 \texttt{interchunk}, will be able to use them to test and modify the
5690 characteristics of the chunks.
5693 \subsubsection{Element for blank \texttt{<b>}}
5695 The \texttt{<\textbf{b}>} element refers to [super]blanks and is
5696 indexed by the attribute \texttt{pos}. For example, a
5697 \texttt{<\textbf{b}>} with \texttt{pos="2"} refers to the
5698 [super]blanks (including format data encapsulated by the de-formatter)
5699 between the 2nd SLLF and the 3rd SLLF. The explicit management of
[super]blanks enables the correct placement of format when the result
of the structural transfer has more or fewer elements than the
original, or when it has been reordered in some way.
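
For instance, a rule output that swaps the two lexical forms of a
two-word pattern could keep the original [super]blank between them as
sketched below (shallow-transfer mode is assumed; the fragment is
illustrative and not taken from an existing rule file).

\begin{alltt}
<out>
  <lu>
    <clip pos="2" side="tl" part="whole"/>
  </lu>
  <!-- the blank that followed the 1st SLLF is kept between the two words -->
  <b pos="1"/>
  <lu>
    <clip pos="1" side="tl" part="whole"/>
  </lu>
</out>
\end{alltt}
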
5705 \subsection{Specification of the three modules that build an advanced
5706 transfer system}
5707 \label{noutransfer}
5709 In the following lines we describe the differences between the rule
5710 format in the three modules of an advanced transfer system. When
5711 Apertium works as a shallow-transfer system, the only module to be run
5712 is the first one, called \texttt{chunker}, which communicates directly
5713 with the generation module.
5716 \subsubsection{\texttt{Chunker} module}
5717 \label{ss:chunker}
5720 This module can be used alone as a shallow-transfer system, or in
5721 combination with the other two transfer modules to build an advanced
5722 transfer system. An attribute of the \texttt{<transfer>} element
5723 controls its run mode.
5725 \paragraph{Input/output}
5727 \begin{itemize}
5728 \item Input: data in the \texttt{pretransfer} output format, that is,
5729 with invariable queues of multiwords moved to the position right
5730 before the first grammatical symbol.
5732 \item Output:
5733 \begin{itemize}
5734 \item[-] in advanced mode (in an advanced transfer system): chunks,
5735 that will be detected and processed by the next module
5736 \item[-] in shallow-transfer mode (in a shallow-transfer system):
5737 lexical forms, that will be the input of the generation module.
5738 \end{itemize}
5740 \end{itemize}
5743 \paragraph{Data files}
\nota{Explain better the point about the single configuration file}
5747 This program uses a single configuration file and a precompiled file
5748 for pattern detection calculated from the former. The name of the
5749 pattern file (the configuration file) will have the extension
5750 \texttt{.t1x}. Since the \texttt{chunker} is the program that looks
5751 up the bilingual dictionary, this dictionary (compiled) also has to be
5752 provided to the program.
\nota{It might be a good idea to mention in which section the
  compiler referred to here is explained}
5757 The DTD of this data file is specified in Appendix
5758 \ref{ss:dtdtransfer}, and the elements used to create the rules in the
5759 file are described in Section \ref{formatotransfer}.
5761 \paragraph{Pattern matching}
The rule matching system in this module is the one described in
\ref{functransfer}, since it is the same in advanced transfer mode and
in shallow-transfer mode. The \texttt{a\-per\-tium-pre\-trans\-fer}
program \nota{Terminological inconsistency: \texttt{pretransfer}.} is
needed to adapt the tagger output format to the input format required
by the transfer module. It is possible that, in later
versions of Apertium, the \textit{part-of-speech tagger} will be modified
so that it does the work of \texttt{apertium-pretransfer}.
\nota{We also need to unify the terminology of other modules:
  \emph{desambiguador categorial}, \emph{etiquetador}; as the paragraph
  is written, one might think they are two different things.}
5776 \paragraph{How it works}
5778 The module works similarly in shallow-transfer mode and in advanced
5779 mode, with these differences:
5781 \begin{itemize}
\item If we want the module to work as the first module in an
  advanced transfer system, we must specify the value \texttt{chunk} in
  the optional attribute \texttt{default} of the root element
  \texttt{<transfer>}. The default value is \texttt{lu}, which implies
  that the \texttt{chunker} works in shallow-transfer mode (as a single
  module).
5789 \item Chunk generation in the output: the \texttt{<chunk>} tag is an
5790 element one level higher than \texttt{<lu>} (\textit{lexical unit}),
5791 which generates chunks with the characteristics described in
5792 \ref{sec:format}; it has the following attributes:
5794 \begin{itemize}
5795 \item \texttt{name} (optional): pseudolemma of the chunk. It
5796 contains a string that is identified as the pseudolemma of the
5797 chunk.
  \item \texttt{namefrom} (optional): pseudolemma of the chunk,
    obtained from a variable. Either \texttt{name} or
    \texttt{namefrom} must be specified.

  \item \texttt{case} (optional): variable from which the case
    information is obtained and applied to the lemma specified in
    \texttt{name} or in \texttt{namefrom}.
5806 \end{itemize}
5808 \item Each chunk begins with a \texttt{<tags>} instruction, which does
5809 not allow any attribute, and which contains one or more individual
5810 instructions \texttt{<tag>}.
5811 \item Instructions \texttt{<tag>} do not have attributes. They can
5812 include any instruction that returns a string as a value:
5813 \texttt{<lit>}, \texttt{<var>} \nota{clip, lit-tag}.
5814 \item Instructions \texttt{<clip>} have an optional attribute:
5815 \texttt{link-to}, which is used to specify a tag \verb!<!\textit{value
5816 of link-to}\verb!>! that replaces \nota{Spanish: ``una etiqueta en
5817 lugar de'' (instead of) or ``additionally''?. Explain new aspects of
5818 link-to} the information specified by the \texttt{<clip>} in the rest
  of its attributes.\nota{Not very easy to understand} This
5820 information is dispensable but can be useful as information on the
5821 origin of the linguistic decision.
5822 \end{itemize}
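
The run mode described in the first item of the list above is selected in
the root element of the rule file. The following skeleton is only a sketch:
the definition sections and the rules are omitted and represented by a
comment.

\begin{alltt}
<transfer default="chunk">
  <!-- definition sections and rules go here -->
</transfer>
\end{alltt}

Omitting the \texttt{default} attribute, or giving it the value
\texttt{lu}, makes the \texttt{chunker} work in shallow-transfer mode.
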
The following is an example of the use of the \texttt{<chunk>} element:
5826 \begin{alltt}
5827 <out>
5828 <chunk name="adj-noun" case="variableCase">
5829 <tags>
5830 <tag><lit-tag v="NP"/></tag>
5831 <tag><clip pos="2" side="tl" part="gen"/></tag>
5832 <tag><clip pos="2" side="tl" part="nbr"/></tag>
5833 </tags>
5834 <lu>
5835 <clip pos="2" side="tl" part="lemh"/>
5836 <clip pos="2" side="tl" part="a_noun"/>
5837 <clip pos="2" side="tl" part="gen" link-to="2"/>
5838 <clip pos="2" side="tl" part="nbr" link-to="3"/>
5839 </lu>
5840 <b pos="1"/>
5841 <lu>
5842 <var n="adjectiu"/>
5843 <clip pos="1" side="tl" part="lem"/>
5844 <clip pos="1" side="tl" part="a_adj"/>
5845 <clip pos="2" side="tl" part="gen" link-to="2"/>
5846 <clip pos="2" side="tl" part="nbr" link-to="3"/>
5847 </lu>
5848 </chunk>
5849 </out>
5850 \end{alltt}
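
The \texttt{namefrom} attribute can be used in the same way when the
pseudolemma is only known while the rule is being applied. The
fragment below is a minimal sketch, not taken from any released
language pair: the variable name \texttt{chunkName} is hypothetical,
and it is assumed to be declared with the other variables of the rules
file and to receive its value through a \texttt{<let>} instruction.

\begin{alltt}
<let><var n="chunkName"/><lit v="adj-noun"/></let>
<out>
  <chunk namefrom="chunkName" case="variableCase">
    <tags>
      <tag><lit-tag v="NP"/></tag>
    </tags>
    <lu>
      <clip pos="2" side="tl" part="lemh"/>
      <clip pos="2" side="tl" part="a_noun"/>
    </lu>
  </chunk>
</out>
\end{alltt}

Under these assumptions the chunk gets the same pseudolemma as with
\texttt{name="adj-noun"}; the difference is only that the value is
read from the variable when the rule is executed.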
5853 \paragraph{Default action}
Isolated \textit{superblanks} which are not detected by any pattern in
this module are written in the same order in which they arrive.
5858 The default action for words not matched by any pattern
5859 is different depending on the transfer mode (that is, on the value of the
5860 optional attribute \texttt{default} of the root element \texttt{<transfer>}):
5862 \begin{itemize}
\item if the value is \texttt{chunk} (i.e. the module works in advanced
  mode), it generates trivial chunks for the words not matched by
  any rule, so that no word in the output is left outside a
  chunk. Each new chunk is created with the translation of the
  word given by the bilingual dictionary. The fixed lemma of these
  implicitly created chunks is \texttt{default}.
\item if the value is \texttt{lu} (the default value; i.e. the module
  works as the single module in a shallow-transfer system), it does not
  create chunks for words not matched by rules; they are just
  translated using the bilingual dictionary.
5874 \end{itemize}
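
As a minimal sketch, the advanced mode described above would be
selected in the root element of the rules file as follows. Only the
attribute \texttt{default} and its two values come from the
description above; the contents of the element are omitted.

\begin{alltt}
<transfer default="chunk">
  <!-- definitions and rules of the chunker go here -->
</transfer>
\end{alltt}

Leaving the attribute out, or writing \texttt{default="lu"}, gives the
single-module behaviour.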
5876 The following is an automatically generated chunk for a lexical form
5877 not matched by any rule in the \texttt{chunker} module when the
5878 \texttt{default} attribute has the value \texttt{chunk}:
5881 \begin{alltt}
5882 ^default\verb!{!^that<cnjsub>$\verb!}!$
5883 \end{alltt}
\nota{Does it go without tags between \texttt{default} and \texttt{\{}?
  Shouldn't this be stated explicitly?}
5889 \subsubsection{\texttt{Interchunk} module}
5890 \label{ss:interchunk}
5893 \nota{\texttt{apertium-interchunk} or simply \texttt{interchunk}?}
The \texttt{interchunk} module processes chunks; it may reorder them
and change their morphosyntactic information. This is done by detecting
patterns of chunks (sequences of chunks). The instructions that
control how it works are, with minor differences, the same as those
used by the \texttt{chunker} module; they are written, however, in a
different file. Chunks are processed here in much the same way as
words are processed in the \texttt{chunker} of Apertium. \nota{Check
  the names of the programs}
5904 \paragraph{Input/output}
5906 \begin{itemize}
5907 \item Input: chunks from the \texttt{chunker}.
\item Output: chunks, possibly reordered and with the data of their
  pseudolemmas (lexical pseudoforms) possibly changed.
5910 \end{itemize}
5912 \paragraph{Data files}
This module uses two data files: a specification file of the
\texttt{in\-ter\-chunk} program, with extension \texttt{.t2x} by
analogy with the previous module, and a file of precalculated patterns
that speeds up the analysis of the input. The binary file of the
bilingual dictionary is not included because it is not used.
\nota{Cite the compiler?}
5921 The syntax of the specification file is very similar to that of the
5922 \texttt{chunker}. Its DTD is specified in Appendix
5923 \ref{ss:dtdinterchunk}, and the elements used to create the rules in
5924 the file are described in Section \ref{formatotransfer}.
5927 \paragraph{Pattern matching}
Rules detect patterns defined by sequences of lexical
pseudoforms. These lexical pseudoforms have a format based on the
format of lexical forms for words. In practice, for the purposes of
pattern matching, a lexical pseudoform is treated in the same way as
\nota{mlforcada: The alternation between \emph{pseudolemma} and
  \emph{pseudoword} must be resolved. MG: I have translated everything
  as 'lexical pseudoform'; I think that was the intended sense.} a
lexical form is treated in the \texttt{chunker}. Pattern matching is
therefore based on the attributes defined for lexical pseudoforms, not
on those of the lexical forms (words) of the original pattern.
5940 \paragraph{How it works}
Compared with the set of instructions used in the \texttt{chunker},
the set of instructions for this module changes as follows:
5945 \begin{itemize}
5946 \item The root element is called \texttt{<interchunk>} and does not
5947 have any attribute.
\item The attribute \texttt{side} disappears: this module does not use
  bilingual dictionaries; therefore, the attribute used to indicate
  whether the consulted side is SL or TL loses its meaning. This
  attribute was basically used in the \texttt{<out>} instructions.
5952 \item The \texttt{<chunk>} tag is used here without attributes, simply
5953 inside an \texttt{<out>} to delimit the output of chunks.
5954 \item The predefined attribute \texttt{lem} refers to the pseudolemma
5955 of the chunk. In the same way, the predefined attribute \texttt{tags}
5956 refers to the grammatical symbols or tags of the chunk. The chunk
5957 content becomes something like a queue which can be printed with the
5958 implicit attribute \texttt{chcontent}.\nota{Només imprimir o s'hi pot
5959 fer referència també?} \nota{Dir de quin element són aquests
5960 atributs}
5961 \item All the values of \texttt{part}, except \texttt{chname}, access
5962 the pseudolemma and the tags of the chunk (not of individual words).
\item Unlike in the \texttt{chunker} module, the \texttt{<out>}
  instructions in the rules of this module may only print
  \texttt{<chunk>}s, never isolated words.\nota{MG: and blanks too,
    right?}
5967 \end{itemize}
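
The points above can be illustrated with a minimal sketch of a
\texttt{.t2x} rule. It is not taken from any released language pair:
the category name \texttt{SN} and its tag pattern are hypothetical, the
definition sections other than the one the rule needs are left out, and
the section and rule elements are assumed to keep the names they have
in the \texttt{chunker}.

\begin{alltt}
<interchunk>
  <section-def-cats>
    <def-cat n="SN">
      <cat-item tags="SN.*"/>
    </def-cat>
  </section-def-cats>
  <!-- other definition sections omitted -->
  <section-rules>
    <rule>
      <pattern>
        <pattern-item n="SN"/>
      </pattern>
      <action>
        <out>
          <chunk>
            <clip pos="1" part="lem"/>
            <clip pos="1" part="tags"/>
            <clip pos="1" part="chcontent"/>
          </chunk>
        </out>
      </action>
    </rule>
  </section-rules>
</interchunk>
\end{alltt}

Note that no \texttt{<clip>} carries a \texttt{side} attribute and that
\texttt{<chunk>} has no attributes. As written, the rule prints the
pseudolemma, the tags and the content of the matched chunk unchanged; a
real rule would modify some of these parts before the \texttt{<out>}.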
5970 \paragraph{Default action}
As in the previous module, a default action has been defined, which
writes without modification the chunks not matched by any pattern of
the specification file. This default action writes exactly what it
reads, be it chunks or blanks. \nota{Note the hesitation between
  \emph{rule} and \emph{action} in the rest of the document. I had
  always thought that \emph{rule}=\emph{pattern}+\emph{action}.}
5980 \subsubsection{\texttt{Postchunk} module}
5981 \label{ss:postchunk}
The \texttt{postchunk} module detects single chunks and, for each of
them, performs the specified actions. Detection is based on the lemma
of the chunk, and not on patterns of tags; detection in this module is
therefore specific to each chunk ``name''.\nota{Once the terminology is
  settled, we must make sure the wording of this part is appropriate.}

In addition, references to parameters are resolved right after
detection, that is, the tags \texttt{<1>}, \texttt{<2>}, etc. are
automatically replaced by the values of the parameters before
processing begins. Positions (attribute \texttt{pos}) specified in
instructions such as \texttt{<clip>} refer to the position of the
words inside the chunk.

The case policy (see Section \ref{ss:majuscules}) is also applied
automatically, from the pseudolemma of the chunk to the words inside
the chunk.
6005 \paragraph{Input/output}
6007 \begin{itemize}
6008 \item Input: chunks from the \texttt{in\-ter\-chunk}.
6009 \item Output: valid input for the morphological generator of Apertium.
6010 \end{itemize}
6012 \paragraph{Data files}
This program has its own specification file, which has the extension
\texttt{.t3x}. Its syntax is likewise based on that of the
specification files of the \texttt{chunker} and the
\texttt{in\-ter\-chunk}. \nota{Explain that it does not need to read
  any compiled pattern file because it uses names and not patterns?}
6020 \paragraph{Pattern matching}
6022 Chunk matching is based on the name of the chunk. Unmatched chunks
6023 receive the default processing.
6025 \paragraph{How it works}
6027 The differences with regard to the \texttt{in\-ter\-chunk} module are
6028 the following:
6030 \begin{itemize}
\item It is not allowed to write chunks (\texttt{<chunk>}) in the
  output: only lexical units (\texttt{<lu>} or \texttt{<mlu>}) and
  blanks can be written. \nota{Check this item: it was incomplete and
    mlf completed it.}
\item A new detection attribute, \texttt{name}, is available in
  \texttt{<cat-item>}. It is used on its own in the
  \texttt{<pattern>} part of rules, so that a chunk is detected by its
  name. \nota{mlf: What does ``in isolation'' mean? It seems to mean
    ``from time to time''. MG: the attribute 'name' is used in the
    pattern part of rules? is this correct?}
\item As in the \texttt{in\-ter\-chunk}, the attribute \texttt{side}
  is not used here, and for the same reason: the bilingual dictionary
  is not looked up. \nota{MG: but then this is not a difference with
    respect to \texttt{interchunk}, is it?}
6045 \end{itemize}
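
A minimal sketch of a \texttt{.t3x} rule that reflects these
differences is given below. It is an illustration under stated
assumptions, not a rule from a released language pair: the root element
name \texttt{<postchunk>} is assumed by analogy with
\texttt{<interchunk>}, the chunk name \texttt{adj-noun} is reused from
the earlier \texttt{<chunk>} example, and the predefined attributes
\texttt{lem} and \texttt{tags} are assumed to select the lemma and the
tags of the word at the given position.

\begin{alltt}
<postchunk>
  <section-def-cats>
    <def-cat n="adj-noun">
      <cat-item name="adj-noun"/>
    </def-cat>
  </section-def-cats>
  <!-- other definition sections omitted -->
  <section-rules>
    <rule>
      <pattern>
        <pattern-item n="adj-noun"/>
      </pattern>
      <action>
        <out>
          <lu>
            <clip pos="2" part="lem"/>
            <clip pos="2" part="tags"/>
          </lu>
          <b pos="1"/>
          <lu>
            <clip pos="1" part="lem"/>
            <clip pos="1" part="tags"/>
          </lu>
        </out>
      </action>
    </rule>
  </section-rules>
</postchunk>
\end{alltt}

The pattern contains a single item that matches by chunk name, the
output consists only of lexical units and a blank, and \texttt{pos}
refers to words inside the chunk; here the two words are written in
swapped order.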
6047 \paragraph{Default action}
In this module, the default action is to write the words contained in
the chunks, replacing the references with the parameters of the
chunk. It is expected to apply to most chunks, since this module is
meant to perform non-default actions only in specific cases that
require special processing.

The case policy is also applied by default (see Section
\ref{ss:majuscules}).

In any case, blanks outside chunks are copied in the same order as
they are read, since chunk matching is done individually (this module
does not group chunks).
6066 \subsection{Preprocessing of the structural transfer module}
6067 \label{ss:preproceso_transfer}
Specification files for the structural transfer modules, also called
\emph{transfer rules files}, are pre-processed by the program
\textit{apertium-preprocess-transfer}, which calculates the patterns
used to match the preconditions of the rules, and indexes the rules to
speed up their processing at execution time. This information is saved
in a binary file which is read together with the bilingual dictionary
and the rules file itself, because the structural transfer and lexical
transfer modules are executed together.
6079 \section{De-formatter and re-formatter}
6080 \label{se:desformat}
6083 \subsection{Format processing}
6084 \label{ss:formato}
6086 This section describes how the de-formatter and re-formatter process
6087 the format of the documents. These two modules are created from a set
6088 of format specification rules in XML, which are described in Section
6089 \ref{ss:reglasformato}.
6092 Apertium can process documents in XML, HTML, RTF and plain text. For
6093 all these document types, format is \textit{encapsulated} as explained
6094 in the following lines.
6096 Text strings that are identified as part of the format ---from now on
6097 referred to as \textit{blocks of format} or \textit{superblanks}---
6098 are encapsulated between delimiters that depend on the specification
6099 of the data flow between modules (which is described in detail in
6100 Section~\ref{se:flujodatos}); so, in the flow format (sections
6101 \ref{se:noxml1} and \ref{se:noxml2}), \emph{superblanks} are put
between brackets '\texttt{[}' and '\texttt{]}'. Each of these
encapsulated strings will be treated as if it were a blank
\texttt{<\textbf{b}/>} (page~\pageref{s3:b}), which is why they are
called \textit{superblanks}, and will be restored in the correct
order in the translator's output.
6108 As has been explained in Section \ref{se:flujodatos}, when the blocks
6109 of format are large (as is sometimes the case in HTML with Javascript
6110 code fragments, or in RTF with bitmap images), these blocks will be
6111 saved as temporary files so that they can be removed from the data
6112 flow of the translation.
6114 Sometimes, the format in a document can implicitly indicate the
6115 division of the text into sentences (see page \pageref{finfrase} in
6116 Section \ref{se:flujodatos}). For example, section or document titles
6117 can be a sentence without full stop. If we know that a format mark is
6118 indicating this division, we have to take advantage of this
6119 information in order to do a better translation. Some examples of
6120 format that give us data about the end of a sentence are: two
6121 consecutive line breaks in plain text format, a \texttt{</h1>} tag in
6122 HTML, etc. The de-formatter generates in such cases a mark of sentence
6123 end that is equivalent to a full stop.
6125 \subsubsection{Format encapsulation method}
6127 The types of blocks of format or \emph{superblanks} that can be
6128 generated as a result of the format processing are the following:
6130 \begin{itemize}
6131 \item \textit{Non-empty blocks of format or superblanks}. They
6132 contain exclusively format marks of the source document. In the data
flow described in Section~\ref{se:flujodatos}, they begin with a left
6134 square bracket '\texttt{[}' and end with a right square bracket
6135 '\texttt{]}'.
6136 \item \textit{Blocks of format with reference to an external file} or
6137 \textit{extensive superblanks}. They encapsulate long format fragments
6138 in a way that improves the translator's performance. In the data flow
6139 described in Section~\ref{se:flujodatos}, they begin with the
6140 characters '\texttt{[@}', then there is the name of the file where the
6141 format fragment extracted from the source text is saved, and finally
6142 they end with a right square bracket '\texttt{]}'.
6143 \item \textit{Empty blocks of format}. They contain artificial
6144 information on text division obtained from the format data. Before
6145 the empty block of format, the system places the appropriate
6146 artificial punctuation mark. When the original format is restored in
6147 the document at the end of the process, the presence of a block of
6148 format like this will cause the deletion of the character right before
6149 the block in the data flow.
6150 \end{itemize}
%% [moved to the appendix]
%% Inside blocks of format, the characters '\texttt{[}', '\texttt{]}',
%% '\texttt{@}' and '\verb!\!' are escaped with the escape sequences
%% '\verb!\[!', '\verb!\]!', '\verb!\@!' and '\verb!\\!', respectively. This
%% has to be taken into account when encapsulating and de-encapsulating. Outside
%% blocks of format, opening and closing square brackets must also be
%% escaped.
6160 The general criteria applied to the creation of blocks of format are
6161 the following:
6163 \label{pg:criteri}
6164 \begin{itemize}
\item Everything that is considered not to be part of the text to be
  translated has to be encapsulated in blocks of format.
\item There cannot be two or more strictly consecutive non-empty
  blocks of format. Two consecutive blocks of format must be merged
  into a single block.
\item Empty blocks of format must precede a non-empty block of format
  or the end of the file.
6172 \end{itemize}
Figure~\ref{fg:ejemplopelado} shows an example document whose format
must be processed before translation.
Figure~\ref{fg:ejemploencapsulado} displays the result of processing
this document; the encapsulation corresponds to the flow format not
based on XML.
6182 \begin{figure}[htbp]
6183 \begin{small}
6184 \begin{alltt}
6185 <html>
6186 <head>
6187 <title>This is the title</title>
6188 <script>
6189 <!-- ... (an extensive code block) -->
6190 </script>
6191 </head>
6192 <body>
6193 <p>This
6194 is a paragraph in two lines</p>
6195 </body>
6196 </html>
6197 \end{alltt}
6198 \end{small}
6199 \caption{Example of HTML document}
6200 \label{fg:ejemplopelado}
6201 \end{figure}
6203 \begin{figure}[htbp]
6204 \begin{small}
6205 \begin{alltt}
6206 \textbf{[<html>
6207 <head>
6208 <title>]}This is the title\textbf{.[][@/tmp/temp35345]}This\textbf{[
6209 ]}is a paragraph in two lines\textbf{.[][</p>
6210 </body>
6211 </html>]}
6212 \end{alltt}
6213 \end{small}
\caption{Example of HTML document where the blocks of format have been
  encapsulated by the de-formatter}\nota{repeats material from the
  format chapter; to be revised (Gema)}
6217 \label{fg:ejemploencapsulado}
6218 \end{figure}
6220 We would like to emphasize the following from this example:
6221 \begin{itemize}
6222 \item The system does not generate consecutive blocks of format with
6223 content (non-empty).
6224 \item Tags like \texttt{</\textbf{title}>} or \texttt{</\textbf{p}>}
6225 cause the insertion of an artificial punctuation mark; this insertion
6226 is done systematically, even when it is not necessary, because it does
6227 not interfere and is efficient.
\item Extensive superblanks are literally removed from the translation
  process. In this case, the temporary file \texttt{temp35345} contains
  the tags from \texttt{</\textbf{title}>} to \texttt{<\textbf{p}>}.
\item Single blanks between words are not encapsulated, but the
  system does encapsulate multiple blanks (two or more consecutive
  blanks), tabs, etc. Line breaks are also encapsulated.
6234 \end{itemize}
6241 \subsection{Data: format specification rules}
6242 \label{ss:reglasformato} This section describes how the de-formatter
6243 and re-formatter are generated from a format specification in XML.
Rules for format, like linguistic data, are specified in XML, and they
contain regular expressions with \texttt{flex} syntax. The
specification is divided into three parts (see its DTD in
Appendix~\ref{ss:dtd_formato}):
6251 \begin{itemize}
6252 \item \textbf{Configuration options}. Here one specifies the value for
6253 the maximum length of a non-extensive superblank, the input and output
6254 encodings, whether case must be considered, and the regular expressions for
6255 escape characters and space characters.
6257 \item \textbf{Format rules}. Describes the set of tags belonging to a
6258 specific format which have to be included in a block of format by the
6259 de-formatter. These tags may, optionally, indicate a sentence end, in which case
6260 the de-formatter will insert an artificial punctuation mark (followed
6261 by an empty block of format, as explained in the previous
6262 section). One has to specify the priority of application of the rules,
6263 although, when this is not relevant, it is possible to give the same
6264 priority to all the rules by assigning them the same value (any
6265 number).
6267 Everything that is not specified as format will be left without
6268 encapsulation and, therefore, will be considered as translatable
6269 text.
\item \textbf{Replacement rules}. They allow special characters in the
  text to be replaced. A regular expression recognizes \nota{MG: HELP:
    in Spanish, "recogerá", I don't know how to translate this:
    include/detect/group/recognize???} a set of special characters,
  which are then replaced with the specified characters. For example,
  in HTML, characters written as entities or as decimal or hexadecimal
  codes have to be replaced with the corresponding character:
  \texttt{cami\&oacute;n} corresponds to \texttt{camión}.
6279 \end{itemize}
6281 Rules are described in more detail next.
6282 \begin{itemize}
6283 \item Root of the specification file. The attribute \texttt{name}
6284 contains the name of the format.
6285 \begin{small}
6286 \begin{alltt}
6287 <?xml version="1.0" encoding="ISO-8859-1"?>
6288 <format name="html">
6289 <options>
6290 ...
6291 </options>
6293 <rules>
6294 ...
6295 </rules>
6296 </format>
6297 \end{alltt}
6298 \end{small}
6300 \end{itemize}
6302 It has to include the options and rules, an example of which is
6303 presented next:
6305 \begin{itemize}
6307 \item Options.
6308 \begin{small}
6309 \begin{alltt}
6310 <options>
6311 <largeblocks size="8192"/>
6312 <input encoding="ISO-8859-1"/>
6313 <output encoding="ISO-8859-1"/>
6314 <escape-chars regexp='[\verb!\![\verb!\!]^\$\verb!\!\verb!\!]'/>
6315 <space-chars regexp='[ \verb!\!n\verb!\!t\verb!\!r]'/>
6316 <case-sensitive value="no"/>
6317 </options>
6318 \end{alltt}
6319 \end{small}
6321 \end{itemize}
6323 The element \texttt{<largeblocks>} specifies the maximum length of a
6324 non-extensive superblank, through the value of the attribute
6325 \texttt{size}. The elements \texttt{<input>} and \texttt{<output>}
6326 specify the input and output encoding of the text, through the
6327 attribute \texttt{encoding}.
The element \texttt{<escape-chars>} specifies, by means of a regular
expression declared in the value of the attribute \texttt{regexp},
which characters must be escaped with a backslash. The element
\texttt{<space-chars>} specifies the set of characters that must be
considered as blanks.

Finally, the element \texttt{<case-sensitive>} specifies whether case
is significant in the regular expressions contained in the attributes
of the format specification.
6340 \begin{itemize}
6341 \item Rules. There are format rules and replacement rules.
6342 \begin{small}
6343 \begin{alltt}
6344 <rules>
6345 <format-rule ... >
6347 </format-rule>
6350 <replacement-rule>
6352 </replacement-rule>
6354 </rules>
6355 \end{alltt}
6356 \end{small} The two types are described in the following points.
6358 \item Format rules. The de-formatter will encapsulate in blocks of
6359 format the tags indicated by these rules in the field
6360 \texttt{regexp}. If they are begin and end tags, and everything
6361 delimited by them is format, one has to specify a \texttt{regexp} both
6362 for \texttt{begin} and for \texttt{end}:
6363 \begin{small}
6364 \begin{alltt}
6365 <format-rule eos="no" priority="1">
6366 <begin regexp='"\verb!\!\&lt;!--"'/>
6367 <end regexp='"--\verb!\!\&gt;"'/>
6368 </format-rule>
6369 \end{alltt}
6370 \end{small} Otherwise only one \texttt{begin-end} element is used:
6371 \begin{small}
6372 \begin{alltt}
6373 <format-rule eos="yes" priority="3">
6374 <begin-end regexp='"\&lt;"[/]?"li"[^\&gt;]*"\&gt;"'/>
6375 </format-rule>
6376 \end{alltt}
6377 \end{small}
6380 Besides, in \texttt{priority} you have to specify a priority to tell
6381 the system in which order the format rules must be applied (the
6382 absolute value is not relevant, only the order resulting from the
6383 values). In ``\texttt{eos}'' you indicate, with \texttt{yes} or
6384 \texttt{no}, whether the block of format that contains the detected
6385 pattern must be preceded by an artificial punctuation mark or
6386 not.\footnote{In all these cases, the typical entities \texttt{\&lt;}
6387 and \texttt{\&gt;} are used to represent the characters \texttt{<} and
6388 \texttt{>} respectively.}
\item Replacement rules. They are used to replace special characters
  in the text. The regular expression in the attribute \texttt{regexp}
  recognizes \nota{idem: help in translation of "recogerá"} a set of
  special characters, which are replaced with the specified characters
  in the text to be translated. The correspondence between original and
  replacement characters is stated in the attributes \texttt{source} and
  \texttt{target} of the \texttt{replace} elements, of which there can
  be several:
6398 \begin{small}
6399 \begin{alltt}
6400 <replacement-rule regexp='"\&amp;"[^;]+;'>
6401 <replace source="\&amp;Agrave;" target="À"/>
6402 <replace source="\&amp;#192;" target="À"/>
6403 <replace source="\&amp;#xC0;" target="À"/>
6404 <replace source="\&amp;#xc0;" target="À"/>
6405 <replace source="\&amp;Aacute;" target="Á"/>
6406 <replace source="\&amp;#193;" target="Á"/>
6407 <replace source="\&amp;#xC1;" target="Á"/>
6408 <replace source="\&amp;#xc1;" target="Á"/>
6410 </replacement-rule>
6411 \end{alltt}
6412 \end{small}
6413 \item Regular expressions of \texttt{regexp} attributes. They have the
6414 syntax used in \texttt{flex} \cite{lesk75tr}.
6416 \end{itemize}
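
As a further sketch of a format rule, the following fragment would
mark the HTML heading tags \texttt{<h1>} and \texttt{</h1>} as format
that ends a sentence. It simply follows the pattern of the
\texttt{li} rule shown above; whether the specification shipped with
Apertium contains exactly this rule is not confirmed here.

\begin{small}
\begin{alltt}
<format-rule eos="yes" priority="3">
  <begin-end regexp='"\&lt;"[/]?"h1"[^\&gt;]*"\&gt;"'/>
</format-rule>
\end{alltt}
\end{small}

With \texttt{eos="yes"}, the block of format containing the tag is
preceded by an artificial punctuation mark, so a title without a full
stop is still treated as a complete sentence.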
% DTD moved to the Appendix
As an example of a format specification, we give the one for HTML. The
explanation in the following paragraphs can be followed by looking
at Figure \ref{fg:formato-html}.
First, there is the format rule that specifies all HTML tags in a
general way: it treats as an HTML tag everything that begins with the
sign \textbf{\texttt{<}} and ends with the sign
\textbf{\texttt{>}}. This rule has the lowest priority (4), so that
the more specific, higher-priority rules are tried before a tag is
handled by this general rule. In the case of HTML, the highest
priority (1) is for comments \texttt{<!-- ... -->}. The begin and end
marks \texttt{<script> </script>} and \texttt{<style> </style>}, where
everything enclosed by them is considered to be format, have
priority 2. Priority 3 is for tags that indicate end of sentence
(artificial punctuation), such as \texttt{</br>}, \texttt{</hr>} and
\texttt{</p>}.
Last of all are the replacement rules, which replace all the codes
that begin with \texttt{\&} and end with \texttt{;}, as specified in
the regular expression. Then, each one of the replacements is defined:
\texttt{\&Agrave;}, as well as \texttt{\&\#192;}, \texttt{\&\#xC0;} and
\texttt{\&\#xc0;}, is replaced with \texttt{À}. The remaining special
characters are declared in the same way.
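
For instance, the \texttt{cami\&oacute;n} case mentioned earlier could
be covered by \texttt{replace} elements such as the following. This is
a sketch that follows the pattern of the \texttt{À} entries; the exact
entries in the shipped HTML specification may differ.

\begin{small}
\begin{alltt}
<replacement-rule regexp='"\&amp;"[^;]+;'>
  <replace source="\&amp;oacute;" target="ó"/>
  <replace source="\&amp;#243;" target="ó"/>
  <replace source="\&amp;#xF3;" target="ó"/>
  <replace source="\&amp;#xf3;" target="ó"/>
</replacement-rule>
\end{alltt}
\end{small}

The named entity, the decimal code and both spellings of the
hexadecimal code are listed separately because each \texttt{source} is
compared as a literal string.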
6449 \begin{figure}[htbp]
6450 \begin{small}
6451 \begin{alltt}
6452 <?xml version="1.0" encoding="ISO-8859-1"?>
6453 <format name="html">
6454 <options>
6455 <largeblocks size="8192"/>
6456 <input encoding="ISO-8859-1"/>
6457 <output encoding="ISO-8859-1"/>
6458 <escape-chars regexp='[\verb!\![\verb!\!]^\$\verb!\!\verb!\!]'/>
<space-chars regexp='[ \verb!\!n\verb!\!t\verb!\!r]'/>
6460 <case-sensitive value="no"/>
6461 </options>
6463 <rules>
6464 <format-rule eos="no" priority="1">
6465 <begin regexp='"\&lt;!--"'/>
6466 <end regexp='"--\&gt;"'/>
6467 </format-rule>
6469 <format-rule eos="no" priority="2">
6470 <begin regexp='"\&lt;script"[^\&gt;]*"\&gt;"'/>
6471 <end regexp='"\&lt;/script"[^\&gt;]*"\&gt;"'/>
6472 </format-rule>
6473 <format-rule eos="no" priority="2">
6474 <begin regexp='"\&lt;style"[^\&gt;]*"\&gt;"'/>
6475 <end regexp='"\&lt;/style"[^\&gt;]*"\&gt;"'/>
6476 </format-rule>
6478 <format-rule eos="yes" priority="3">
6479 <begin-end regexp='"\&lt;"[/]?"br"[^\&gt;]*"\&gt;"'/>
6480 </format-rule>
6481 <!-- Here come more declarations of format-rule eos="yes"-->
6482 <!-- ... -->
6484 <format-rule eos="no" priority="4">
6485 <begin-end regexp='"\&lt;"[a-zA-Z][^\&gt;]*"\&gt;"'/>
6486 </format-rule>
6488 <replacement-rule regexp='"\&amp;"[^;]+;'>
6489 <replace source="\&amp;Agrave;" target="À"/>
6490 <replace source="\&amp;#192;" target="À"/>
6491 <replace source="\&amp;#xC0;" target="À"/>
6492 <replace source="\&amp;#xc0;" target="À"/>
6493 <!-- Here come more replace elements -->
6494 <!-- ... -->
6495 </replacement-rule>
6496 </rules>
6497 </format>
6498 \end{alltt}
6499 \end{small}
6500 \caption{Part of the rules definition for HTML format}
6501 \label{fg:formato-html}
6502 \end{figure}
6505 \subsection{Generation of de-formatters and re-formatters}
6506 \label{se:gendeformat}
To generate the de-formatter and re-formatter for a given format, a
style sheet that carries out the generation is applied to the XML
rules that declare the format. This XSLT transformation produces a
\texttt{lex} \cite{lesk75tr} file that, once compiled, is the
executable of the de-formatter and the re-formatter for the specified
format.
6515 Thanks to the general specification of formats described in this
6516 chapter, it has been possible to define specifications for HTML, RTF
6517 and plain text. These specifications are in the \texttt{apertium}
6518 package, in the respective files \texttt{html-format.xml},
6519 \texttt{rtf-format.xml}, \texttt{txt-format.xml}. In particular, it
6520 is quite simple to define de-formatters and re-formatters for any XML
6521 format.
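
As a rough sketch of how small such a specification can be, the
following file treats every tag and every comment of a hypothetical
XML format as format and everything else as translatable text. The
format name \texttt{myxml} is invented for the example, the options
are copied from the HTML specification, and replacement rules for
entities are left out; it has not been tested against the generation
style sheet.

\begin{small}
\begin{alltt}
<?xml version="1.0" encoding="ISO-8859-1"?>
<format name="myxml">
  <options>
    <largeblocks size="8192"/>
    <input encoding="ISO-8859-1"/>
    <output encoding="ISO-8859-1"/>
    <escape-chars regexp='[\verb!\![\verb!\!]^\$\verb!\!\verb!\!]'/>
    <space-chars regexp='[ \verb!\!n\verb!\!t\verb!\!r]'/>
    <case-sensitive value="no"/>
  </options>

  <rules>
    <format-rule eos="no" priority="1">
      <begin regexp='"\&lt;!--"'/>
      <end regexp='"--\&gt;"'/>
    </format-rule>

    <format-rule eos="no" priority="2">
      <begin-end regexp='"\&lt;"[^\&gt;]+"\&gt;"'/>
    </format-rule>
  </rules>
</format>
\end{alltt}
\end{small}

Tags that close a sentence in the particular XML vocabulary would
need additional rules with \texttt{eos="yes"} and a priority between
the two shown here.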
6523 \chapter{Installing and running the system}
6524 \label{se:instalacion}
6527 \section{System requirements}
6529 The system where you want to install and run Apertium must have the
6530 following programs installed:
6532 \begin{itemize}
6533 \item \texttt{libxml2} version 2.6.17 or later (on Ubuntu you may need
6534 to install \texttt{libxml2-dev} too)
\item \texttt{xmllint} tool (usually comes with \texttt{libxml2}, but
  may be an independent package on your system, e.g. on Debian GNU/Linux)
6539 \item \texttt{xsltproc} tool (non-PowerPC users); also comes with
6540 \texttt{libxml2} but may also be an independent package in your
6541 system, as happens with the \texttt{xmllint} tool
6543 \item \texttt{sabcmd} tool (PowerPC users), provided by package
6544 \texttt{sablotron}
6546 \item flex 2.5.4 or earlier (in some distributions, flex-old package)
6547 \item GNU \texttt{make}, \texttt{gcc} (\texttt{g++}), \texttt{bash}
6548 shell
6550 \end{itemize}
6552 \section{Installing program packages}
6554 To install the Apertium machine translation system programs and
6555 libraries first you need to download (from
6556 \url{http://sourceforge.net/projects/apertium}), compile and install
6557 the latest version of the following packages, in the specified order:
6559 \begin{enumerate}
6560 \item \texttt{lttoolbox}
6561 \item \texttt{apertium}
6562 \end{enumerate}
6564 The simplest way to compile each package is:
6566 \begin{enumerate}
6567 \item Go to the directory containing the package's source code and
6568 type \texttt{./configure} to configure the package for your system.
6569 If you're using csh on an old version of System V, you might need to
6570 type \texttt{sh ./configure} instead to prevent \texttt{csh} (the
6571 default shell in old System V) from trying to execute
6572 \texttt{configure} itself. Running \texttt{configure} takes a
6573 while. While running, it prints some messages telling which features
6574 it is checking for.
6576 \item Type \texttt{make} to compile the package
6578 \item Type \texttt{make install} (possibly with root privileges) to
6579 install the programs and any data files and documentation.
6581 \item You can remove the program binaries and object files from the
6582 source code directory by typing \texttt{make clean}. To remove also
6583 the files that \texttt{configure} created (so you can compile the
6584 package for a different kind of computer), type \texttt{make
6585 distclean}. There is also a\\ \texttt{maintainer-clean} option in
6586 the Makefile, but that is intended mainly for the package's
6587 developers. If you use it, you may have to get all sorts of other
6588 programs in order to regenerate files that came with the
6589 distribution.
6590 \end{enumerate}
If you don't have root privileges to install the programs in your
system, you can use the \verb!--prefix! flag with the
\texttt{configure} script to install them under your user account. For
example:
6596 \begin{small}
6597 \begin{alltt}
6598 \verb!$! pwd
6599 /home/me/lttoolbox-0.9.1
6600 \verb!$! ./configure --prefix=/home/me/myinstall
6601 \end{alltt}
6602 \end{small}
Libraries will be installed in the \texttt{LIBDIR=\$prefix/lib}
directory. If no \verb!--prefix! flag is passed to the
\texttt{configure} script, \texttt{LIBDIR} will be
\texttt{/usr/local/lib}.
If you get an error when linking against libraries installed in a
given directory \texttt{LIBDIR}, you must either use libtool, and
specify the full pathname of the library, or use the \verb!-LLIBDIR!
flag during linking and do at least one of the following:
6614 \begin{itemize}
6616 \item add \verb!LIBDIR! to the \verb!LD_LIBRARY_PATH! environment
6617 variable during execution
6619 \item add \verb!LIBDIR! to the \verb!LD_RUN_PATH! environment variable
6620 during linking
\item use the \texttt{-Wl,-{}-rpath -Wl,LIBDIR}
linker flag
6625 \item have your system administrator add \texttt{LIBDIR} to
6626 \texttt{/etc/ld.so.conf} and run \texttt{ldconfig}
6628 \end{itemize}
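
For instance, assuming the libraries were installed with the
illustrative prefix \texttt{/home/me/myinstall} (so that
\texttt{LIBDIR} is \texttt{/home/me/myinstall/lib}), the first and
third options could be tried as sketched below in an
\texttt{sh}-compatible shell. The source file \texttt{prog.cc} and the
library name \texttt{NAME} are hypothetical placeholders, not names
taken from the Apertium packages.

\begin{small}
\begin{alltt}
# Option 1: make the library visible at run time
\$ LD_LIBRARY_PATH=/home/me/myinstall/lib:\$LD_LIBRARY_PATH
\$ export LD_LIBRARY_PATH

# Option 3: record the library directory in the program when linking
\$ g++ prog.cc -L/home/me/myinstall/lib \bs
      -Wl,--rpath -Wl,/home/me/myinstall/lib -lNAME
\end{alltt}
\end{small}

Here \texttt{\bs} stands for a backslash at the end of the line (shell
line continuation); define it, for example, with
\verb!\newcommand{\bs}{\textbackslash}! if no such macro is available.
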
6630 See any operating system documentation about shared libraries for more
6631 information, such as the \texttt{ld(1)} and \texttt{ld.so(8)} manual
6632 pages.
6634 \section{Installing data packages}
6636 To install the linguistic data packages, follow these steps:
6638 \begin{enumerate}
6640 \item Download a data package
6641 (\texttt{apertium-}$LANG_1$\texttt{-}$LANG_2$\texttt{-}$VERSION$\texttt{.tar.gz})
from Apertium's website on SourceForge
6643 (\url{http://apertium.sourceforge.net/}). For example, to get version
6644 0.9 of the linguistic data for the Spanish--Catalan translator, you
6645 need to download the package \texttt{apertium-es-ca-0.9.tar.gz}.
6647 \item Unpack the tarball in any directory, go to this directory and
6648 type \texttt{make} in the terminal. Wait while linguistic data are
6649 compiled.
6652 \end{enumerate}
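
For the Spanish--Catalan example, the two steps might look like this.
The name of the unpacked directory is an assumption (it is taken to
match the package name), so check the name that \texttt{tar} actually
creates.

\begin{small}
\begin{alltt}
\$ tar xzf apertium-es-ca-0.9.tar.gz
\$ cd apertium-es-ca-0.9
\$ make
\end{alltt}
\end{small}
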
6655 \section{Using the translator}
Apertium versions exist for both Linux systems (always the
more up-to-date) and Windows systems. The information in this
section is intended for Linux users.
To run the translator, use the
\texttt{apertium-translator} tool, giving it the directory where the
linguistic data are saved, the translation direction
(\texttt{es-ca}, \texttt{ca-es}, \texttt{es-gl}, etc.), the file
format (\texttt{txt}, \texttt{html}, \texttt{rtf}), the name of the
file to be translated and the name of the output file. The command
structure is as follows:
6671 \begin{small}
6672 \begin{alltt}
6673 \$ apertium-translator <directory> <translation> <format> \\
6674 < input_file > output_file
6675 \end{alltt}
6676 \end{small}
6679 For example, if your directory is \texttt{/home/maria/apertium-es-ca},
6680 you have to type the following to translate a file in \texttt{txt}
6681 format from Spanish to Catalan:
6683 \begin{small}
6684 \begin{alltt}
6685 \$ apertium-translator /home/maria/apertium-es-ca es-ca \\txt <file_sp >file_ca
6686 \end{alltt}
6687 \end{small}
6689 It is recommended to go to the directory where linguistic data are
6690 saved, because this way you only need to type a dot to refer to the
6691 current directory:
6693 \begin{small}
6694 \begin{alltt}
6695 \$ apertium-translator . es-ca txt <file_sp >file_ca
6696 \end{alltt}
6697 \end{small}
If no format is specified, the default format is \texttt{txt}. When
working with the \texttt{txt}, \texttt{html} and \texttt{rtf} formats,
unknown words are marked with an asterisk (*) and errors with a symbol
(@, \# or /); if you want neither unknown words nor errors to be
marked, add a \texttt{u} to the format name. The format options
are therefore the following:
6706 \begin{itemize}
6707 \item \texttt{txt} : Default option, text with marks for unknown words
6708 and errors
6710 \item \texttt{txtu} : text without marks for unknown words and errors
6712 \item \texttt{html} : HTML with marks for unknown words and errors
6714 \item \texttt{htmlu} : HTML without marks for unknown words and errors
6716 \item \texttt{rtf} : RTF with marks for unknown words and errors
6718 \item \texttt{rtfu} : RTF without marks for unknown words and errors
6720 \end{itemize}
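
For example, following the command structure given above, an HTML file
could be translated from Catalan to Spanish without any marks for
unknown words or errors as sketched here. The file names are
hypothetical, and the command is assumed to be run from the directory
where the linguistic data are saved.

\begin{small}
\begin{alltt}
\$ apertium-translator . ca-es htmlu <page_ca.html >page_es.html
\end{alltt}
\end{small}
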
If you do not wish to translate a file but just a sentence or a
paragraph typed at the terminal, you can run the \texttt{apertium-translator}
tool without specifying any file name. If you are in the
directory where the linguistic data are saved, the command is the following:
6727 \begin{small}
6728 \begin{alltt}
6729 \$ apertium-translator . es-ca
6730 \end{alltt}
6731 \end{small}
6733 Then, you have to type or paste the text you wish to translate (it can
6734 contain line breaks). To get the translated version, press Ctrl +
6735 D. The translation will be displayed on the screen.
6737 A third way of translating with Apertium is using the \texttt{echo}
6738 command to send text through the translator:
6740 \begin{small}
6741 \begin{alltt}
6742 \$ echo "text to be translated" | apertium-translator . es-ca
6743 \end{alltt}
6744 \end{small}
6748 \chapter{Maintaining linguistic data}
6749 \label{se:datosling}
6751 \notavisible{Perhaps one could integrate material from Fran Tyers' howto as found in Apertium Wiki}
6752 \section[Description of current data]{Description of linguistic data
6753 currently available}
At present, Apertium has linguistic data for two language pairs
\nota{MG: This is old, needs UPDATING}: Spanish--Catalan and
Spanish--Galician. The files containing the linguistic data for each pair are saved
6758 in a single directory: \texttt{apertium-es-ca} for the pair
6759 Spanish--Catalan and \texttt{apertium-es-gl} for the pair
6760 Spanish--Galician. The names of the files in this directory have the
6761 following structure:
6763 \begin{itemize}\setlength{\itemsep}{-\parsep}
6764 \item \texttt{apertium-PAIR.LANG.dix} : monolingual dictionary for
6765 LANG.
6766 \item \texttt{apertium-PAIR.LANG1-LANG2.dix} :
6767 \texttt{LANG1-LANG2} bilingual dictionary.
6768 \item \texttt{apertium-PAIR.trules-LANG1-LANG2.xml} : structural
6769 transfer rules for the translation from \texttt{LANG1} to
6770 \texttt{LANG2} .
6771 \item \texttt{apertium-PAIR.LANG.tsx} : tagger definition file for
6772 \texttt{LANG}.
6773 \item \texttt{apertium-PAIR.post-LANG.dix} : Post-generation
6774 dictionary for \texttt{LANG} (applies when translating into
6775 \texttt{LANG}).
6776 \item directory \texttt{LANG-tagger-data} : contains data needed
6777 for the \texttt{LANG} tagger (corpora, etc.)
6779 \end{itemize}
\texttt{apertium-PAIR} identifies the language pair of the
translator. Its two possible values at the moment are
\texttt{apertium-es-ca} and \\ \texttt{apertium-es-gl}. According to
6784 this structure, the Catalan monolingual dictionary is called
6785 \texttt{apertium-es-ca.ca.dix}, the Spanish--Galician bilingual
6786 dictionary is called \texttt{apertium-es-gl.es-gl.dix} and the
6787 structural transfer rules file for the translation from Catalan into
6788 Spanish is called \texttt{apertium-es-ca.trules-ca-es.xml}.
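
As an illustration, applying this naming scheme to the
Spanish--Galician pair gives the names below. The list is derived
mechanically from the patterns above, not from an actual directory
listing, so a real package may contain further files.

\begin{small}
\begin{alltt}
apertium-es-gl.es.dix             (Spanish monolingual dictionary)
apertium-es-gl.gl.dix             (Galician monolingual dictionary)
apertium-es-gl.es-gl.dix          (bilingual dictionary)
apertium-es-gl.trules-es-gl.xml   (transfer rules, Spanish to Galician)
apertium-es-gl.trules-gl-es.xml   (transfer rules, Galician to Spanish)
apertium-es-gl.es.tsx             (Spanish tagger definition)
apertium-es-gl.gl.tsx             (Galician tagger definition)
apertium-es-gl.post-es.dix        (Spanish post-generation dictionary)
apertium-es-gl.post-gl.dix        (Galician post-generation dictionary)
es-tagger-data/                   (data for the Spanish tagger)
gl-tagger-data/                   (data for the Galician tagger)
\end{alltt}
\end{small}
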
The linguistic data available (as of January 2006) for the different
language pairs are summarized in the following table.
6793 \begin{small}
6794 \begin{center}
\begin{tabular}{|p{8cm}|p{5cm}|} \hline
\multicolumn{2}{|c|}{\textbf{Translator Apertium-es-ca}} \\ \hline
Spanish monolingual dictionary & 11,800 entries \\
Catalan monolingual dictionary & 11,800 entries \\
Spanish--Catalan bilingual dictionary & 12,800 entries (correspondences \texttt{es-ca})\\
Structural transfer rules from Spanish into Catalan & 44 rules \\
Structural transfer rules from Catalan into Spanish & 58 rules \\
Spanish post-generation dictionary & 25 entries and 5 paradigms\\
Catalan post-generation dictionary & 16 entries and 57 paradigms\\ \hline
\multicolumn{2}{|c|}{\textbf{Translator Apertium-es-gl}} \\ \hline
Spanish monolingual dictionary & 9,000 entries \\
Galician monolingual dictionary & 8,600 entries \\
Spanish--Galician bilingual dictionary & 8,500 entries (correspondences \texttt{es-gl})\\
Structural transfer rules from Spanish into Galician & 46 rules \\
Structural transfer rules from Galician into Spanish & 38 rules \\
Spanish post-generation dictionary & 36 entries and 12 paradigms\\
Galician post-generation dictionary & 74 entries and 48 paradigms\\ \hline
\end{tabular}
6813 \end{center}
6814 \end{small}
6817 \section[Adding words to dictionaries]{Adding words to monolingual and
6818 bilingual dictionaries}
When extending or adapting Apertium, the operation you are most likely
to perform is extending its dictionaries. In fact, this will
be far more common than adding transfer or post-generation rules.

We describe next the most important things one has to take into
account when adding new words to the translator. This information is
less detailed than the description of the
dictionaries in section \ref{ss:diccionarios}, but it includes
some practical advice that may be very useful to users who
decide to make changes in the translator.
IMPORTANT: Every time a set
of modifications is made to any of the dictionaries, the modules have
to be recompiled. Type \texttt{make} in the directory where the linguistic data
are saved (\texttt{apertium-es-ca}, \texttt{apertium-es-gl} or whichever applies)
so that the system generates the new binary files.
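
A minimal sketch of this edit-and-recompile cycle is shown below. It
assumes that the Spanish--Catalan data are in the illustrative
directory \texttt{/home/maria/apertium-es-ca} and that the dictionaries
have just been edited. The quoted text is a placeholder for any
sentence containing the new word.

\begin{small}
\begin{alltt}
\$ cd /home/maria/apertium-es-ca
\$ make
\$ echo "text containing the new word" | apertium-translator . es-ca
\end{alltt}
\end{small}

If the new word still comes out marked with an asterisk, the
translator still treats it as unknown, and the entries just added
should be checked.
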
6838 If you want to add a new word to Apertium, you need to add three
6839 entries in the dictionaries. Suppose you are working with the
6840 Spanish-Catalan pair. In this case, you have to add:
6842 \begin{enumerate}
6843 \item an entry in the Spanish monolingual dictionary: so that the
6844 translator can analyze ("understand") the word when it finds it in a
6845 text, and generate it when translating this word into Spanish.
6847 \item an entry in the bilingual dictionary: so that you can tell
6848 Apertium how to translate this word from one language to the other.
6850 \item an entry in the Catalan monolingual dictionary: so that the
6851 translator can analyze ("understand") the word when it finds it in a
6852 text, and generate it when translating this word into Catalan.
6853 \end{enumerate}
You will need to go to the directory containing the XML dictionaries
(for the Spanish--Catalan pair, this is \texttt{apertium-es-ca}) and
open the three dictionary files mentioned with a text editor or a
specialized XML editor: \texttt{apertium-es-ca.es.dix},
\texttt{apertium-es-ca.es-ca.dix} and
\texttt{apertium-es-ca.ca.dix}. The entries you need to create in
these three dictionaries share a common structure. \\
6863 \textbf{Monolingual dictionary (Spanish)}
6866 You may want, for example, to add the Spanish adjective
6867 \emph{cósmico}, whose equivalent in Catalan is \emph{còsmic}. The
6868 first step is to add this word to the Spanish monolingual dictionary.
You will see that a monolingual dictionary has basically two types of
data: \textbf{paradigms} (in the "\texttt{<pardefs>}" section of the
dictionary, each paradigm inside a \texttt{<pardef>} element) and
\textbf{word entries} (in the main \texttt{<section>} of the
dictionary, each one inside an \texttt{<e>} element). Word entries
consist of a lemma (that is, the word as you would find it in a
typical paper dictionary) plus grammatical information; paradigms
contain the inflection data of all lemmas in the dictionary. You can
search for a particular word by searching for the string \texttt{lm="word"}
(\texttt{lm} meaning \emph{lemma}). Bear in mind, however, that the
attribute \texttt{lm} is optional and some other dictionaries may not
contain it.
6883 Look at the word entries in the Spanish monolingual dictionary, for
6884 example at the entry for the adjective \emph{bonito}. You can find it
6885 by searching \texttt{lm="bonito"}:
6887 \begin{small}
6888 \begin{alltt}
6889 <\textbf{e} \textsl{lm}="bonito">
6890 <\textbf{i}>bonit</\textbf{i}>
6891 <\textbf{par} \textsl{n}="absolut/o__adj"/>
6892 </\textbf{e}>
6893 \end{alltt}
6894 \end{small}
To add a word, you will have to create an entry with the same
structure. The part between \texttt{<i>} and \texttt{</i>} contains
the initial part of the word that is common to all inflected forms, and the
element \texttt{<par>} refers to the inflection paradigm of this
word. Therefore, this entry means that the adjective \emph{bonito}
inflects like the adjective \emph{absoluto} and has the same
morphological analysis: the forms \emph{bonit\textbf{o}},
\emph{bonit\textbf{a}}, \emph{bonit\textbf{os}},
\emph{bonit\textbf{as}} are equivalent to the forms
\emph{absolut\textbf{o}}, \emph{absolut\textbf{a}},
\emph{absolut\textbf{os}}, \emph{absolut\textbf{as}} and have the
morphological analyses \texttt{adj m sg}, \texttt{adj f sg},
\texttt{adj m pl} and \texttt{adj f pl}, respectively.
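
To make the link between an entry and its paradigm more concrete, the
following is a simplified sketch of what the paradigm
\texttt{absolut/o\_\_adj} could look like. It is reconstructed from
the four forms and analyses listed above, not copied from the
dictionary, so the actual definition may contain further entries. Each
entry pairs an ending (\texttt{<l>}) with the rest of the lemma and
the grammatical symbols (\texttt{<r>}).

\begin{small}
\begin{alltt}
<\textbf{pardef} \textsl{n}="absolut/o__adj">
  <\textbf{e}><\textbf{p}><\textbf{l}>o</\textbf{l}><\textbf{r}>o<\textbf{s} \textsl{n}="adj"/><\textbf{s} \textsl{n}="m"/><\textbf{s} \textsl{n}="sg"/></\textbf{r}></\textbf{p}></\textbf{e}>
  <\textbf{e}><\textbf{p}><\textbf{l}>a</\textbf{l}><\textbf{r}>o<\textbf{s} \textsl{n}="adj"/><\textbf{s} \textsl{n}="f"/><\textbf{s} \textsl{n}="sg"/></\textbf{r}></\textbf{p}></\textbf{e}>
  <\textbf{e}><\textbf{p}><\textbf{l}>os</\textbf{l}><\textbf{r}>o<\textbf{s} \textsl{n}="adj"/><\textbf{s} \textsl{n}="m"/><\textbf{s} \textsl{n}="pl"/></\textbf{r}></\textbf{p}></\textbf{e}>
  <\textbf{e}><\textbf{p}><\textbf{l}>as</\textbf{l}><\textbf{r}>o<\textbf{s} \textsl{n}="adj"/><\textbf{s} \textsl{n}="f"/><\textbf{s} \textsl{n}="pl"/></\textbf{r}></\textbf{p}></\textbf{e}>
</\textbf{pardef}>
\end{alltt}
\end{small}

Under this sketch, the form \emph{bonitas} is split into the part
\emph{bonit} from the entry and the ending \emph{as} from the
paradigm, which yields the lemma \emph{bonito} with the analysis
\texttt{adj f pl}.
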
Now, you have to determine the lexical category of the word you
want to add: the word \emph{cósmico} is an adjective, like
\emph{bonito}. Next, you have to find the appropriate paradigm for
this adjective. Is it the same as the one for \emph{bonito} and
\emph{absoluto}? Can you say \emph{cósmic\textbf{o}},
\emph{cósmic\textbf{a}}, \emph{cósmic\textbf{os}},
\emph{cósmic\textbf{as}}? The answer is yes, and, with all this
information, you can now create the correct entry:
6919 \begin{small}
6920 \begin{alltt}
6921 <\textbf{e} \textsl{lm}="cósmico">
6922 <\textbf{i}>cósmic</\textbf{i}>
6923 <\textbf{par} \textsl{n}="absolut/o__adj"/>
6924 </\textbf{e}>
6925 \end{alltt}
6926 \end{small}
6929 If the word you want to add has a different paradigm, you have to find
6930 it in the dictionary and assign it to the entry. You have two ways to
6931 find the appropriate paradigm: looking in the \texttt{<pardefs>}
6932 section of the dictionary, where all the paradigms are defined inside
6933 a \texttt{<pardef>} element, or finding another word that you think
6934 may already exist in the dictionary and that has the same inflection
6935 paradigm as the one to be added. For example, if you want to add the
6936 word \emph{genoma}, you need to find an appropriate paradigm for a
6937 \textbf{noun} whose gender is masculine and forms the plural with the
6938 addition of an \textbf{-s}. This will be the paradigm
6939 "\texttt{abismo\_\_n}" in our present dictionaries. Therefore, the
6940 entry for this new word would be:
6942 \begin{small}
6943 \begin{alltt}
6944 <\textbf{e} \textsl{lm}="genoma">
6945 <\textbf{i}>genoma</\textbf{i}>
6946 <\textbf{par} \textsl{n}="abismo__n"/>
6947 </\textbf{e}>
6948 \end{alltt}
6949 \end{small}
6951 In exceptional cases you will need to create a new paradigm for a
6952 certain word. You can look at the structure of other paradigms and
6953 create one accordingly. For a more detailed description of paradigms
6954 and word entries in the dictionaries, refer to section
6955 \ref{ss:diccionarios}. \\
6957 \textbf{Monolingual dictionary (Catalan)}
6959 Once you have added the word to one monolingual dictionary, you have
6960 to do the same to the other monolingual dictionary of the translation
6961 pair (in our example, the Catalan monolingual dictionary) using the
6962 same structure. The result would be:
6964 \begin{small}
6965 \begin{alltt}
6966 <\textbf{e} \textsl{lm}="còsmic">
6967 <\textbf{i}>còsmi</\textbf{i}>
6968 <\textbf{par} \textsl{n}="acadèmi/c__adj"/>
6969 </\textbf{e}>
6970 \end{alltt}
6971 \end{small}
6973 \textbf{Monolingual dictionary (Galician)}
If you are trying to improve the XML dictionaries for the
Spanish--Galician pair, you will need to go to the directory
\texttt{apertium-es-gl} and open the three dictionary files
\texttt{apertium-es-gl.es.dix},
\texttt{apertium-es-gl.es-gl.dix} and
\texttt{apertium-es-gl.gl.dix} with a text editor or a specialized
XML editor. In that case, once you have added the
new Spanish word \emph{genoma} to the Spanish monolingual dictionary
(\texttt{apertium-es-gl.es.dix}), you have to add the equivalent
Galician word \emph{xenoma} to the Galician monolingual dictionary
(\texttt{apertium-es-gl.gl.dix}), that is:
6986 \begin{small}
6987 \begin{alltt}
6988 <\textbf{e} \textsl{lm}="xenoma">
6989 <\textbf{i}>xenoma</\textbf{i}>
6990 <\textbf{par} \textsl{n}="Xulio__n"/>
6991 </\textbf{e}>
6992 \end{alltt}
6993 \end{small}
6995 \textbf{Bilingual dictionary}
6997 The last step is to add the translation to the bilingual dictionary.
A bilingual dictionary does not usually have paradigms, only
lemmas. An entry contains only the lemma in both languages and the
first grammatical symbol (the lexical category) of each one. Entries
have a left side (\texttt{<l>}) and a right side (\texttt{<r>}), and
each language must always be in the same position: in our system, it
has been agreed that Spanish occupies the left side, and Catalan,
Galician and Portuguese the right side.
7008 With the addition of the lemma of both words, the system will
7009 translate all their inflected forms (the grammatical symbols are
7010 copied from the source language word to the target language
7011 word). This will only work if the source language word and the target
7012 language word are grammatically equivalent, that is, if they share
7013 exactly the same grammatical symbols for all of their inflected
7014 forms. This is the case with our example; therefore, the entry you
7015 have to add to the bilingual dictionary is:
7018 \begin{small}
7019 \begin{alltt}
7020 <\textbf{e}>
7021 <\textbf{p}>
7022 <\textbf{l}>cósmico<\textbf{s} \textsl{n}="adj"/></\textbf{l}>
7023 <\textbf{r}>còsmic<\textbf{s} \textsl{n}="adj"/></\textbf{r}>
7024 </\textbf{p}>
7025 </\textbf{e}>
7026 \end{alltt}
7027 \end{small}
7029 This entry will translate all the inflected forms, that is,
7030 \texttt{adj m sg}, \texttt{adj f sg}, \texttt{adj m pl} and
7031 \texttt{adj f pl}. It works for the translation in both directions:
7032 from Spanish to Catalan and from Catalan to Spanish.
7034 In the case of the Spanish-Galician pair, the following bilingual
7035 entry in the Spanish-Galician bilingual dictionary
7036 (\texttt{apertium-es-gl.es-gl.dix}) will translate all the inflected
7037 forms for the equivalent words \emph{genoma}/\emph{xenoma} in both
7038 directions:
7040 \begin{small}
7041 \begin{alltt}
7042 <\textbf{e}>
7043 <\textbf{p}>
7044 <\textbf{l}>genoma<\textbf{s} \textsl{n}="n"/></\textbf{l}>
7045 <\textbf{r}>xenoma<\textbf{s} \textsl{n}="n"/></\textbf{r}>
7046 </\textbf{p}>
7047 </\textbf{e}>
7048 \end{alltt}
7049 \end{small}
What should you do if the two words are not grammatically equivalent (that is, their
grammatical symbols are not exactly the same)? In that case, you need
7053 to specify all the grammatical symbols (in the same order as they are
7054 specified in the monolingual dictionaries) until you reach the symbol
7055 that differs between the source language word and the target language
7056 word. For example, the Spanish noun \emph{limón} has masculine gender
7057 and its equivalent in Catalan, \emph{llimona}, has feminine
7058 gender. The entry in the bilingual dictionary must be as follows:
7060 \begin{small}
7061 \begin{alltt}
7062 <\textbf{e}>
7063 <\textbf{p}>
7064 <\textbf{l}>limón<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="m"/></\textbf{l}>
7065 <\textbf{r}>llimona<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="f"/></\textbf{r}>
7066 </\textbf{p}>
7067 </\textbf{e}>
7068 \end{alltt}
7069 \end{small}
7072 A more difficult problem arises when two words have different
7073 grammatical symbols and the grammatical information of the source
7074 language word is not enough to determine the gender (masculine or
7075 feminine) or the number (singular or plural) of the target language
7076 word. Take for example the Spanish adjective \emph{canadiense}. Its
7077 gender is masculine--feminine since it is invariable in gender, that
7078 is, it can go both with masculine and feminine nouns (\emph{hombre
7079 canadiense}, \emph{mujer canadiense}). In Catalan, on the other hand,
7080 the adjective has a different inflection for the masculine and the
7081 feminine (\emph{home canadenc}, \emph{dona canadenca}). Therefore,
7082 when translating from Spanish to Catalan it is not possible to know,
7083 without looking at the accompanying noun, whether the Spanish
7084 adjective (\emph{mf}) has to be translated as a feminine or a
7085 masculine adjective in Catalan. In that case, the symbol \texttt{GD}
7086 (for "gender to be determined") is used instead of the gender
7087 symbol. \label{GDND} The word's gender will be determined by the
7088 structural transfer module, by means of a transfer rule (a rule that
7089 detects the gender of the preceding noun in this particular
7090 case). Therefore, \texttt{GD} must be used only when translating from
7091 Spanish to Catalan, but not when translating from Catalan to Spanish,
7092 as in Spanish the gender will always be \texttt{mf} regardless of the
7093 gender of the original word. In the bilingual dictionary you will
7094 need to add, in this case, more than one entry with direction
7095 indications, as you must specify in which translation direction the
7096 gender remains undetermined. The entries for this adjective should be
7097 as follows:
7099 \begin{small}
7100 \begin{alltt}
7101 <\textbf{e} \textsl{r}="LR">
7102 <\textbf{p}>
7103 <\textbf{l}>canadiense<\textbf{s} \textsl{n}="adj"/><\textbf{s} \textsl{n}="mf"/></\textbf{l}>
7104 <\textbf{r}>canadenc<\textbf{s} \textsl{n}="adj"/><\textbf{s} \textsl{n}="GD"/></\textbf{r}>
7105 </\textbf{p}>
7106 </\textbf{e}>
7107 <\textbf{e} \textsl{r}="RL">
7108 <\textbf{p}>
7109 <\textbf{l}>canadiense<\textbf{s} \textsl{n}="adj"/><\textbf{s} \textsl{n}="mf"/></\textbf{l}>
7110 <\textbf{r}>canadenc<\textbf{s} \textsl{n}="adj"/><\textbf{s} \textsl{n}="f"/></\textbf{r}>
7111 </\textbf{p}>
7112 </\textbf{e}>
7113 <\textbf{e} \textsl{r}="RL">
7114 <\textbf{p}>
7115 <\textbf{l}>canadiense<\textbf{s} \textsl{n}="adj"/><\textbf{s} \textsl{n}="mf"/></\textbf{l}>
7116 <\textbf{r}>canadenc<\textbf{s} \textsl{n}="adj"/><\textbf{s} \textsl{n}="m"/></\textbf{r}>
7117 </\textbf{p}>
7118 </\textbf{e}>
7119 \end{alltt}
7120 \end{small}
7122 "\texttt{LR}" means \emph{left to right} and "\texttt{RL}",
7123 \emph{right to left}. Since Spanish is on the left and Catalan on the
7124 right, the adjective will be \texttt{GD} only when translating from
7125 Spanish to Catalan (\texttt{LR}). For the translation \texttt{RL} you
need to create two entries, one for the feminine adjective and
another one for the masculine adjective.\footnote{You could also
group them using a small paradigm.}
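
Summarized informally, the three entries for \emph{canadiense} amount
to the correspondences below. This is only an illustrative reading of
the entries, not output produced by any Apertium tool.

\begin{small}
\begin{alltt}
canadiense adj mf  -->  canadenc adj GD
canadiense adj mf  <--  canadenc adj f
canadiense adj mf  <--  canadenc adj m
\end{alltt}
\end{small}

The \texttt{GD} symbol in the first line is the one that the
structural transfer module later replaces with a definite gender.
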
7130 The same principle applies when it is not possible to determine the
7131 number of the target word for the same reasons mentioned above. For
7132 example, the Spanish noun \emph{rascacielos} ("skyscraper") is
7133 invariable in number, that is, it can be singular as well as plural
7134 (\emph{un rascacielos}, \emph{dos rascacielos}). In Catalan, on the
7135 other hand, the noun has a different inflection for the singular and
7136 for the plural (\emph{un gratacel}, \emph{dos gratacels}). In this
7137 case the symbol used is "\texttt{ND}" ("number to be determined") and
7138 the entries should be like this:
7141 \begin{small}
7142 \begin{alltt}
7143 <\textbf{e} \textsl{r}="LR">
7144 <\textbf{p}>
7145 <\textbf{l}>rascacielos<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="m"/><\textbf{s} \textsl{n}="sp"/></\textbf{l}>
7146 <\textbf{r}>gratacel<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="m"/><\textbf{s} \textsl{n}="ND"/></\textbf{r}>
7147 </\textbf{p}>
7148 </\textbf{e}>
7149 <\textbf{e} \textsl{r}="RL">
7150 <\textbf{p}>
7151 <\textbf{l}>rascacielos<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="m"/><\textbf{s} \textsl{n}="sp"/></\textbf{l}>
7152 <\textbf{r}>gratacel<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="m"/><\textbf{s} \textsl{n}="pl"/></\textbf{r}>
7153 </\textbf{p}>
7154 </\textbf{e}>
7155 <\textbf{e} \textsl{r}="RL">
7156 <\textbf{p}>
7157 <\textbf{l}>rascacielos<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="m"/><\textbf{s} \textsl{n}="sp"/></\textbf{l}>
7158 <\textbf{r}>gratacel<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="m"/><\textbf{s} \textsl{n}="sg"/></\textbf{r}>
7159 </\textbf{p}>
7160 </\textbf{e}>
7161 \end{alltt}
7162 \end{small}
For a more detailed description of these kinds of entries, refer to
section~\ref{ss:bil}.
7169 \subsection{Adding direction restrictions}
7171 In the previous example we have already seen the use of direction
7172 restrictions for entries with undetermined gender or number
7173 (\texttt{GD} or \texttt{ND}). These restrictions can also be used in
7174 other cases.
7176 It is important to note that the current version of Apertium can give
7177 only a single equivalent for each source-language lexical form
7178 \nota{NEEDS UPDATING, reference to lextor} (a lexical form is the
7179 lemma plus its grammatical information), that is, no word-sense
7180 disambiguation is performed.\footnote{The system performs only
7181 part-of-speech disambiguation for homograph words, that is, for
7182 ambiguous words that can be analyzed as more than one lexical form,
7183 like \emph{vino} in Spanish, that can mean both "wine" and "he/she
7184 came". This type of disambiguation is performed by the tagger.} When a
7185 lexical form can be translated in two or more different ways, one has
7186 to be chosen (the most general, the most frequent, etc.). You can
7187 tell Apertium that a certain word has to be analyzed ("understood")
7188 but not generated, as it is not the translation of any word in the
7189 other language.
7191 Let's see this with an example. The Spanish noun \emph{muñeca} can be
7192 translated in two different ways in Catalan depending on its meaning:
7193 \emph{canell} ("wrist") or \emph{nina} ("doll"). The context decides
7194 which translation is the correct one, but in its present state
7195 Apertium can not make such a decision .\footnote{See Section
7196 \ref{multi} on multiword units for ways to circumvent this problem.}
7197 Therefore, you have to decide which word you want as an equivalent
7198 when translating from Spanish to Catalan. From Catalan to Spanish,
7199 both words can be translated as \emph{muñeca} without any problem. You
7200 have to specify all these circumstances in the dictionary entries
7201 using direction restrictions (\texttt{LR} meaning "left to right",
7202 that is, \texttt{es}--\texttt{ca}, and \texttt{RL} meaning "right to
7203 left", that is, \texttt{ca}--\texttt{es}). If you decide to translate
\emph{muñeca} as \emph{canell} in all cases, the entries in the
bilingual dictionary should be:
7208 \begin{small}
7209 \begin{alltt}
7210 <\textbf{e}>
7211 <\textbf{p}>
7212 <\textbf{l}>muñeca<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="f"/></\textbf{l}>
7213 <\textbf{r}>canell<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="m"/></\textbf{r}>
7214 </\textbf{p}>
7215 </\textbf{e}>
7217 <\textbf{e} \textsl{r}="RL">
7218 <\textbf{p}>
7219 <\textbf{l}>muñeca<\textbf{s} \textsl{n}="n"/></\textbf{l}>
7220 <\textbf{r}>nina<\textbf{s} \textsl{n}="n"/></\textbf{r}>
7221 </\textbf{p}>
7222 </\textbf{e}>
7223 \end{alltt}
7224 \end{small}
7226 This means that translation directions will be:
7227 \begin{small}
7228 \begin{alltt}
7229 muñeca --> canell
7230 muñeca <-- canell
7231 muñeca <-- nina
7232 \end{alltt}
7233 \end{small}
(Note that there is also a gender change in the case of
\emph{muñeca} (feminine) and \emph{canell} (masculine).)
It should be emphasized that a lemma cannot have two translations in
the target language, because the system would give an error when
translating that lemma (see Section \ref{errores} "Detecting errors"
to see how to find and correct these and other types of errors). When
a word can be translated in two different ways in the target language
in all contexts, you need to choose one as the translation equivalent
and leave the other one as a lemma that can be analyzed but not
generated, using direction restrictions as in the previous
example. For example, the Catalan lemmas \emph{mot} and \emph{paraula}
can both be translated into Spanish as \emph{palabra} ("word") and the
entries in the bilingual dictionary should look like this:
7250 \begin{small}
7251 \begin{alltt}
7252 <\textbf{e}>
7253 <\textbf{p}>
7254 <\textbf{l}>palabra<\textbf{s} \textsl{n}="n"/></\textbf{l}>
7255 <\textbf{r}>paraula<\textbf{s} \textsl{n}="n"/></\textbf{r}>
7256 </\textbf{p}>
7257 </\textbf{e}>
7259 <\textbf{e} \textsl{r}="RL">
7260 <\textbf{p}>
7261 <\textbf{l}>palabra<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="f"/></\textbf{l}>
7262 <\textbf{r}>mot<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="m"/></\textbf{r}>
7263 </\textbf{p}>
7264 </\textbf{e}>
7265 \end{alltt}
7266 \end{small}
Therefore, for these lemmas the translation directions will be:
7269 \begin{small}
7270 \begin{alltt}
7271 palabra --> paraula
7272 palabra <-- paraula
7273 palabra <-- mot
7274 \end{alltt}
7275 \end{small}
Direction restrictions may also have to be specified
in monolingual dictionaries. For example, both Spanish forms
\emph{cantaran} and \emph{cantasen} should be analyzed as lemma
\emph{cantar}, verb, imperfect subjunctive, 3rd person plural, but
when generating Spanish text, one has to decide which one will be
generated. Monolingual dictionaries are read in two directions
depending on their purpose: for analysis, the reading direction is
left to right; for generation, right to left. Therefore, a word
that must be analyzed but not generated must have the restriction
\texttt{LR}, and a word that must be generated but not analyzed must
have the restriction \texttt{RL}.
The case of \emph{cantaran} and \emph{cantasen} should already be
handled in the inflection paradigms, so it is unlikely to be a
problem for most people extending a dictionary. In some other cases, however, it
may be necessary to introduce a restriction in the word entries of
monolingual dictionaries.
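
As a hedged illustration, two entries inside a verb paradigm might
handle the \emph{cantaran}/\emph{cantasen} case as sketched below. The
symbol names (\texttt{vblex}, \texttt{pis}, \texttt{p3}, \texttt{pl})
and the choice of which form is restricted are assumptions made for
this example, and the actual paradigms in the dictionaries may differ.

\begin{small}
\begin{alltt}
<\textbf{e}>
  <\textbf{p}>
    <\textbf{l}>aran</\textbf{l}>
    <\textbf{r}>ar<\textbf{s} \textsl{n}="vblex"/><\textbf{s} \textsl{n}="pis"/><\textbf{s} \textsl{n}="p3"/><\textbf{s} \textsl{n}="pl"/></\textbf{r}>
  </\textbf{p}>
</\textbf{e}>
<\textbf{e} \textsl{r}="LR">
  <\textbf{p}>
    <\textbf{l}>asen</\textbf{l}>
    <\textbf{r}>ar<\textbf{s} \textsl{n}="vblex"/><\textbf{s} \textsl{n}="pis"/><\textbf{s} \textsl{n}="p3"/><\textbf{s} \textsl{n}="pl"/></\textbf{r}>
  </\textbf{p}>
</\textbf{e}>
\end{alltt}
\end{small}

With these entries both forms would be analyzed, but only the form
ending in \emph{-aran} would be generated, because the entry for
\emph{-asen} is restricted to the analysis direction (\texttt{LR}).
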
7296 \subsection{Adding multiwords}
7297 \label{multi}
It is possible to create entries consisting of two or more words, if
these words are considered to form a single "translation unit".
These multiword units can also be useful for selecting the
correct equivalent for a word inside a fixed expression. For example,
the Spanish word \emph{dirección} may be translated into two Catalan
words: \emph{direcció} ("direction, management, directorate,
steering", etc.) and \emph{adreça} ("address"); including
frequent multiword units such as \emph{dirección general}
\(\to\) \emph{direcció general} ("general directorate") and
\emph{dirección postal} \(\to\) \emph{adreça postal} ("postal
address") may help get improved translations in some situations.
7311 Multiword units can be classified basically into two categories:
7312 multiwords with inner inflection and multiwords without inner
7313 inflection.
7315 \subsubsection{Multiwords without inner inflection}
Their entries are just like normal one-word entries, the only
difference being that you need to insert the element \texttt{<b>} (which
represents a blank) between the individual words that make up the
unit. Therefore, if you want to add, for example, the Spanish
multiword \emph{hoy en día} ("nowadays"), whose equivalent in Catalan
is \emph{avui dia}, the entries you need to add to the different
dictionaries are:
7325 \begin{itemize}
7327 \item Spanish monolingual dictionary:
7328 \begin{small}
7329 \begin{alltt}
7330 <\textbf{e} \textsl{lm}="hoy en día">
7331 <\textbf{i}>hoy<\textbf{b}/>en<\textbf{b}/>día</\textbf{i}>
7332 <\textbf{par} \textsl{n}="ahora__adv"/>
7333 </\textbf{e}>
7334 \end{alltt}
7335 \end{small}
7337 \item Catalan monolingual dictionary:
7338 \begin{small}
7339 \begin{alltt}
7340 <\textbf{e} \textsl{lm}="avui dia">
7341 <\textbf{i}>avui<\textbf{b}/>dia</\textbf{i}>
7342 <\textbf{par} \textsl{n}="ahir__adv"/>
7343 </\textbf{e}>
7344 \end{alltt}
7345 \end{small}
7347 \item Spanish-Catalan bilingual dictionary:
7348 \begin{small}
7349 \begin{alltt}
7350 <\textbf{e}>
7351 <\textbf{p}>
7352 <\textbf{l}>hoy<\textbf{b}/>en<\textbf{b}/>día<\textbf{s} \textsl{n}="adv"/></\textbf{l}>
7353 <\textbf{r}>avui<\textbf{b}/>dia<\textbf{s} \textsl{n}="adv"/></\textbf{r}>
7354 </\textbf{p}>
7355 </\textbf{e}>
7356 \end{alltt}
7357 \end{small}
7359 \end{itemize}
For the Spanish-Galician pair, if you want to add, for example, the
Spanish multiword \emph{manga por hombro} ("disarranged"), whose
equivalent in Galician is \emph{sen xeito nin modo}, the entries you
need to add are:
7366 \begin{itemize}
7368 \item Spanish monolingual dictionary:
7369 \begin{small}
7370 \begin{alltt}
7371 <\textbf{e} \textsl{lm}="manga por hombro">
7372 <\textbf{i}>manga<\textbf{b}/>por<\textbf{b}/>hombro</\textbf{i}>
7373 <\textbf{par} \textsl{n}="ahora__adv"/>
7374 </\textbf{e}>
7375 \end{alltt}
7376 \end{small}
7378 \item Galician monolingual dictionary:
7379 \begin{small}
7380 \begin{alltt}
7381 <\textbf{e} \textsl{lm}="sen xeito nin modo">
7382 <\textbf{i}>sen<\textbf{b}/>xeito<\textbf{b}/>nin<\textbf{b}/>modo</\textbf{i}>
7383 <\textbf{par} \textsl{n}="Deo_gratias__adv"/>
7384 </\textbf{e}>
7385 \end{alltt}
7386 \end{small}
7388 \item Spanish-Galician bilingual dictionary:
7389 \begin{small}
7390 \begin{alltt}
7391 <\textbf{e}>
7392 <\textbf{p}>
7393 <\textbf{l}>manga<\textbf{b}/>por<\textbf{b}/>hombro<\textbf{s} \textsl{n}="adv"/></\textbf{l}>
7394 <\textbf{r}>sen<\textbf{b}/>xeito<\textbf{b}/>nin<\textbf{b}/>modo<\textbf{s} \textsl{n}="adv"/></\textbf{r}>
7395 </\textbf{p}>
7396 </\textbf{e}>
7397 \end{alltt}
7398 \end{small}
7400 \end{itemize}
7402 \subsubsection{Brief introduction to paradigms}
Since adverbs do not inflect, the paradigms used in the previous
examples contain only the grammatical symbol of the lexical form, as
this example shows:
7408 \begin{small}
7409 \begin{alltt}
7410 <\textbf{pardef} \textsl{n}="ahora__adv">
7411 <\textbf{e}>
7412 <\textbf{p}>
7413 <\textbf{l}/>
7414 <\textbf{r}><\textbf{s} \textsl{n}="adv"/></\textbf{r}>
7415 </\textbf{p}>
7416 </\textbf{e}>
7417 </\textbf{pardef}>
7418 \end{alltt}
7419 \end{small}
Paradigms are built like lexical entries. So far we have seen lexical
entries where the common part of the lemma is put between \texttt{<i>}
and \texttt{</i>}:
7425 \begin{small}
7426 \begin{alltt}
7427 <\textbf{e} \textsl{lm}="cósmico">
7428 <\textbf{i}>cósmic</\textbf{i}>
7429 <\textbf{par} \textsl{n}="absolut/o__adj"/>
7430 </\textbf{e}>
7431 \end{alltt}
7432 \end{small}
7435 But you can also express the same with a pair of strings: a left
7436 string \texttt{<l>} and a right string \texttt{<r>} inside a
7437 \texttt{<p>} element:
7439 \begin{small}
7440 \begin{alltt}
7441 <\textbf{e} \textsl{lm}="cósmico">
7442 <\textbf{p}>
7443 <\textbf{l}>cósmic</\textbf{l}>
7444 <\textbf{r}>cósmic</\textbf{r}>
7445 </\textbf{p}>
7446 <\textbf{par} \textsl{n}="absolut/o__adj"/>
7447 </\textbf{e}>
7448 \end{alltt}
7449 \end{small}
These two entries are equivalent. The \texttt{<i>} element gives
simpler and more compact entries, and you can use it when the
left side and the right side of the string pair are identical. As has
been explained before, monolingual dictionaries are read \texttt{LR}
for the analysis of a text and \texttt{RL} for the
generation. Therefore, when there is some difference between the
analysed string and the generated string (which is not very usual), the entry
cannot be written using the \texttt{<i>} element. This is what
happens in paradigms, where the left and right strings are never
identical, since the right side must contain the grammatical symbols
that will go through all the modules of the system.
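
As a hedged illustration, a paradigm for an inflecting word could look
like the following simplified sketch of \texttt{absolut/o\_\_adj}. The
two entries shown, and the comment standing for the remaining ones, are
assumptions made for this example; the paradigm distributed with the
dictionaries may be organised differently. The left side of each entry
contains only the ending of the surface form, and the right side the
ending of the lemma followed by the grammatical symbols:

\begin{small}
\begin{alltt}
<\textbf{pardef} \textsl{n}="absolut/o__adj">
  <\textbf{e}>
    <\textbf{p}>
      <\textbf{l}>o</\textbf{l}>
      <\textbf{r}>o<\textbf{s} \textsl{n}="adj"/><\textbf{s} \textsl{n}="m"/><\textbf{s} \textsl{n}="sg"/></\textbf{r}>
    </\textbf{p}>
  </\textbf{e}>
  <\textbf{e}>
    <\textbf{p}>
      <\textbf{l}>as</\textbf{l}>
      <\textbf{r}>o<\textbf{s} \textsl{n}="adj"/><\textbf{s} \textsl{n}="f"/><\textbf{s} \textsl{n}="pl"/></\textbf{r}>
    </\textbf{p}>
  </\textbf{e}>
  <!-- further entries for the remaining forms (assumed) -->
</\textbf{pardef}>
\end{alltt}
\end{small}

Combined with the entry for \emph{cósmico} above, such a paradigm would
pair the surface form \emph{cósmicas} with the lexical form
\emph{cósmico} followed by the symbols \texttt{adj}, \texttt{f} and
\texttt{pl}.
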
7464 \subsubsection{Multiwords with inner inflection}
7467 They consist of a word that can inflect (typically a verb) followed by
7468 one or more invariable words. For these entries you need to specify
7469 the inflection paradigm just after the word that inflects. The
invariable part must be marked with the element \texttt{<g>} (for
\emph{group}) on the right side. The blanks between words are
indicated, as in the previous case, with the element
\texttt{<b>}. Look at the following example for the Spanish multiword
\emph{echar de menos} ("to miss"), translated into Catalan as
\emph{trobar a faltar}:
7477 \begin{itemize}
7479 \item Spanish monolingual dictionary:
7480 \begin{small}
7481 \begin{alltt}
7482 <\textbf{e} \textsl{lm}="echar de menos">
7483 <\textbf{i}>ech</\textbf{i}>
7484 <\textbf{par} \textsl{n}="aspir/ar__vblex"/>
7485 <\textbf{p}>
7486 <\textbf{l}><\textbf{b}/>de<\textbf{b}/>menos</\textbf{l}>
7487 <\textbf{r}><\textbf{g}><\textbf{b}/>de<\textbf{b}/>menos</\textbf{g}></\textbf{r}>
7488 </\textbf{p}>
7489 </\textbf{e}>
7490 \end{alltt}
7491 \end{small}
7493 \item Catalan monolingual dictionary:
7494 \begin{small}
7495 \begin{alltt}
7496 <\textbf{e} \textsl{lm}="trobar a faltar">
7497 <\textbf{i}>trob</\textbf{i}>
7498 <\textbf{par} \textsl{n}="abander/ar__vblex"/>
7499 <\textbf{p}>
7500 <\textbf{l}><\textbf{b}/>a<\textbf{b}/>faltar</\textbf{l}>
7501 <\textbf{r}><\textbf{g}><\textbf{b}/>a<\textbf{b}/>faltar</\textbf{g}></\textbf{r}>
7502 </\textbf{p}>
7503 </\textbf{e}>
7504 \end{alltt}
7505 \end{small}
7507 \item Spanish-Catalan bilingual dictionary:
7508 \begin{small}
7509 \begin{alltt}
7510 <\textbf{e}>
7511 <\textbf{p}>
7512 <\textbf{l}>echar<\textbf{g}><\textbf{b}/>de<\textbf{b}/>menos</\textbf{g}><\textbf{s} \textsl{n}="vblex"/></\textbf{l}>
7513 <\textbf{r}>trobar<\textbf{g}><\textbf{b}/>a<\textbf{b}/>faltar</\textbf{g}><\textbf{s} \textsl{n}="vblex"/></\textbf{r}>
7514 </\textbf{p}>
7515 </\textbf{e}>
7516 \end{alltt}
7517 \end{small}
7519 \end{itemize}
Note that the grammatical symbol is appended at the end, after the
group marked with the \texttt{<g>} element.
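
As an illustration only, and assuming the usual Apertium stream format
in which the character \texttt{\#} introduces the invariable queue of a
multiword, the Spanish form \emph{echó de menos} would be analysed with
the entry above roughly as follows. The symbols \texttt{ifi},
\texttt{p3} and \texttt{sg} are assumed here as plausible output of the
paradigm \texttt{aspir/ar\_\_vblex}; the exact symbols depend on the
dictionary:

\begin{small}
\begin{alltt}
^echó de menos/echar<vblex><ifi><p3><sg># de menos$
\end{alltt}
\end{small}

In this sketch the inflection symbols come directly after the inflected
word and the invariable group \emph{de menos} follows them, which
mirrors the order of the paradigm and the \texttt{<g>} group in the
monolingual entry.
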
7525 It can be the case that a lemma is a multiword of this kind in one
7526 language and a single word in the other language. In that case, in the
7527 bilingual dictionary, the multiword will contain the \texttt{<g>}
7528 element and the single word will not. In the monolingual dictionaries,
7529 each entry will be created according to its type. Look at the
7530 following example for the Spanish multiword \emph{darse cuenta} (to
7531 realize), translated into Catalan as the verb
7532 \emph{adonar-se}:\footnote{The verb \emph{adonar-se} is considered a
7533 simple word, since the incorporation of enclitic pronouns (such as
7534 "-se") is treated inside the inflection paradigms of verbs (for all
7535 the Romance languages of \emph{Apertium}); therefore, it is not
7536 necessary to specify them in lexical entries. The correct placement of
7537 clitic pronouns is one of the main reasons for using the
7538 \texttt{<g>}... \texttt{</g>} labels around the invariable part of
7539 multi-word verbs.}
7541 \begin{itemize}
7543 \item Spanish monolingual dictionary:
7544 \begin{small}
7545 \begin{alltt}
7546 <\textbf{e} \textsl{lm}="darse cuenta">
7547 <\textbf{i}>d</\textbf{i}>
7548 <\textbf{par} \textsl{n}="d/ar__vblex"/>
7549 <\textbf{p}>
7550 <\textbf{l}><\textbf{b}/>cuenta</\textbf{l}>
7551 <\textbf{r}><\textbf{g}><\textbf{b}/>cuenta</\textbf{g}></\textbf{r}>
7552 </\textbf{p}>
7553 </\textbf{e}>
7554 \end{alltt}
7555 \end{small}
7557 \item Catalan monolingual dictionary:
7558 \begin{small}
7559 \begin{alltt}
7560 <\textbf{e} \textsl{lm}="adonar-se">
7561 <\textbf{i}>adon</\textbf{i}>
7562 <\textbf{par} \textsl{n}="abander/ar__vblex"/>
7563 </\textbf{e}>
7564 \end{alltt}
7565 \end{small}
7567 \item Spanish-Catalan bilingual dictionary:
7568 \begin{small}
7569 \begin{alltt}
7570 <\textbf{e}>
7571 <\textbf{p}>
7572 <\textbf{l}>dar<\textbf{g}><\textbf{b}/>cuenta</\textbf{g}><\textbf{s} \textsl{n}="vblex"/></\textbf{l}>
7573 <\textbf{r}>adonar<\textbf{s} \textsl{n}="vblex"/></\textbf{r}>
7574 </\textbf{p}>
7575 </\textbf{e}>
7576 \end{alltt}
7577 \end{small}
7579 \end{itemize}
7581 The same principles and actions described for basic entries (gender
7582 and number change, direction restrictions, etc.) apply to all kinds of
7583 multiwords. For a more detailed description of multiword units, refer
7584 to section~\ref{ss:multipalabras}.
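
For instance, a direction restriction on a multiword entry of the
bilingual dictionary might be sketched as follows. The example is
hypothetical: it assumes that the Spanish variant \emph{hoy día} has
also been added to the Spanish monolingual dictionary, and it uses the
\texttt{r} attribute with which direction restrictions are expressed in
basic entries. With the value \texttt{LR}, \emph{hoy día} would be
translated into Catalan as \emph{avui dia}, but would not be chosen when
translating from Catalan into Spanish, where the unrestricted entry for
\emph{hoy en día} applies:

\begin{small}
\begin{alltt}
<\textbf{e} \textsl{r}="LR">
  <\textbf{p}>
    <\textbf{l}>hoy<\textbf{b}/>día<\textbf{s} \textsl{n}="adv"/></\textbf{l}>
    <\textbf{r}>avui<\textbf{b}/>dia<\textbf{s} \textsl{n}="adv"/></\textbf{r}>
  </\textbf{p}>
</\textbf{e}>
\end{alltt}
\end{small}
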
7586 \subsection{Consider contributing your improved lexical data}
7588 If you have successfully added general-purpose lexical data to any of
7589 the Apertium language pairs, please consider contributing it to the
7590 project so that we can offer a better toolbox to the community. You
7591 can e-mail your data (in three XML files, one for each monolingual
7592 dictionary and another one for the bilingual dictionary) to the
7593 following addresses: \\
7595 \begin{tabular}{ll}
7596 Spanish-Catalan data & Mireia Ginestí: \texttt{mginesti@dlsi.ua.es}\\
7597 Spanish-Portuguese data & Carme Armentano: \texttt{carmentano@dlsi.ua.es}\footnote{The group at the
7598 Universitat d'Alacant has also developed data for this language pair
7599 outside the present project.}\\
7600 Spanish-Galician data & Xavier Gómez-Guinovart: \texttt{xgg@uvigo.es}\\\\
7602 \end{tabular}
If you believe you are going to contribute more heavily to the
project, you can join the development team through
\url{www.sourceforge.net}. If you do not have a Sourceforge account, please
create one; then write to Mikel L. Forcada (\texttt{mlf@ua.es}) or
Sergio Ortiz (\texttt{sortiz@dlsi.ua.es}), or to Xavier
Gómez-Guinovart if you are interested in the Spanish-Galician language pair,
briefly explaining your motivations for joining the project and your
background. The usual way to contribute is to use CVS; as a project
member, you will be able to commit your changes to dictionaries
directly.
7616 The addition of simple lexical contributions will soon be made simpler
7617 by means of web forms in
7618 \url{http://xixona.dlsi.ua.es/prototype/webform/}, so that
7619 contributors do not have to deal directly with XML.
7622 You should be aware that the data you contribute to the project, once
7623 added, will be freely distributed under the current license (GNU
7624 General Public License or Creative Commons 2.5
7625 attribution-sharealike-noncommercial, as indicated). Make sure the
7626 data you contribute is not affected by any kind of license which may
7627 be incompatible with the licenses used in this project. No kind of
7628 agreement or contract is created between you and the developers. If
7629 you have any doubt, or you plan to make a massive contribution,
7630 contact Mikel L. Forcada.
7633 \section[Adding structural transfer rules]{Adding structural transfer
7634 (grammar) rules}
The content of this section partially repeats information already
presented in the chapter describing the structural transfer module
(Section \ref{ss:transfer}), although rules are described here in a
more general and practical way, aimed at those approaching them for
the first time.
Structural transfer rules carry out transformations on the analysed
and disambiguated text, which are needed because of grammatical,
7644 syntactical and lexical divergences between the two languages involved
7645 (gender and number changes to ensure agreement in the target language,
7646 word reorderings, changes in prepositions, etc.). The rules detect
7647 patterns (sequences) of source text lexical forms and apply to them
7648 the corresponding transformations. The module detects the patterns in
7649 a left-to-right, longest-match way; for example, the phrase \emph{the
7650 big cat} will be detected and processed by the rule for
7651 \emph{determiner}--\emph{adjective}--\emph{noun} and not by the rule
7652 for \emph{determiner}--\emph{adjective}, since the first pattern is
longer. If two patterns have the same length, the rule that applies is
the one defined first.
The structural transfer module (generated from the structural transfer
rules file) calls the lexical transfer module (generated from the
bilingual dictionary) throughout the process to determine the
target-language equivalents of the source-language lexical forms.
The structural transfer rules are contained in an XML file, one for
each translation direction (for example, for the translation from
Spanish to Catalan, the file is
\texttt{apertium-es-ca.trules-es-ca.xml}). You need to edit this file
if you want to add or change transfer rules.
7667 Rules have a \textbf{pattern} and an \textbf{action} part. The pattern
7668 specifies which sequences of lexical forms have to be detected and
7669 processed. The action describes the verifications and transformations
7670 that need to be done on its constituents. Usual transformation
7671 operations (such as gender and number agreement) are defined inside a
7672 macroinstruction which is called inside the rule. At the end of the
7673 action part of the rule, the resulting lexical forms in the target
7674 language are sent out so that they are processed by the next modules
7675 in the translation system.
7677 A transfer rules file contains four sections with definitions of
7678 elements used in the rules, and a fifth section where the actual rules
7679 are defined. The sections are the following:
7681 \begin{itemize}
7683 \item \texttt{<section-def-cats>}: This section contains the
7684 definition of the categories which are to be used in the rule
7685 patterns (that is, the type of lexical forms that will be detected
7686 by a certain rule). For the rule presented below, the categories
7687 \texttt{det} and \texttt{nom} (determiner and noun) need to be
7688 defined here. Categories are defined specifying the grammatical
7689 symbols that the lexical forms have. An asterisk indicates that one
7690 or more grammatical symbols follow the ones specified. The following
7691 is the definition of the category \texttt{det}, which groups
7692 determiners and predeterminers\footnote{such as in Spanish
7693 \emph{todo}, \emph{toda}, \emph{todos}, \emph{todas}} in the same
7694 category since they play the same role for transfer purposes:
7696 \begin{small}
7697 \begin{alltt}
7698 <\textbf{def-cat} \textsl{n}="det">
7699 <\textbf{cat-item} \textsl{tags}="det.*"/>
7700 <\textbf{cat-item} \textsl{tags}="predet.*"/>
7701 </\textbf{def-cat}>
7702 \end{alltt}
7703 \end{small}
7705 It is also possible to define as a category a certain lemma, like the
7706 following for the preposition \texttt{en}:
7708 \begin{small}
7709 \begin{alltt}
7710 <\textbf{def-cat} \textsl{n}="en">
7711 <\textbf{cat-item} \textsl{lemma}="en" \textsl{tags}="pr"/>
7712 </\textbf{def-cat}>
7713 \end{alltt}
7714 \end{small}
\item \texttt{<section-def-attrs>}: This section contains the
definition of the attributes that will be used in the action part of
the rules. You need attributes for all the categories defined in
the previous section, if they are to be used in the action part of the
rule (to make verifications on them or to send them out at the end of
the rule), as well as for other attributes needed in the rule (such as
gender or number). Attributes have to be defined using their
corresponding grammatical symbols and cannot contain asterisks; each
attribute name must be unique. The following are the definitions for the attributes
\texttt{a\_det} (for determiners) and \texttt{gen} (gender):
7728 \begin{small}
7729 \begin{alltt}
7730 <\textbf{def-attr} \textsl{n}="a_det">
7731 <\textbf{attr-item} \textsl{tags}="det.def"/>
7732 <\textbf{attr-item} \textsl{tags}="det.ind"/>
7733 <\textbf{attr-item} \textsl{tags}="det.dem"/>
7734 <\textbf{attr-item} \textsl{tags}="det.pos"/>
7735 <\textbf{attr-item} \textsl{tags}="predet"/>
7736 </\textbf{def-attr}>
7738 <\textbf{def-attr} \textsl{n}="gen">
7739 <\textbf{attr-item} \textsl{tags}="m"/>
7740 <\textbf{attr-item} \textsl{tags}="f"/>
7741 <\textbf{attr-item} \textsl{tags}="mf"/>
7742 <\textbf{attr-item} \textsl{tags}="nt"/>
7743 <\textbf{attr-item} \textsl{tags}="GD"/>
7744 </\textbf{def-attr}>
7746 \end{alltt}
7747 \end{small}
7749 \item \texttt{<section-def-vars>}: This section contains the
7750 definition of the variables used in the rules.
7752 \begin{small}
7753 \begin{alltt}
7754 <\textbf{def-var} \textsl{n}="interrogativa"/>
7755 \end{alltt}
7756 \end{small}
7758 \item \texttt{<section-def-macros>}: Here the macroinstructions are
7759 defined, which contain sequences of code that are frequently used in
7760 the rules; this way, linguists do not need to write the same actions
7761 repeatedly. There are, for example, macroinstructions for gender and
7762 number agreement operations.
\item \texttt{<section-rules>}: This is the section where the
structural transfer rules are written.
7767 \end{itemize}
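
The rule shown below also relies on a category for nouns and, for
agreement, on an attribute for number. As a sketch only, such
definitions might look as follows; the tag pattern \texttt{n.*} and the
list of number values (\texttt{sg}, \texttt{pl}, \texttt{sp},
\texttt{ND}) are assumptions modelled on the \texttt{det} and
\texttt{gen} definitions above, and should be checked against the
transfer file of your language pair:

\begin{small}
\begin{alltt}
<\textbf{def-cat} \textsl{n}="nom">
  <\textbf{cat-item} \textsl{tags}="n.*"/>
</\textbf{def-cat}>

<\textbf{def-attr} \textsl{n}="nbr">
  <\textbf{attr-item} \textsl{tags}="sg"/>
  <\textbf{attr-item} \textsl{tags}="pl"/>
  <\textbf{attr-item} \textsl{tags}="sp"/>
  <\textbf{attr-item} \textsl{tags}="ND"/>
</\textbf{def-attr}>
\end{alltt}
\end{small}

The first definition belongs in \texttt{<section-def-cats>} and the
second in \texttt{<section-def-attrs>}.
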
7769 The following is an example of a rule which detects the sequence
7770 \emph{determiner--noun}:
7772 \begin{small}
7773 \begin{alltt}
7774 <\textbf{rule}>
7775 <\textbf{pattern}>
7776 <\textbf{pattern-item} \textsl{n}="det"/>
7777 <\textbf{pattern-item} \textsl{n}="nom"/>
</\textbf{pattern}>
7779 <\textbf{action}>
7780 <\textbf{call-macro} \textsl{n}="f_concord2">
7781 <\textbf{with-param} \textsl{pos}="2"/>
7782 <\textbf{with-param} \textsl{pos}="1"/>
7783 </\textbf{call-macro}>
7784 <\textbf{out}>
7785 <\textbf{lu}>
7786 <\textbf{clip} \textsl{pos}="1" \textsl{side}="tl" \textsl{part}="whole"/>
7787 </\textbf{lu}>
7788 <\textbf{b} \textsl{pos}="1"/>
7789 <\textbf{lu}>
7790 <\textbf{clip} \textsl{pos}="2" \textsl{side}="tl" \textsl{part}="whole"/>
7791 </\textbf{lu}>
7792 </\textbf{out}>
7793 </\textbf{action}>
7794 </\textbf{rule}>
7795 \end{alltt}
7796 \end{small}
7798 Part of the action performed on this pattern is specified inside the
7799 macroinstruction \texttt{f\_concord2}, which is defined in the
7800 \texttt{<section-def-macros>}. It performs gender and number agreement
7801 operations: if there is a gender or number change between the source
7802 language and the target language (in the noun), the determiner changes
7803 its gender or number accordingly; furthermore, if gender or number are
7804 undetermined (\texttt{GD} or \texttt{ND}\footnote{See pages
7805 \pageref{pg:GD} or \pageref{GDND}}), the noun receives the correct
7806 gender or number values from the preceding determiner. In the Apertium
7807 es--ca, es--gl and es--pt systems, there are agreement
7808 macroinstructions defined for one, two, three or four lexical units
7809 (\texttt{f\_concord1}, \texttt{f\_concord2}, \texttt{f\_concord3},
7810 \texttt{f\_concord4}). When calling the macroinstructions in a rule,
7811 it must be specified which is the main lexical unit (the one which
7812 most heavily determines the gender or number of the other lexical
7813 units) and which other lexical units of the pattern have to be
7814 included in the agreement operations, in order of importance. This is
7815 done with the \texttt{<with-param pos=""/>} element. In the presented
7816 rule, the main lexical unit is the noun (position "2" in the pattern)
7817 and the second one is the determiner (positions "1" in the pattern).
7819 After the pertinent actions, the resulting lexical forms are sent out,
7820 inside the \texttt{<out>} element. Each lexical unit is defined with a
7821 \texttt{<clip>}. Its attributes mean the following:
7823 \begin{itemize}
7825 \item [-]\texttt{pos}: refers to the position of the lexical form in
7826 the pattern; \texttt{1} is the first lexical form (the determiner) and
7827 \texttt{2} the second one (the noun).
\item [-]\texttt{side}: indicates whether the lexical form is in the source
language (\texttt{sl}) or in the target language (\texttt{tl}). Of
course, words are always sent out in the target language; source
language lexical forms may be needed inside a rule, when testing
their attributes or characteristics.
7835 \item [-]\texttt{part}: indicates which part of the lexical form is
7836 referred to in the \texttt{clip}. You can use some predefined values:
7838 \begin{itemize}
7840 \item [-]\texttt{whole}: the whole lexical form (lemma and grammatical
7841 symbols). Used only when sending out the lexical unit (inside an
7842 \texttt{<out>} element).
7844 \item [-]\texttt{lem}: the lemma of the lexical unit
\item [-]\texttt{lemh}: the head of the lemma of a multiword with
inner inflection (see Section \ref{multi} in this chapter, or
Section~\ref{ss:multipalabras} for a more detailed
description)
7851 \item [-]\texttt{lemq}: the queue of a lemma of a multiword with inner
7852 inflection
7855 \end{itemize}
7857 Apart from these predefined values, you can use any of the attributes
7858 defined in \texttt{<section-def-attrs>} (for example \texttt{gen} or
7859 \texttt{a\_det}).
7861 The values \texttt{lemh} and \texttt{lemq} are used when sending out
7862 multiwords with inner inflection in order to place the head and the
7863 queue of the lemma in the right position, since the previous module
7864 moved the queue just after the lemma head for various reasons. In
7865 practice, in our system, this means that you must use these values
7866 instead of \texttt{whole} when sending out verbs. This is because, in
7867 our dictionaries, multiwords with inner inflection are always verbs
7868 \nota{NEEDS UPDATING}and, if you use the value \texttt{whole} when
7869 sending them out, the multiword would not be well formed (the head and
7870 the queue of the lemma would not have the correct position and the
7871 multiword could not be generated by the generator).
7873 \end{itemize}
Therefore, a rule that has a verb in its pattern must send out the lexical
forms as in the following two examples:
7879 \label{regla_verbo1}
7880 \begin{small}
7881 \begin{alltt}
7882 <\textbf{rule}>
7883 <\textbf{pattern}>
7884 <\textbf{pattern-item} \textsl{n}="verb"/>
</\textbf{pattern}>
7886 <\textbf{action}>
7887 <\textbf{out}>
7888 <\textbf{lu}>
7889 <\textbf{clip} \textsl{pos}="1" \textsl{side}="tl" \textsl{part}="lemh"/>
7890 <\textbf{clip} \textsl{pos}="1" \textsl{side}="tl" \textsl{part}="a_verb"/>
7891 <\textbf{clip} \textsl{pos}="1" \textsl{side}="tl" \textsl{part}="temps"/>
7892 <\textbf{clip} \textsl{pos}="1" \textsl{side}="tl" \textsl{part}="persona"/>
7893 <\textbf{clip} \textsl{pos}="1" \textsl{side}="tl" \textsl{part}="gen"/>
7894 <\textbf{clip} \textsl{pos}="1" \textsl{side}="tl" \textsl{part}="nbr"/>
7895 <\textbf{clip} \textsl{pos}="1" \textsl{side}="tl" \textsl{part}="lemq"/>
7896 </\textbf{lu}>
7897 </\textbf{out}>
7898 </\textbf{action}>
7899 </\textbf{rule}>
7900 \end{alltt}
7901 \end{small}
7904 \label{regla_verbo2}
7905 \begin{small}
7906 \begin{alltt}
7907 <\textbf{rule}>
7908 <\textbf{pattern}>
7909 <\textbf{pattern-item} \textsl{n}="verb"/>
7910 <\textbf{pattern-item} \textsl{n}="prnenc"/>
</\textbf{pattern}>
7912 <\textbf{action}>
7913 <\textbf{out}>
7914 <\textbf{mlu}>
7915 <\textbf{lu}>
7916 <\textbf{clip} \textsl{pos}="1" \textsl{side}="tl" \textsl{part}="lemh"/>
7917 <\textbf{clip} \textsl{pos}="1" \textsl{side}="tl" \textsl{part}="a_verb"/>
7918 <\textbf{clip} \textsl{pos}="1" \textsl{side}="tl" \textsl{part}="temps"/>
7919 <\textbf{clip} \textsl{pos}="1" \textsl{side}="tl" \textsl{part}="persona"/>
7920 <\textbf{clip} \textsl{pos}="1" \textsl{side}="tl" \textsl{part}="nbr"/>
7921 </\textbf{lu}>
7922 <\textbf{lu}>
7923 <\textbf{clip} \textsl{pos}="2" \textsl{side}="tl" \textsl{part}="lem"/>
7924 <\textbf{clip} \textsl{pos}="2" \textsl{side}="tl" \textsl{part}="a_prnenc"/>
7925 <\textbf{clip} \textsl{pos}="2" \textsl{side}="tl" \textsl{part}="persona"/>
7926 <\textbf{clip} \textsl{pos}="2" \textsl{side}="tl" \textsl{part}="gen"/>
7927 <\textbf{clip} \textsl{pos}="2" \textsl{side}="tl" \textsl{part}="nbr"/>
7928 <\textbf{clip} \textsl{pos}="1" \textsl{side}="tl" \textsl{part}="lemq"/>
7929 </\textbf{lu}>
7930 </\textbf{mlu}>
7931 </\textbf{out}>
7932 </\textbf{action}>
7933 </\textbf{rule}>
7934 \end{alltt}
7935 \end{small}
7938 The first rule detects a verb and places the queue in the correct
7939 place, after all the grammatical symbols. The lexical unit is sent
7940 specifying the attributes separately: lemma head, lexical category
7941 (verb), tense, person, gender (for the participles), number and lemma
7942 queue.
7944 The second rule detects a verb followed by an enclitic pronoun and
7945 sends the two lexical forms specifying also the attributes separately;
7946 the first lexical unit consists of: lemma head, lexical category
7947 (verb), tense, person and number; the second lexical unit consists of:
7948 lemma, lexical category (enclitic pronoun), person, gender, number and
7949 lemma queue (of the first lexical form). This way, the queue of the
7950 lemma is placed after the enclitic pronoun. The two lexical units
7951 (verb and enclitic pronoun) are sent inside a \texttt{<mlu>} element,
7952 since they have to reach the morphological generator as a multilexical
7953 unit (multiword).
7956 Taking into account what we have explained here, if you want to
7957 \textbf{add a new transfer rule} you have to follow these steps:
7959 \begin{enumerate}
\item Specify which pattern you want to detect. Bear in mind that
words are processed only once by a rule, and that rules are applied
left to right, choosing the longest match. For example, imagine
you have in your transfer rules file only two rules, one for the
pattern \emph{determiner--noun} and one for the pattern
\emph{noun--adjective}. The Spanish phrase \emph{el valle verde}
("the green valley") would be detected and processed by the first one,
not by the second. You will need to add a rule for the pattern
\emph{determiner--noun--adjective} if you want the three
lexical units to be processed in the same pattern.
\item Describe the operations you want to perform on the pattern. In
the Apertium \texttt{es-ca}, \texttt{es-gl} and \texttt{es-pt}
systems, simple agreement operations (gender and number agreement) are
easy to perform in a rule by means of a macroinstruction. To perform
other operations, you will need to use more complicated elements; for
a more detailed description of the language used to create rules,
refer to Section~\ref{formatotransfer}.
\item Send out the lexical units of the pattern in the target language
inside an \texttt{<out>} element. Each lexical unit must be included
in a \texttt{<lu>} element. If two or more lexical units must be
generated as a multilexical unit (only for enclitic pronouns in the
present language pairs), they must be grouped inside a \texttt{<mlu>}
element.
7987 All the words that are detected by a rule (that are part of a pattern)
7988 must be sent out at the end of the rule so that the next module (the
7989 generator) receives them. If a lexical unit is detected by a pattern
7990 and is not included in the \texttt{<out>} element, it will not be
7991 generated.
7994 \end{enumerate}
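
Following these steps, a rule for the pattern
\emph{determiner--noun--adjective} mentioned in the first step could be
sketched as shown below. This is an illustrative sketch rather than a
rule taken from the distributed files: it assumes that a category
\texttt{adj} has been defined in \texttt{<section-def-cats>}, and that
\texttt{f\_concord3} takes the main lexical unit first (here the noun,
position 2) followed by the other two, by analogy with the call to
\texttt{f\_concord2} shown earlier:

\begin{small}
\begin{alltt}
<\textbf{rule}>
  <\textbf{pattern}>
    <\textbf{pattern-item} \textsl{n}="det"/>
    <\textbf{pattern-item} \textsl{n}="nom"/>
    <\textbf{pattern-item} \textsl{n}="adj"/>
  </\textbf{pattern}>
  <\textbf{action}>
    <\textbf{call-macro} \textsl{n}="f_concord3">
      <\textbf{with-param} \textsl{pos}="2"/>
      <\textbf{with-param} \textsl{pos}="1"/>
      <\textbf{with-param} \textsl{pos}="3"/>
    </\textbf{call-macro}>
    <\textbf{out}>
      <\textbf{lu}>
        <\textbf{clip} \textsl{pos}="1" \textsl{side}="tl" \textsl{part}="whole"/>
      </\textbf{lu}>
      <\textbf{b} \textsl{pos}="1"/>
      <\textbf{lu}>
        <\textbf{clip} \textsl{pos}="2" \textsl{side}="tl" \textsl{part}="whole"/>
      </\textbf{lu}>
      <\textbf{b} \textsl{pos}="2"/>
      <\textbf{lu}>
        <\textbf{clip} \textsl{pos}="3" \textsl{side}="tl" \textsl{part}="whole"/>
      </\textbf{lu}>
    </\textbf{out}>
  </\textbf{action}>
</\textbf{rule}>
\end{alltt}
\end{small}

All three lexical units detected by the pattern are sent out, each in
its own \texttt{<lu>} element, so none of them is lost before
generation.
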
7997 \section[Adding data for the part-of-speech tagger]{Adding data for
7998 the lexical categorial disambiguator (part-of-speech tagger)}
8000 The lexical categorial disambiguator takes the linguistic information
8001 needed to disambiguate a text basically from two sources: a tagset
8002 definition file and corpora. The tagset definition file is contained
8003 in the linguistic data directory and its name has the structure
8004 \texttt{apertium-PAIR.LANG.tsx}, whereas corpora information is
8005 contained in the \texttt{LANG-tagger-data} directory included in the
8006 previous directory.
8008 The \emph{tagset definition file} contains the definition of the
8009 coarse tags (or categories) used by the tagger when being trained and
8010 when disambiguating a text, as well as tag co-occurrence restrictions
8011 that help obtain better tag probabilities. In Section \ref{ss:tagger}
8012 you can find a detailed description of its characteristics.
8014 The \emph{corpora} that need to be in the \texttt{LANG-tagger-data}
8015 directory are different depending on whether the tagger is trained in
8016 a supervised way (with manually disambiguated text) or unsupervised
8017 (without manually disambiguated text):
8019 \begin{itemize}
8021 \item to train the tagger in a supervised way you need the files
8022 (examples from es-tagger-data): \texttt{es.tagged.txt},
8023 \texttt{es.untagged}, \texttt{es.tagged}, \texttt{es.dic}.
8025 \item to train the tagger in an unsupervised way you need the files
8026 (examples from es-tagger-data): \texttt{es.crp.txt}, \texttt{es.crp},
8027 \texttt{es.dic}
8029 \end{itemize}
8031 These files have the following characteristics:
8033 \begin{itemize}
8035 \item \texttt{es.tagged.txt}: A Spanish corpus in plain text format.
\item \texttt{es.untagged}: The corpus \texttt{es.tagged.txt}
morphologically analysed, that is, processed by the de-formatter
and the morphological analyser (automatically generated corpus).
8039 \item \texttt{es.tagged}: The preceding corpus manually disambiguated.
8040 \item \texttt{es.crp.txt}: A large corpus (hundreds of thousands of
8041 words) used when training the tagger in an unsupervised way with
8042 Baum-Welch reestimation.
8043 \item \texttt{es.crp}: The preceding corpus processed consecutively by
8044 the de-formatter and the morphological analyser (automatically
8045 generated corpus).
8046 \item \texttt{es.dic}: File created from the Spanish monolingual
8047 dictionary \texttt{*.es.dix}, by means of the \texttt{lt-expand} and
8048 \texttt{aper\-tium\--fil\-ter\--am\-biguity} tools, which expand the
8049 dictionary and filter the ambiguity classes, so that the file contains
8050 all the forms identified as different ambiguity classes by the tagger
8051 defined with \texttt{*.es.tsx}; that is, which lexical categories can
8052 be homographs (automatically generated corpus).
8053 \end{itemize}
When you download Apertium from Sourceforge
(\url{http://apertium.sourceforge.net/}), if the tagger has been
trained in a supervised way, you will probably get the files
needed for this kind of training, \texttt{es.tagged} and
\texttt{es.tagged.txt} (for Spanish). The other required files are
8060 automatically generated when running the training. If the tagger has
8061 been trained in an unsupervised way, you will not get any corpus in
8062 the download since the files required for this kind of training are
8063 huge. If you wish to train the tagger with this method, you will need
8064 to collect a large corpus and name it \texttt{es.crp.txt}. The other
8065 required files are automatically generated when running the training.
In any case, the Apertium translator comes with all the data the tagger
needs to perform well. You do not need to train the tagger in
order to use Apertium. Retraining might be required if
you have made really extensive changes to the dictionaries or
modified the tagset definition file.
8073 Therefore, the tagger data can be modified in two ways:
8075 \begin{enumerate}
8077 \item Change the tagset definition file. You can add, change or delete
8078 the coarse tags used by the tagger, if you think that a new category
8079 could be useful for the disambiguation or that a certain category
8080 should be modified to obtain better results. You can also add
8081 restrictions (for example, you can forbid the sequence
8082 determiner--determiner if this is an impossible combination in a given
8083 language and can help in the disambiguation of certain homograph
8084 words).
8086 \item Modify the corpora used to train the tagger. You can modify the
8087 manually disambiguated text (\texttt{es.tagged} for Spanish) if you
8088 think that certain tags have been wrongly selected. You can also add
8089 sentences to this text (and to \texttt{es.tagged.txt}, used to automatically
8090 generate the corpus \texttt{es.untagged}) in order to
8091 add information to the tagger, since it is possible that certain
8092 combinations are incorrectly disambiguated because the tagger has not
8093 found them in the training corpora.
8096 \end{enumerate}
8098 There are two commands to run the training:
8100 \begin{itemize}
8102 \item to train in a supervised way, type, in the directory containing
8103 the linguistic data (example for \emph{es}--\emph{ca}): \texttt{make
8104 -f es-ca-supervised.make}
8107 \item to train in an unsupervised way, type, in the directory
8108 containing the linguistic data (example for \emph{es}--\emph{ca}):
8109 \texttt{make -f es-ca-unsupervised.make}
8112 \end{itemize}
In both cases, the required files will be generated automatically.
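
Before launching either command it can be useful to check that the
corpus files mentioned above are present. The following shell function
is a minimal sketch of such a check and is not part of Apertium: the
name \texttt{missing\_corpus\_files} and its two arguments (training
mode and language code) are assumptions made for this example, and the
file names simply follow the Spanish examples given earlier.

\begin{small}
\begin{verbatim}
# List the corpus files still missing for a training mode.
# $1: supervised or unsupervised   $2: language code, e.g. es
# Prints one missing file per line; succeeds only if none is missing.
missing_corpus_files() {
    mode=$1
    lang=$2
    case $mode in
        supervised)   files="$lang.tagged $lang.tagged.txt" ;;
        unsupervised) files="$lang.crp.txt" ;;
        *) echo "unknown training mode: $mode" >&2; return 2 ;;
    esac
    missing=0
    for file in $files; do
        if [ ! -f "$file" ]; then
            echo "$file"
            missing=1
        fi
    done
    return $missing
}
\end{verbatim}
\end{small}

\noindent For instance, \texttt{missing\_corpus\_files supervised es
\&\& make -f es-ca-supervised.make} would start the supervised
training only when both files exist in the current directory.
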
8117 \section{Detecting errors}
8118 \label{errores}
8121 It is easy to make errors when adding new words or transfer rules to
8122 the Apertium system.
8124 On the one hand, it is possible that, when compiling the new files,
8125 the system displays an error message. In this case, this is a formal
8126 error (a missing XML tag, a tag that is not allowed in a certain
8127 context, etc.). You just have to go to the line number indicated by
8128 the error message, correct the error and compile again. On the other
8129 hand, there are other types of errors not detected when compiling, but
8130 which can make the system mistranslate a word or give an
8131 incomprehensible text string. These are linguistic errors, which can
be detected and corrected with the tips given in this chapter. The
following information is for Linux users, since for the moment Apertium
works only on this operating system.\footnote{Experimental packages for
Windows, with fixed linguistic data (non-modifiable binary files), are
available at \url{http://apertium.org}.}
8138 \subsection{Adjusting error symbols}
8139 \label{subsec:marcaserror}
When the system has a problem translating a word of the source
language text, in the default mode it outputs the problematic word
together with a symbol indicating that an error has occurred.
The meaning of the different symbols is as follows:
8148 \begin{itemize}
\item '\verb!@!': The problem is in the lexical transfer module, which
cannot translate the lexical form (the bilingual dictionary does not
contain it).

\item '\verb!#!': The problem has occurred in the generator, which
cannot generate the surface form from the input lexical form (the
morphological dictionary does not contain it in the generation
direction).
8160 \item '\verb!/!': This symbol separates two or more surface forms
8161 delivered by the generator. The problem, therefore, is in the target
8162 language monolingual dictionary, which has, in the generation
8163 direction, two surface forms for a single lexical form, when it should
8164 have only one.
8167 \end{itemize}
8170 The generation module has three modes, which enable us to decide how
8171 errors will be displayed in the final output. The three possible
8172 parameters are:
8174 \begin{itemize}
8176 \item -n : error symbols and the unknown-word symbol will NOT be
8177 displayed, and neither will any grammatical symbols
8179 \item -g : error symbols and the unknown-word symbol will be displayed
8180 (default mode)
8182 \item -d : error symbols and the unknown-word symbol will be
8183 displayed, as well as the grammatical symbols of the lexical forms
8184 producing the error.
8187 \end{itemize}
The preferable mode depends on the type of user and on the purpose of
the translation. The first option is the most suitable when the user
does not want extraneous symbols to interfere with reading the
translation. The second option is useful when the user wants the
8194 system to show where there has been a problem in the translation
8195 (errors or unknown words) in order to be able to post-edit it
8196 easily. The third option is ideal for linguistic developers of
8197 Apertium, since it displays all the linguistic information of the
8198 forms that produced an error.
8200 Taking advantage of the error symbols output by the system, it is
8201 possible to carry out a thorough test of the dictionaries of a certain
8202 language pair. This will enable you to detect and correct all its
8203 errors. To learn how to do it, see Section \ref{integridad}.
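
As a quick aid when reading translator output, the meaning of the
three symbols can be turned into a small filter. The function below is
only a sketch: the name \texttt{explain\_marks} and the wording of its
messages are invented for this example, and it will also report lines
in which '\texttt{@}', '\texttt{\#}' or '\texttt{/}' belong to the
translated text itself.

\begin{small}
\begin{verbatim}
# Read translator output on standard input and, for every line
# that carries a mark, name the place worth checking first.
explain_marks() {
    while IFS= read -r line; do
        case $line in
            *@*)   echo "bilingual dictionary: $line" ;;
        esac
        case $line in
            *'#'*) echo "generation dictionary: $line" ;;
        esac
        case $line in
            */*)   echo "several surface forms: $line" ;;
        esac
    done
}
\end{verbatim}
\end{small}

\noindent A line that carries more than one symbol is reported once
for each of them, so that no problem is hidden by another.
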
8205 \subsection{Output of the different Apertium modules}
8207 Sometimes it is difficult to find the origin of an error. In such
8208 cases, it is useful to see the output of each of the modules. As all
8209 the data processed by the system, from the original text to the
8210 translated text, circulate between the eight modules of the system in
text format, it is possible to stop the text stream at any point to
see what the input or the output of a certain module is.
8214 Using a pipeline structure and the \texttt{echo} or \texttt{cat}
8215 commands, you can send a text through one or more modules to analyse
their output and detect the origin of the error. The following
sections describe how to do this. Change to the directory where the
linguistic data are saved and type the commands shown.
8222 \subsubsection{The morphological analyser output}
To see how a word is analysed by the translator, type the following
8225 in the terminal (example for the Catalan word \emph{sabates}):
8228 \begin{small}
8229 \begin{alltt}
8230 echo "sabates" | apertium-destxt | lt-proc ca-es.automorf.bin
8231 \end{alltt}
8232 \end{small}
8234 You can replace \texttt{ca-es} with the translation direction you want
8235 to test.
8237 The output in Apertium should be:
8238 \begin{small}
8239 \begin{alltt}
8240 ^sabates/sabata<n><f><pl>\$^./.<sent>\$[][]
8241 \end{alltt}
8242 \end{small}
8244 The string structure is
8245 \verb!^!\texttt{word/lemma<}\textsl{morphological
8246 analysis}\texttt{>}\verb!$!. The \texttt{<sent>} tag is the analysis
8247 of the full stop, as every sentence end is represented as a full stop
8248 by the system, whether or not explicitly indicated in the sentence.
8250 The analysis of an unknown word is (ignoring the full stop info):
8252 \begin{small}
8253 \begin{alltt}
8254 ^genoma/*genoma\$
8255 \end{alltt}
8256 \end{small}
8258 \noindent and the analysis of an ambiguous word:
8260 \begin{small}
8261 \begin{alltt}
8262 ^casa/casa<n><f><sg>/casar<vblex><pri><p3><sg>/casar<vblex><imp><p2><sg>\$
8263 \end{alltt}
8264 \end{small}
8266 Each lexical form (lemma plus morphological analysis) is presented as
8267 a possible analysis of the word \emph{casa}.
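
When a unit has many analyses, it can be easier to read them one per
line. The following functions are a minimal sketch of how the format
described above can be taken apart with standard tools; the names
\texttt{split\_unit} and \texttt{count\_analyses} are invented for this
example, and the sketch assumes that neither the surface form nor the
lemmas contain a '\texttt{/}' character.

\begin{small}
\begin{verbatim}
# Print the surface form and then each analysis of one lexical
# unit, one per line, after removing the ^ and $ delimiters.
split_unit() {
    printf '%s\n' "$1" |
        sed -e 's/^\^//' -e 's/\$$//' |
        tr '/' '\n'
}

# Count the analyses of a unit (every line but the surface form).
count_analyses() {
    split_unit "$1" | sed 1d | wc -l | tr -d ' '
}
\end{verbatim}
\end{small}

\noindent A unit for which \texttt{count\_analyses} reports more than
one analysis is ambiguous, and an analysis that begins with an
asterisk corresponds to an unknown word.
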
8269 \subsubsection{The tagger output}
To see the output of the tagger for a source language text, type the
8273 following in the terminal (example for the Catalan-Spanish direction):
8275 \begin{small}
8276 \begin{alltt}
8277 echo "sabates" | apertium-destxt | lt-proc ca-es.automorf.bin \\|apertium-tagger -g ca-es.prob
8278 \end{alltt}
8279 \end{small}
8281 The output will be:
8282 \begin{small}
8283 \begin{alltt}
^sabata<n><f><pl>\$^.<sent>\$[][]
8285 \end{alltt}
8286 \end{small}
8288 The output for an ambiguous word will be like the one above, since the
8289 tagger chooses one lexical form among all the
8290 possibilities. Therefore, the output for \emph{casa} in Catalan will
8291 be, for example (depending on the context):
8293 \begin{small}
8294 \begin{alltt}
8295 ^casa<n><f><sg>\$^.<sent>\$[][]
8296 \end{alltt}
8297 \end{small}
8299 \subsubsection{The \texttt{pretransfer} output}
This module applies some changes to multiwords (it moves the lemma
queue of a multiword with inner inflection to just after the lemma
head). To see its output, type:
8305 \begin{small}
8306 \begin{alltt}
8307 echo "sabates" | apertium-destxt | lt-proc ca-es.automorf.bin \\|apertium-tagger -g ca-es.prob | apertium-pretransfer
8308 \end{alltt}
8309 \end{small}
8311 Since \emph{sabates} is not a multiword, this module does not alter
8312 its input.
8314 \subsubsection{The structural and lexical transfer output}
To see how a word, phrase or sentence is translated into the target
8317 language and processed by structural transfer rules, type the
8318 following in the terminal:
8319 \begin{small}
8320 \begin{alltt}
8321 echo "sabates" | apertium-destxt | lt-proc ca-es.automorf.bin \\|apertium-tagger -g ca-es.prob | apertium-pretransfer \\| ./ca-es.transfer ca-es.autobil.bin
8322 \end{alltt}
8323 \end{small}
8325 The output for this word will be:
8327 \begin{small}
8328 \begin{alltt}
8329 ^zapato<n><m><pl>\$^.<sent>\$[][]
8330 \end{alltt}
8331 \end{small}
8334 Analysing how a word or phrase is output by this module can help you
8335 detect errors in the bilingual dictionary or in the structural
8336 transfer rules. Typical bilingual dictionary errors are: two
8337 equivalents for the same source language lexical form, or wrong
8338 assignment of grammatical symbols. Errors due to structural transfer
8339 rules vary a lot depending on the actions performed by the rules.
8342 \subsubsection{The morphological generator output}
To see how a word is generated by the system, type the following in
8345 the terminal:
8347 \begin{small}
8348 \begin{alltt}
echo "sabates" | apertium-destxt | lt-proc ca-es.automorf.bin \\|apertium-tagger -g ca-es.prob | apertium-pretransfer \\| ./ca-es.transfer ca-es.autobil.bin | lt-proc -g ca-es.autogen.bin
8350 \end{alltt}
8351 \end{small}
8353 With this command you can detect generation errors due to an incorrect
8354 entry in the target language monolingual dictionary or to a divergence
8355 between the output of the bilingual dictionary (the output of the
8356 previous module) and the entry in the monolingual dictionary.
8358 The correct output for the input \emph{sabates} would be:
8360 \begin{small}
8361 \begin{alltt}
8362 zapatos.[][]
8363 \end{alltt}
8364 \end{small}
At this step there are no grammatical symbols, and the word appears
inflected.
8369 \subsubsection{The post-generator output}
Errors due to the post-generator are unusual, because it is generally
small and is seldom changed once the usual combinations have been
added, but you can also test how a source language text comes out of
this module by typing:
8376 \begin{small}
8377 \begin{alltt}
echo "sabates" | apertium-destxt | lt-proc ca-es.automorf.bin \\|apertium-tagger -g ca-es.prob | apertium-pretransfer \\| ./ca-es.transfer ca-es.autobil.bin | lt-proc -g ca-es.autogen.bin \\| lt-proc -p ca-es.autopgen.bin
8379 \end{alltt}
8380 \end{small}
8382 \subsubsection{The Apertium output}
8384 You can put all the modules of the system in the pipeline structure
8385 and see how a source language text goes through all the modules and
8386 gets translated into the target language. You just have to add the
8387 re-formatter to the previous command:
8389 \begin{small}
8390 \begin{alltt}
echo "sabates" | apertium-destxt | lt-proc ca-es.automorf.bin \\|apertium-tagger -g ca-es.prob | apertium-pretransfer \\| ./ca-es.transfer ca-es.autobil.bin | lt-proc -g ca-es.autogen.bin \\| lt-proc -p ca-es.autopgen.bin | apertium-retxt
8392 \end{alltt}
8393 \end{small}
8395 This is the same as using the \texttt{apertium-translator} shell
8396 script provided by the Apertium package:
8398 \begin{small}
8399 \begin{alltt}
8400 echo "sabates" | apertium-translator . ca-es
8401 \end{alltt}
8402 \end{small}
8404 \noindent (The dot indicates the directory where the linguistic data
8405 are saved, in this case the current directory).
8407 Of course, instead of typing all the presented commands every time you
8408 need to test a translation, you can create shell scripts for every
8409 action and use them to test the output of each module.
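
One possible shape for such a script is sketched below. It is not part
of the Apertium package: the function name \texttt{build\_pipeline} and
the stage labels are invented for this example, and the sketch assumes
that all the data files, including the post-generator binary, carry
the translation direction as a prefix. The function only prints the
command line that ends at the requested module, so it can be inspected
before it is run.

\begin{small}
\begin{verbatim}
# Print the pipeline that ends at the requested module.
# $1: stage label (names invented for this sketch)
# $2: translation direction, e.g. ca-es
build_pipeline() {
    want=$1
    dir=$2
    cmd=""
    for step in \
        "deformatter:apertium-destxt" \
        "analyser:lt-proc $dir.automorf.bin" \
        "tagger:apertium-tagger -g $dir.prob" \
        "pretransfer:apertium-pretransfer" \
        "transfer:./$dir.transfer $dir.autobil.bin" \
        "generator:lt-proc -g $dir.autogen.bin" \
        "postgenerator:lt-proc -p $dir.autopgen.bin" \
        "reformatter:apertium-retxt"
    do
        name=${step%%:*}   # text before the first colon
        tool=${step#*:}    # text after the first colon
        cmd="${cmd:+$cmd | }$tool"
        if [ "$name" = "$want" ]; then
            printf '%s\n' "$cmd"
            return 0
        fi
    done
    echo "unknown stage: $want" >&2
    return 1
}
\end{verbatim}
\end{small}

\noindent To see the tagger output for a word, for example, the printed
command can be run with \texttt{echo "sabates" | eval
"\$(build\_pipeline tagger ca-es)"} in the directory that contains the
linguistic data.
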
8414 \subsection{Error examples}
8417 1) We can get the following kind of output in a translation:
8419 \begin{small}
8420 \begin{alltt}
8421 \$ echo "nord" | apertium-translator . ca-es
8422 \$ #norte<n><m><sg>
8423 \end{alltt}
8424 \end{small}
8426 This means that the word was correctly translated by the bilingual
8427 dictionary but that the system does not find it in the Spanish
8428 morphological dictionary to generate it. The problem can be in the
8429 morphological dictionary but can also be caused by an incorrect
8430 bilingual entry, in which the grammatical symbols that the translated
8431 word is assigned do not correspond with the grammatical symbols that
8432 this word has in the morphological dictionary.
8434 2) The following \texttt{es-ca} bilingual entry does not take into
8435 account the gender change between \emph{adhesiu} (masculine) and
8436 \emph{pegatina} (feminine), causing the translator to give an error:
8438 \begin{small}
8439 \begin{alltt}
8440 <\textbf{e}>
8441 <\textbf{p}>
8442 <\textbf{l}>pegatina<\textbf{s} \textsl{n}="n"/></\textbf{l}>
8443 <\textbf{r}>adhesiu<\textbf{s} \textsl{n}="n"/></\textbf{r}>
8444 </\textbf{p}>
8445 </\textbf{e}>
8446 \end{alltt}
8447 \end{small}
8449 \begin{small}
8450 \begin{alltt}
8451 \$ echo "adhesiu" | apertium-translator . ca-es
8452 \$ #pegatina<n><m><sg>
8453 \end{alltt}
8454 \end{small}
8456 The correct entry should be:
8458 \begin{small}
8459 \begin{alltt}
8460 <\textbf{e}>
8461 <\textbf{p}>
8462 <\textbf{l}>pegatina<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="f"/></\textbf{l}>
8463 <\textbf{r}>adhesiu<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="m"/></\textbf{r}>
8464 </\textbf{p}>
8465 </\textbf{e}>
8466 \end{alltt}
8467 \end{small}
3) The following error is given when the source language lexical form
cannot be found in the bilingual dictionary, either because there is
no entry for this lemma or because the entry does not correspond with
the grammatical symbols received from the analyser:
8475 \begin{small}
8476 \begin{alltt}
8477 \$ echo "illot" | apertium-translator . ca-es
8478 \$ @illot<n><m><sg>
8479 \end{alltt}
8480 \end{small}
8482 4) When a source language lexical form has two correspondences in the
8483 bilingual dictionary, the translator output is like the following one:
8485 \begin{small}
8486 \begin{alltt}
8487 \$ echo "llavor" | apertium-translator . ca-es
8488 \$ #pepita<n>/semilla<n><m><sg>
8489 \end{alltt}
8490 \end{small}
8492 The solution is to put a direction restriction in one of the bilingual
8493 entries.
Some errors can be due to structural transfer rules. To solve a
problem whose origin is unknown, test the output of the different
modules to detect where the problem arises.
8500 \subsection{Testing the integrity of the
8501 dictionaries}\label{integridad}
It is highly advisable to test the integrity of the dictionaries from
time to time, especially if you have changed them significantly or if
you have changed the transfer rules, because some errors can be due to
their application.
8508 The test is carried out in one translation direction. For this reason,
8509 for a given language pair, you will have to perform two tests, one in
8510 each direction.
8512 The steps you have to follow to perform the test are:
8514 \begin{itemize}
8516 \item expand the source language monolingual dictionary, using the
8517 \texttt{lt-expand} tool, to obtain all the lexical forms (which are
8518 the forms that appear on the right of the colon in the output file);
8520 \item send these lexical forms (except those that are only generation
8521 forms, which \texttt{lt-expand} will have marked with the symbol
8522 '\texttt{<}' ) through all the system modules from pretransfer to the
8523 generator;
\item search the result for the lexical forms marked with the symbols
'\texttt{\#}', '\texttt{@}' or '\texttt{/}', which are the forms that
contain errors (see Section~\ref{subsec:marcaserror}).
8530 \end{itemize}
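
The first two steps can be sketched with standard text tools. The
function below is an illustration only: its name,
\texttt{lexical\_forms}, is invented for this example, and it assumes
that \texttt{lt-expand} writes one entry per line as
\textit{surface}\texttt{:}\textit{lexical}, with \texttt{:<:} in place
of the colon for generation-only entries and \texttt{:>:} for
analysis-only entries.

\begin{small}
\begin{verbatim}
# Read lt-expand output on standard input, drop the entries that
# are only used for generation, and print every remaining lexical
# form wrapped as ^form$ so that later modules can read it.
lexical_forms() {
    grep -v ':<:' |
        sed -e 's/:>:/:/' -e 's/^[^:]*://' \
            -e 's/^/^/' -e 's/$/$/'
}
\end{verbatim}
\end{small}

\noindent Its output can then be sent through
\texttt{apertium-pretransfer}, the transfer module and the generator,
in the same way as in the commands of the previous sections, and the
result searched for the three symbols with \texttt{grep}.
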
8535 \section{Generating a new Apertium system from modified data}
8537 If you make changes to any of the linguistic data files of Apertium
8538 (dictionaries, transfer rules or tagger definition file), the changes
8539 will not be applied until you recompile the modules. To do this, type
8540 \texttt{make} in the directory where the linguistic data are saved so
8541 that the system generates the new binary files.
8543 If changes were made to the tagger definition file or to the corpora
used to train the tagger, you will also need to retrain the tagger: in
8545 the same linguistic data directory, you have to type (example for the
8546 Spanish tagger in the es-ca translator) \texttt{make -f
8547 es-ca-unsupervised.make} for unsupervised training or \texttt{make -f
8548 es-ca-supervised.make} for supervised training.
After compilation, \texttt{apertium-translator} will use the new
data.
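
The rule for deciding whether retraining is needed can be written down
as a small helper. This is a sketch under stated assumptions, not an
Apertium tool: the name \texttt{needs\_retraining} is invented, and
the file name patterns are taken from the examples used in this
chapter (\texttt{*.tsx} for the tagset definition, and
\texttt{*.tagged}, \texttt{*.tagged.txt} and \texttt{*.crp.txt} for
the corpora).

\begin{small}
\begin{verbatim}
# Succeed if any of the files given as arguments is a tagset
# definition file or a training corpus, in which case the
# tagger has to be retrained after running make.
needs_retraining() {
    for file in "$@"; do
        case $file in
            *.tsx|*.tagged|*.tagged.txt|*.crp.txt) return 0 ;;
        esac
    done
    return 1
}
\end{verbatim}
\end{small}

\noindent Called with the list of files that were edited, it tells
whether \texttt{make} alone is enough or whether one of the two
training commands has to be run as well.
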
8553 \newpage
8556 \chapter{Data insertion web forms}
This chapter describes the dictionary maintenance system in Apertium
8561 2. It is organized in two sections. Section \ref{ss:formadmin} gives
8562 the necessary information to install and adjust the web application
8563 for word insertion. Section \ref{ss:formus} describes how to use the
8564 tool to add linguistic data.
8567 \section{Introduction}
8569 Adding lemmas to the dictionaries of the different languages in
8570 Apertium is a slow task if you do it by manually editing the XML
8571 dictionaries; for this reason web forms have been created, which make
8572 the word insertion task considerably easier and, furthermore, allow
8573 the users to do it remotely from any computer with Internet access.
The tool consists of a set of forms written in \texttt{php} which can
be used from any web browser, either locally on the same computer
where the dictionaries are saved, or remotely.
8579 \section{Installing and managing}
8580 \label{ss:formadmin}
8582 \subsection{Installing the tool}
The installation must be done on a Unix machine running an Apache
web server with \texttt{php} support. If \texttt{php} is not yet
installed, install it first, and then proceed to install the form
tool.
8590 To install the tool, download the package
8591 \textit{`apertium-lexical-webform-0.9'} from the Apertium web page in
8592 Sourceforge (\url{http://apertium.sourceforge.net/}) and unpack it in
8593 the directory where you want to leave the tool.
8596 \begin{alltt}
# cd /path/to/the/forms
# tar -xvzf /path/apertium-lexical-webform-0.9.tar.gz
8599 \end{alltt}
Bear in mind that Apache only serves pages located under its
configured root directory. Therefore, the directory
8603 where you place the forms must be a subdirectory inside the root
8604 directory of the Apache server.
8606 Next, you have to edit the configuration file, which you can find in
8607 \textit{private/config.php}, and give the appropriate values to the
8608 configuration variables:
8610 \begin{itemize}
\item \texttt{\$anmor}: full path to the morphological analyser
8612 \texttt{lt-proc}.
8613 \item \texttt{\$dicos\_path}: path to the directory where the final
8614 dictionaries and the compiled binaries of each dictionary are
8615 saved. This directory must contain a subdirectory for each dictionary
8616 with which the form can work. The subdirectory name must have the
8617 following structure: \texttt{paradigmes-ll-rr} , where \textit{ll} and
8618 \textit{rr} are the initials of the language pair involved. Each
8619 directory must contain the final dictionaries used by the machine
8620 translation system and the corresponding compiled binaries. These
8621 directories can be replaced with symbolic links in the case that they
8622 are located in a different place.
8623 \item \texttt{\$usuaris\_professionals}: a list of the professional
8624 users in the system that have permission to insert words in the form
8625 dictionaries and to validate entries pending confirmation.
8627 \item \texttt{\$mail}: E-mail address of the administrator of the
8628 forms. When someone wants to register as a user, an e-mail will be
8629 sent to this address.
8630 \end{itemize}
Once the parameters in this file have been configured, the forms
server is ready for use.
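
The layout expected under \texttt{\$dicos\_path} can be verified before
the forms are used. The function below is a minimal sketch and not part
of the form tool: the name \texttt{check\_pair\_dir} is invented, and
the sketch assumes that the final dictionaries have the extension
\texttt{dix} and the compiled binaries the extension \texttt{bin}.
Since the directory test follows symbolic links, linked directories are
accepted as well.

\begin{small}
\begin{verbatim}
# Check the dictionary directory of one language pair.
# $1: value of $dicos_path   $2: language pair, e.g. es-ca
check_pair_dir() {
    pair_dir=$1/paradigmes-$2
    if [ ! -d "$pair_dir" ]; then
        echo "missing directory: $pair_dir"
        return 1
    fi
    for ext in dix bin; do
        # ls fails when the pattern matches no file
        if ! ls "$pair_dir"/*."$ext" >/dev/null 2>&1; then
            echo "no .$ext files in $pair_dir"
            return 1
        fi
    done
}
\end{verbatim}
\end{small}

\noindent The function prints nothing and succeeds when the directory
exists and contains at least one file of each kind.
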
8636 \subsection{Directory structure}
8638 All the files required by the application are structured as follows:
8640 \begin{itemize}
8641 \item \texttt{/index.php:} displays the initial insertion form. It
8642 has a section for each language pair, where the user inserts the SL
8643 lemma and the TL lemma and chooses the appropriate part of
8644 speech. After pressing the \textit{'Go on'} button, the next page is
8645 displayed, where the user has to select the appropriate inflection
8646 paradigms for the SL lemma and the TL lemma.
8647 \item \texttt{/dics:} directory that contains the dictionaries with
8648 the entries inserted from the forms. It contains the files with the
8649 entries from non-professional users (pending validation) and the
8650 dictionaries with the \texttt{XML} entries from professional users.
8651 \item \texttt{/private:} most modules used in the forms are saved
8652 here. It contains also the directories with the definition of
8653 paradigms for all the languages of the forms; these directories have
8654 the name \texttt{paradigmes-ll-rr}, where \textit{ll} and \textit{rr}
8655 are the initials of a given language pair. The order chosen for the
8656 two languages, first \textit{ll} and then \textit{rr}, depends on the
8657 order defined for entries in the bilingual dictionary. This directory
8658 contains also the files that carry out the whole processing of the
8659 words being inserted. These files are:
8660 \begin{itemize}
\item \texttt{resultado.php: } This module is called when two
words for any language pair are inserted from the module
\textit{index.php}. Basically, it establishes the
language pair involved (\textit{\$LR} and \textit{\$RL}) and the
part of speech of the words being inserted (\textit{\$tipus}). It is
included in the \textit{selecc.php} module, which is the next one
called in the insertion process. If the \textit{tipus}
(\textit{type}) of the word being inserted is a multiword unit
(\textit{Multi Word Verb}), then \textit{multip.php} is the module
included and called instead of \textit{selecc.php}. \textit{Multi
Word Verb} elements consist of a verb that can inflect followed by
an invariable queue of one or more words (see Section
\ref{ss:multipalabras} for a detailed description).
8674 \item \texttt{selecc.php: } This module is in charge of the selection
8675 of paradigms for the pair of words, the SL word and the TL word. It
8676 displays a list of paradigms to be chosen from, which depends on the
8677 part of speech of the entry being inserted. When a new paradigm is
8678 selected for a lemma, it displays some examples of inflected forms of
8679 the lemma according to the chosen paradigm. If the user accepts the
8680 chosen paradigms, the module calls \textit{insertarPro.php} or
8681 \textit{insertar.php} depending on whether the user is professional or
8682 non-professional respectively.
8683 \item \texttt{multip.php: } It has the same function as the
8684 \textit{selecc.php} module but for multiword units. It uses the same
8685 variables and performs the same actions, but in the examples
8686 displayed, the verb is inflected and the words of the queue are added
after it. It works analogously to the \textit{selecc.php}
8688 module, whose detailed description can be found in Section
8689 \ref{ss:fitxersphp}.
8690 \item \texttt{valida.php: } This module is called when a professional
8691 user wants to validate words that are in the queue of entries
8692 pending validation. It consults the file of words to be validated
8693 reading them one by one; it takes the data of the entry in turn
8694 (\textit{LRlem, RLlem, paradigmaLR, paradigmaRL, LR, RL}, etc.) and
8695 calls \textit{selecc.php} to continue with the insertion process of
8696 that specific entry.
8697 \item \texttt{insertarPro.php: } This module is called when the
8698 paradigms for the SL word and the TL word have already been selected
8699 (which was done in \textit{selecc.php}), and displays what the
8700 resulting \texttt{XML} entries will look like for the three
dictionaries (SL monolingual, bilingual and TL monolingual). From
8702 this screen it is possible to directly modify the code, and finally to
8703 accept the new entry or to cancel the operation.
8704 \item \texttt{ins\_multip.php: } It has the same function as
8705 \textit{insertarPro.php} but it is designed for multiword entries,
8706 therefore, the entry is treated differently so that the inserted
8707 \texttt{XML} code is correct.
8708 \item \texttt{insertar.php: } This module is equivalent to
8709 \textit{insertarPro.php} but for non-professional users. The actions
it performs are much simpler, since the module just adds the
8711 lemmas and the paradigms selected by the non-professional user to the
8712 file of words to be validated; they remain in this file until a
8713 professional user validates them.
8714 \item \texttt{verSemi.php: } This module displays the file of entries
8715 inserted by non-professional users which are waiting for
validation. It is useful for professional users who, before starting
to validate words, want to see which words are in the queue waiting
for validation. It can be called from a link displayed in the form
generated by \textit{selecc.php}.
8720 \item \texttt{paradigmas.xsl:} Style sheet used to generate the
8721 paradigm files that are used by the form modules. It is used with the
8722 specification of paradigms of a language written in \texttt{XML}
format. This is explained in more detail in Section~\ref{paradigm},
\textit{Paradigm files}.
8725 \item \texttt{creaparadigma.awk:} \texttt{awk} file used also to
8726 generate the mentioned paradigm files.
8727 \item \texttt{gen\_paradig.sh:} Script that can be used if we want to
8728 generate automatically the paradigm files for all the language pairs
8729 installed in our system.
8730 \end{itemize}
8731 \end{itemize}
8733 In the next sections you will find a detailed description of the tasks
8734 of each module.
8736 \subsection{Php files}
8738 \subsubsection{resultado.php}
8740 Depending on the value of the variable \texttt{\$nomtrad} updated by
8741 \textit{index.php}, the module assigns the appropriate values to
8742 \texttt{\$LR} and \texttt{\$RL} (source language and target language
8743 respectively). Then, according to the part of speech of the word being
inserted, the variable \texttt{\$tipo} is assigned the appropriate value, and
then \textit{selecc.php} or \textit{multip.php} is called depending on
whether the word is a simple unit or a multiword unit. \nota{MG:
8747 ``asignamos'' i ``llamamos'' no seria més aviat ``se asigna'' y ``se
8748 llama''?}
8750 \subsubsection{selecc.php}
8751 \label{ss:fitxersphp}
8753 The function of this module is the selection of a paradigm for the
8754 words being inserted. The user will have to select a paradigm for the
8755 SL word and another one for the TL word.
There is a group of variables which, depending on the part of speech
8758 of the word, are assigned certain values that will be used at the end
8759 \nota{MG: "que darrerament s'utilitzaran" vol dir 'que s'utilitzaran
8760 al final'?}; these variables are:
8761 \begin{itemize}
8762 \item \texttt{cadFich:} part of speech of the lemma.
8763 \item \texttt{show:} string displayed in the form that indicates the
8764 part of speech of the word being inserted.
8765 \item \texttt{tag:} string with the \texttt{XML} tag output by the
8766 morphological analyser for this part of speech.
8767 \item \texttt{tagout:} string with the \texttt{XML} code that shows
8768 the part of speech of the word. This string will be used when building
8769 the final \texttt{XML} entry that will be inserted in the dictionary.
8770 \item \texttt{nota:} string with possible comments to be inserted in
8771 the \texttt{XML} code of the entry.
\end{itemize}

The forms work with four kinds of dictionaries:
8773 \begin{itemize}
8774 \item \textit{Semi-professional dictionaries}: They contain the words
8775 inserted from the form by non-professional users and which are pending
8776 validation. Their extension is "\textit{semi.dic}"
8777 \item \textit{Form dictionaries}: They contain the words inserted from
8778 the form by professional users, and also the ones that have been
8779 validated from the semi-professional dictionaries. Their extension is
8780 "\textit{webform}".
8781 \item \textit{Final dictionaries}: The files with all the entries
8782 written in \texttt{XML} code. These are the files finally used by the
8783 translator after being compiled. Their extension is "\textit{dix}".
8784 \item \textit{Final compiled dictionaries}: These are the compiled
8785 final dictionaries, which can already be used by the binaries of the
8786 translator. Their extension is "\textit{bin}"
8787 \end{itemize}
8789 All these dictionaries are used by the forms; there are variables that
8790 contain the paths to them. Values are also assigned to variables that
8791 manage the paths to the auxiliary and the configuration files:
8792 \begin{itemize}
8793 \item \texttt{path:} path to the temporary dictionaries.
8794 \item \texttt{fich\_LR:} source language dictionary with the words
8795 inserted from the form that are not yet in the final dictionary nor in
8796 the compiled dictionary.
8797 \item \texttt{fich\_RL:} target language dictionary with the words
8798 inserted from the form that are not yet in the final dictionary nor
8799 in the compiled dictionary. \nota{MG: I don't like speaking of SL and
8800 TL dictionaries, entries are for both directions, I think this is
8801 confusing. It should be changed in the whole chapter.}
8802 \item \texttt{fich\_LRRL:} bilingual dictionary with the words
8803 inserted from the form that are not yet in the final dictionary nor in
8804 the compiled dictionary.
8805 \item \texttt{fich-semi:} entries inserted from the form by
8806 non-professional users and which are pending validation.
8807 \item \texttt{path\_paradigmasLR:} path to the files that contain the
8808 inflection paradigms of the source language.
8809 \item \texttt{path\_paradigmasRL:} path to the files that contain the
8810 inflection paradigms of the target language.
8811 \item \texttt{anmor:} path to the morphological analyser.
8812 \item \texttt{aut\_LRRL:} path to the bilingual binary from source
8813 language to target language.\nota{MG: the original said "binario
8814 morfológico", I think it's an error, I wrote 'bilingual binary'}
8815 \item \texttt{aut\_RLLR:} path to the bilingual binary from target
8816 language to source language.\nota{MG: ídem ("bilingual").}
8817 \end{itemize}
Then the HTML code is inserted with the operations to be performed
8820 depending on the selected action. The actions performed by the module
8821 are the following, in sequential order:
8823 \begin{itemize}
\item Tests that the source language lemma being inserted is not
already in the dictionaries containing the words inserted from the
form. If \texttt{selecc.php} has been called from the word validation
screen (\texttt{valida.php}), then the module tests that the lemma is
not already in the file of words inserted by non-professional
users. The same test is also run against the final dictionary.
\item Performs the same test for the target language.
\item Writes the code for selecting translation direction restrictions.
\item Defines a series of functions that will be used when
generating the examples for the lemmas after the selection of the
appropriate paradigm. These are:
8835 \begin{itemize}
8836 \item \texttt{esVocalFuerte}
8837 \item \texttt{esVocalDebil}
8838 \item \texttt{esVocal}
8839 \item \texttt{PosicioVocalTall}
8840 \end{itemize} These functions are described later in section
8841 \ref{insertarpro}.
8842 \item The paradigm file is opened to display a drop-down box with the
8843 paradigms that can be selected for the source language lemma. To do
8844 this, the program has to test sequentially the paradigms defined for
8845 the part of speech of the lemma, checking whether the paradigm can be
8846 applied to the lemma in question.
8847 \item Then the same is done with the paradigms for the target language
8848 lemma.
\item After the lemmas and the corresponding paradigms have been
selected, examples must be generated to show how these lemmas would
be inflected according to the selected paradigms. To do this, we
need the root of the lemma (\texttt{raiz\_LR} and \texttt{raiz\_RL}), as
well as the example endings for the selected paradigm
(\texttt{paradigma\_LR} and \texttt{paradigma\_RL}); these endings are
obtained from the paradigm file. Finally, a string is built
containing the generated examples (\texttt{ejemplos\_LR} and
\texttt{ejemplos\_RL}), and these are displayed.
\item If we arrived at this screen because we were validating words
(\texttt{va\-li\-da=1}), then a button is added to the form, which
allows us to delete the current entry if we decide not to validate it.
\item If the user who arrived at this screen is a professional user,
then a button is added to the form, which allows the user to select
the option for the validation of words entered by non-professional
users.
8865 \item Finally, after one of the action buttons located at the bottom
8866 of the form is pressed, the applicable actions are performed. If the
8867 chosen action is \textit{"Delete"}, which can only be the case if the
8868 user is validating entries, the current entry is deleted from the file
8869 of entries made by non-professional users. If the chosen action is a
8870 confirmation (\textit{"Go on"} button), the module
8871 \texttt{insertarPro.php} or \texttt{insertar.php} is called, depending
8872 on whether the user is professional or non-professional respectively.
8873 These modules are in charge of inserting the words in the
8874 dictionaries.
\end{itemize} After the entry has been inserted, the page
\texttt{va\-li\-da.php} or the page \texttt{selecc.php} is displayed
again, depending on whether the user was performing a validation
(in which case \textit{valida=1}) or a normal insertion.
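
The example-generation step of this module can be sketched as
follows. This is a minimal Python illustration, not the PHP code of
the module: the function names \texttt{raiz\_de} and
\texttt{generar\_ejemplos} are hypothetical, and we assume that a
paradigm can be applied to a lemma when the lemma ends with the first
example ending of that paradigm.

\begin{verbatim}
def raiz_de(lema, terminaciones):
    """Return the root of lema, or None if the paradigm
    (given by its example endings) cannot be applied."""
    primera = terminaciones[0]
    if not lema.endswith(primera):
        return None
    # Strip the first ending; an empty ending keeps the lemma.
    return lema[:len(lema) - len(primera)]

def generar_ejemplos(lema, terminaciones):
    """Build the string of inflected examples for a lemma."""
    raiz = raiz_de(lema, terminaciones)
    if raiz is None:
        return None
    return ", ".join(raiz + t for t in terminaciones)
\end{verbatim}

Under these assumptions, the lemma \emph{notable} with the endings of
an adjective paradigm such as (empty, \texttt{s}) would give the
examples \emph{notable, notables}, whereas a paradigm whose first
ending does not match the lemma would be rejected.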
8880 \subsubsection{multip.php}
The code and behaviour of this module are the same as in
\textit{selecc.php}, except that this module is
designed for managing multiword units, whereas \textit{selec.php}
manages all other units. The main difference is therefore the
existence of the variables \texttt{\$LRcua} and \texttt{\$RLcua},
which contain the invariable tail that follows the variable part
of a multiword. When the examples are displayed, besides the
variable part inflected according to the selected paradigm, an
editable text box is also displayed with the invariable tail.

When the button to continue with the insertion of the entry in the
dictionaries is pressed, the module \textit{ins\_multip.php} is called
instead of \textit{insertarPro.php}.

8897 \subsubsection{valida.php}
8899 This module is called when a professional user presses the button
8900 "\textit{validate pairs}". It reads the dictionary of entries pending
8901 validation (\$fichSemi) for the applicable language pair. Then, the
8902 module enters a loop that goes through this file and reads the entries
8903 one by one. With the information of a given entry, it assigns values
8904 to a set of variables that will be used in the modules that will
8905 perform the subsequent actions. These variables are, for example:
8906 \begin{center}
8907 % use packages: array
8908 \begin{tabular}{ll}
8909 \$LRlem & \$RLlem \\
8910 \$paradigmaLR & \$paradigmaRL \\
8911 \$direccions & \$tipo \\
8912 \$comentarios & \$user \\
8913 \$geneLR & \$geneRL \\
8914 \$numLR & \$numRL \\
8915 \$LR & \$RL
8916 \end{tabular}
8917 \end{center}
8919 Once the appropriate values for these variables have been established,
8920 the module \textit{selec.php} comes into action and treats the entries
8921 as if they were made by a professional user. After inserting the
8922 entries in the dictionaries by means of \textit{insertarPro.php}, the
8923 flow returns to \textit{valida.php}, which proceeds to the next entry
8924 to be validated.
8926 \subsubsection{insertarPro.php}
8927 \label{insertarpro}
8929 After the lemmas have been entered and their paradigms selected in
8930 \textit{selec.php}, this is the module that generates the
8931 corresponding \texttt{XML} entries and inserts them in the monolingual
8932 dictionaries and the bilingual dictionary.

It performs many operations similar to those performed in
\textit{selec.php}, such as generating the examples for the inflected
word. First, it gives values to \texttt{cadFich, show, tag,
tagout, nota} depending on the part of speech (\texttt{\$tipus}) of
the word being inserted. It then assigns paths to the file location
variables and, as in \textit{selec.php}, defines the following
helper functions:
8941 \begin{itemize}
8942 \item \texttt{esVocalFuerte}: Returns \textit{true} if the vowel is
8943 strong, that is, \textit{a, e, o}.
\item \texttt{esVocalDebil}: Returns \textit{true} if the vowel is
weak, that is, \textit{i, u}.
8946 \item \texttt{esVocal}: Returns \textit{true} if the character passed
8947 as an argument is a vowel.
8948 \item \texttt{diptongo}: Returns \textit{true} if the two letters
8949 passed as an argument make a diphthong. This will be the case when at
8950 least one of the two vowels is not strong.
\item \texttt{acentuar}: Receives a text string and places the written
accent on it according to the Spanish accentuation rules, depending on
the parameter \textit{\$siguienteletra}. \nota{MG: only for Spanish?}
\item \texttt{esMayuscula}: Returns \textit{true} if the character is
in upper case.
\item \texttt{TieneAcento}: Returns \textit{true} if the string has an
accent.
\item \texttt{acentua}: Places an open (grave) or closed (acute)
accent on the last vowel of a word that can take one, depending on
the direction specified in the parameter \$sentit.\nota{MG: then not
only for Spanish but also for Catalan or Occitan?}
8962 \item \texttt{PonQuitaAcento}: Inserts or removes the accent of the
8963 first string passed as an argument depending on whether the second
8964 string passed as an argument has an accent or not.
8965 \item \texttt{PosicioVocalTall}: Returns the position in the lemma
8966 (\$lema) for the vowel (\$vocal) that separates the root from the
8967 ending. The vowel is searched from the end to the beginning and the
8968 first occurrence of \$vocal is returned.
8969 \end{itemize}
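
As an illustration, the vowel helpers listed above could behave as in
the following Python sketch. The original functions are written in
PHP and their code is not reproduced here; the lower-case names, the
case-insensitive comparison, the fact that accented vowels are
ignored and the return value $-1$ when the vowel is not found are all
assumptions made for this example.

\begin{verbatim}
# Hypothetical Python counterparts of the PHP helpers.
STRONG = "aeo"   # strong vowels
WEAK = "iu"      # weak vowels

def es_vocal_fuerte(ch):
    """Return True if ch is a strong vowel (a, e, o)."""
    return len(ch) == 1 and ch.lower() in STRONG

def es_vocal_debil(ch):
    """Return True if ch is a weak vowel (i, u)."""
    return len(ch) == 1 and ch.lower() in WEAK

def es_vocal(ch):
    """Return True if ch is any (unaccented) vowel."""
    return es_vocal_fuerte(ch) or es_vocal_debil(ch)

def diptongo(first, second):
    """Two vowels form a diphthong unless both are strong."""
    if not (es_vocal(first) and es_vocal(second)):
        return False
    return not (es_vocal_fuerte(first)
                and es_vocal_fuerte(second))

def posicio_vocal_tall(lema, vocal):
    """Search from the end of lema; return the index of the
    first match for vocal, or -1 if there is none."""
    return lema.rfind(vocal)
\end{verbatim}

In this sketch, \texttt{posicio\_vocal\_tall("cantar", "a")} returns
the position of the second \emph{a}, the one that separates the root
\emph{cant} from the ending.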

Next, the same operations as in \textit{selec.php} are
performed. First, the module makes sure that the entry is not yet in
the dictionaries, and then generates the examples of the word
inflected according to the previously selected paradigm. After this,
using the information on the lemmas entered in \textit{selec.php}, it
builds a text string (\texttt{\$cad\_LR}) with the \texttt{XML} code
that is going to be inserted in the source language monolingual
dictionary. This string is displayed in a text box that
can be manually edited. The same process is followed to generate the
strings for the target language monolingual dictionary
(\texttt{\$cad\_RL}) and for the bilingual dictionary
(\texttt{\$cad\_bil}). Then the comments, if any, and the name of the
user making the entry are appended to these variables. Finally, the
form screen is completed by adding the buttons for accepting,
deleting and going back. The code that processes each of the possible
actions is at the end of the file:
8989 \begin{itemize}
8990 \item \texttt{Insert: } In this case, it makes some character
8991 replacements so that the entry has the right format in the
8992 dictionaries, and inserts the strings \texttt{\$cad\_LR, \$cad\_bil,
8993 \$cad\_RL} in the source monolingual, bilingual and target
monolingual dictionaries respectively (\texttt{\$fich\_LR,
\$fich\_LRRL, \$fich\_RL}). If an error occurs when inserting the
entry, a warning message is displayed. If \textit{insertarPro.php}
8997 was called from a word validation process (\textit{\$valida=1}),
8998 then the button "\textit{Continue}" is inserted to continue with the
8999 validation. If this is not the case, then a button to close the
9000 window is inserted, to allow the user to end the process.
9001 \item \texttt{Delete: } It deletes the entry from the file of entries
9002 pending validation.
9003 \end{itemize}
9005 \subsubsection{ins\_multip.php}

It performs the same actions as \textit{insertarPro.php}, but it is
intended for multiword units. The main difference is the existence of
two additional variables, \texttt{\$LRcua} and \texttt{\$RLcua}, which
contain the invariable tail of a multiword. When the entry is added to
the dictionaries, this tail has to be inserted in the right place and
its blanks have to be turned into \texttt{<b/>} tags.
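
The conversion of blanks into \texttt{<b/>} tags can be illustrated
with a short Python sketch. The function name \texttt{cola\_a\_xml} is
hypothetical, and escaping the text with
\texttt{xml.sax.saxutils.escape} is our own precaution, not something
the module is known to do. Where the resulting string is placed
inside the entry is not shown here.

\begin{verbatim}
from xml.sax.saxutils import escape

def cola_a_xml(cola):
    """Turn every blank of the invariable part of a
    multiword into a <b/> tag, escaping the text pieces."""
    return "<b/>".join(escape(p) for p in cola.split(" "))
\end{verbatim}

With this sketch, an invariable part such as \texttt{" de menos"}
(with its leading blank) becomes \texttt{<b/>de<b/>menos}.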
9014 \subsubsection{insertar.php}

This module is very simple: it builds a text string containing the
fields provided by \textit{selec.php}, separated by
tabs. This string contains all the information required to generate a
dictionary entry:

9021 \texttt{\$LRlem.\$RLlem.\$paradigmaLR.\$direccion.\$paradigmaRL.}
9024 \texttt{\$tipo.\$comentarios.\$user.\$geneLR.\$geneRL.}
9028 This entry is saved in a file (\$fichSemi) that contains the queue
9029 with the entries waiting for validation inserted by non-professional
9030 users. When a professional user wishes to validate pending entries,
9031 the \textit{valida.php} module will read from this file.
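
The tab-separated record written by this module, and read back during
validation, can be sketched in Python as follows. The field order is
the one listed above; the function names, the use of a dictionary to
hold the values and the rejection of tabs or line breaks inside a
field are assumptions made for this illustration.

\begin{verbatim}
# Field order taken from the description above.
CAMPOS = ("LRlem", "RLlem", "paradigmaLR", "direccion",
          "paradigmaRL", "tipo", "comentarios", "user",
          "geneLR", "geneRL")

def construir_linea(entrada):
    """Join the fields of an entry (a dict) with tabs."""
    valores = [str(entrada.get(c, "")) for c in CAMPOS]
    if any("\t" in v or "\n" in v for v in valores):
        raise ValueError("tab or newline inside a field")
    return "\t".join(valores)

def leer_linea(linea):
    """Split a pending-validation line back into fields."""
    valores = linea.rstrip("\n").split("\t")
    return dict(zip(CAMPOS, valores))
\end{verbatim}

Rejecting tabs inside a field keeps each record on a single line with
a fixed number of columns, so that the file can be read entry by
entry.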
9034 \subsubsection{verSemi.php}

It displays the file of entries waiting for validation. To do so,
it reads the file containing the entries (\textit{\$fichSemi}) and
loops over all the entries of the file. For each entry,
it displays a line with the following information:

9041 \texttt{\$LRlem
9042 \$paradigmaLR
9043 \$direccion
9044 \$RLlem}
9046 \texttt{\$paradigmaRL
9047 \$tipo
9048 \$comentarios}
9050 \subsection{Dictionary files}

The files containing the entries inserted from the form are saved in
\texttt{/dics}. There are two kinds of files here:

9055 \begin{itemize}
9056 \item \texttt{apertium-ll-rr.xx.webform}: This is the file that
9057 contains the entries in \texttt{XML} code, ready to be copied to the
final dictionaries. The file name follows the pattern shown above,
9059 where \texttt{ll-rr} are the initials of the language pair of the
9060 translator and \texttt{xx} the initials of the language of the
9061 monolingual dictionary or the languages of the bilingual dictionary
9062 referred to, as applicable. For example, the initials of the
9063 Spanish-Catalan translator are \texttt{es-ca}. For this translator, we
9064 have the Spanish monolingual (\texttt{es}), the Catalan monolingual
9065 (\texttt{ca}) and the bilingual (\texttt{es-ca})
9066 dictionaries. Therefore, this directory will contain the following
9067 files for the Spanish-Catalan translator:
9068 \begin{center} \texttt{apertium-es-ca.es.webform
9069 apertium-es-ca.ca.webform apertium-es-ca.es-ca.webform}
9070 \end{center}
9073 \item \texttt{oo-mm.semi.dic}: This is the file containing the entries
9074 pending validation for a given language pair. \texttt{oo-mm} are the
9075 initials of the pair. For example, for the Spanish-Catalan translator
9076 this file would be: \texttt{es-ca.semi.dic}
9079 \end{itemize}
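
The naming scheme above can be summarised with a small Python sketch.
The two function names are hypothetical; only the file name patterns
come from the description.

\begin{verbatim}
def webform_files(pair):
    """Names of the files holding the XML entries for a
    language pair such as 'es-ca'."""
    left, right = pair.split("-")
    return ["apertium-%s.%s.webform" % (pair, code)
            for code in (left, right, pair)]

def pending_file(pair):
    """Name of the file of entries pending validation."""
    return "%s.semi.dic" % pair
\end{verbatim}

For the pair \texttt{es-ca}, the first function returns the three
\texttt{.webform} file names listed above and the second one returns
\texttt{es-ca.semi.dic}.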
9081 \subsection{Paradigm files}
9082 \label{paradigm}
9084 The paradigms used for each language pair are specified in two
9085 \texttt{XML} files named \texttt{paradig.ll-rr.xx.xml}, where
9086 \texttt{xx} are the initials of the language and \texttt{ll-rr} the
9087 initials of the language pair. These files consist of a set of entries
9088 describing the paradigms or inflection models for the words of a given
9089 language. The \texttt{XML} file has the following parts:
9090 \begin{itemize}
9091 \item Head/root of the specification file.\\
9092 \begin{alltt}
9093 <?xml version="1.0" encoding="ISO-8859-1"?>
9094 <?xml-stylesheet type="text/xsl" href="paradigmas.xsl"?>
9095 <!DOCTYPE form SYSTEM "form.dtd">
9096 <form lang="oc" langpair="oc-ca">
9097 \end{alltt}
9098 The \textit{lang} attribute states the initials of the
9099 language for which paradigms are specified, and the \textit{langpair}
9100 attribute states the initials of the language pair of the translator
9101 for which the specification is made. It is required that the same
9102 directory containing the paradigm files contains the \texttt{form.dtd}
file, which is the DTD specifying these files. You can find this DTD
in Appendix~\ref{ss:dtdparadigmes}.
\item A set of elements that define the paradigms. To explain their
format, we reproduce the following example: \\
9107 \begin{alltt}
9108 <entry PoS="adj" nbr="sg_pl" gen="mf">
9109 <endings>
9110 <stem>amable</stem>
9111 <ending/>
9112 <ending>s</ending>
9113 </endings>
9114 <paradigms howmany="1">
9115 <par n="amable\_\_adj"/>
9116 </paradigms>
9117 </entry>
9118 \end{alltt}
Each paradigm is specified in an \texttt{<entry>} element.
This element can have three attributes:
9121 \begin{itemize}
9122 \item \textit{PoS}: the part of speech of the paradigm. It can take
9123 the values: acr, adj, adv, noun, pname, pr, verbo. \nota{also
9124 cnjadv?} It is mandatory for any part of speech.
9125 \item \textit{nbr}: the numbers admitted by the paradigm. It can
9126 take the values: sg, pl, sg\_pl, sp.
9127 \item \textit{gen}: the genders admitted by the paradigm. It can
9128 take the values: m, f, m f, mf.
\end{itemize} The \texttt{<entry>} element also has two child elements:
9130 \begin{itemize}
9131 \item \texttt{endings}: the root and the endings used to select the
9132 paradigm in the form and display the inflection examples.
\item \texttt{paradigms}: specification of the paradigm(s) that
define the inflection of an entry. It requires the attribute
\textit{howmany}, which specifies the number of paradigms used by
an entry. Each paradigm used is given on its own line, with the name
of the paradigm in the dictionary written in the following
format:
9140 \begin{center}
9141 \begin{alltt}
9142 <par n="long\_\_adj"/>
9143 \end{alltt}
9144 \end{center}
9145 \end{itemize}
9146 \end{itemize}
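
To make the structure of an \texttt{<entry>} element more tangible,
the following Python sketch reads the example entry shown above with
the standard \texttt{xml.etree.ElementTree} module. It is only an
illustration: the forms themselves process these files with an XSL
style sheet, and the function name \texttt{read\_entry} and the keys
of the returned dictionary are hypothetical.

\begin{verbatim}
import xml.etree.ElementTree as ET

ENTRY = """
<entry PoS="adj" nbr="sg_pl" gen="mf">
  <endings>
    <stem>amable</stem>
    <ending/>
    <ending>s</ending>
  </endings>
  <paradigms howmany="1">
    <par n="amable__adj"/>
  </paradigms>
</entry>
"""

def read_entry(xml_text):
    """Collect attributes, stem, endings and paradigm names."""
    entry = ET.fromstring(xml_text)
    endings = entry.find("endings")
    paradigms = entry.find("paradigms")
    return {
        "pos": entry.get("PoS"),
        "nbr": entry.get("nbr"),   # None if absent
        "gen": entry.get("gen"),   # None if absent
        "stem": endings.findtext("stem", default=""),
        # An empty <ending/> stands for the empty ending.
        "endings": [e.text or ""
                    for e in endings.findall("ending")],
        "howmany": int(paradigms.get("howmany")),
        "pars": [p.get("n")
                 for p in paradigms.findall("par")],
    }
\end{verbatim}

In this sketch the value of \textit{howmany} is expected to match the
number of \texttt{<par>} elements, which a caller could check before
using the entry.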
9148 From the \texttt{XML} paradigm file, it is necessary to generate the
9149 files directly used by the modules of the forms. Running the script
9150 \texttt{/private/gen\_paradig.sh}, the process is automatically done
9151 for all the available language pairs:
9152 \begin{alltt}
9153 # cd private
9154 # ./gen\_paradig.sh
9155 \end{alltt}
To add a new paradigm to the forms, add an appropriate entry
to the \texttt{XML} paradigm file and then run the
script above to update the working files.

The process can also be run manually if we do not want to
update the files for all the installed language pairs. In that case,
the working files are generated with an \texttt{XSL}
style sheet using the following command:
9164 \begin{alltt}
# xsltproc paradigmas.xsl paradigm\_file.xml |
      ./creaparadig.awk
9167 \end{alltt}

This action generates a working file for each part of speech. The
generated files are saved in the directories
\texttt{/private/paradigmas.ll-rr}. These directories contain the
files with the paradigms that can be used for each language pair
\texttt{ll-rr} and for each part of speech. Each of these
directories contains the following files:
9175 \begin{itemize}
9176 \item \texttt{paradigacr\_xx}: paradigms for acronyms in the language
9177 \texttt{xx}.
9178 \item \texttt{paradigadj\_xx}: paradigms for adjectives in the
9179 language \texttt{xx}.
9180 \item \texttt{paradigadv\_xx}: paradigms for adverbs in the language
9181 \texttt{xx}.
9182 \item \texttt{paradigcnjadv\_xx}: paradigms for adverbial conjunctions
9183 in the language \texttt{xx}.
\item \texttt{paradigcnjcoo\_xx}: paradigms for coordinating
conjunctions in the language \texttt{xx}.\nota{MG: this one is not on
the web page of the form}
\item \texttt{paradigcnjsub\_xx}: paradigms for subordinating
conjunctions in the language \texttt{xx}.\nota{same as above}
9189 \item \texttt{paradignoun\_xx}: paradigms for nouns in the language
9190 \texttt{xx}.
9191 \item \texttt{paradigpname\_xx}: paradigms for proper nouns in the
9192 language \texttt{xx}.
9193 \item \texttt{paradigpr\_xx}: paradigms for prepositions in the
9194 language \texttt{xx}.
9195 \item \texttt{paradigverb\_xx}: paradigms for verbs in the language
9196 \texttt{xx}.
9197 \end{itemize}
9199 The files consist of one entry per line. Each entry contains the
9200 following information:
9202 \begin{center} % use packages: array
9203 \begin{tabular}{lllll}
9204 \textit{examples} & \textit{number of paradigms} & \textit{model\_paradigms} & \textit{(numbers)} &
9205 \textit{(genders)}
9206 \end{tabular}
9207 \end{center}
9210 The separator used for the different parts of an entry is the tab.
9211 \begin{itemize}
9212 \item \textit{Examples}: the endings that will be used to generate the
9213 examples when the user chooses this paradigm as a model for the word
9214 being inserted.
\item \textit{Number of paradigms}: the number of dictionary paradigms
used to inflect words that follow this inflection model.
\item \textit{Model paradigms}: the name(s), in the dictionary, of
the paradigm(s) that will be used to inflect a new entry.
\item \textit{(Numbers)}: Only filled in for nouns, adjectives and
acronyms. Refers to the grammatical number in the paradigm.
\item \textit{(Genders)}: Only filled in for nouns, adjectives and
acronyms. Refers to the grammatical gender in the paradigm.
9223 \end{itemize}
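
A line of a working file could be read as in the following Python
sketch. The function name and the keys of the result are
hypothetical, and we assume that the two optional fields are simply
missing when they do not apply to the part of speech.

\begin{verbatim}
def parse_paradigm_line(line):
    """Split one tab-separated line of a working file."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) < 3:
        raise ValueError("expected at least three fields")
    return {
        "examples": parts[0],
        "howmany": int(parts[1]),
        "models": parts[2],
        # Optional fields, present only for some parts
        # of speech.
        "numbers": parts[3] if len(parts) > 3 else None,
        "genders": parts[4] if len(parts) > 4 else None,
    }
\end{verbatim}

A line with only three fields then yields \texttt{None} for the
number and gender information.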

Thus, for the Spanish-Catalan translator we would have the
directory \texttt{/private\-/paradigmas.es-ca}, which would contain two
\texttt{XML} files: \texttt{paradig.es-ca.es.xml} and
\texttt{paradig.es-ca.ca.xml}, specifying the paradigms used in each
language. From these files, you can generate all the paradigm files
for the language pair using the commands:
9231 \begin{alltt}
9232 # cd private/paradigmas.es-ca
# xsltproc ../paradigmas.xsl paradig.es-ca.es.xml |
      ../creaparadig.awk
# xsltproc ../paradigmas.xsl paradig.es-ca.ca.xml |
      ../creaparadig.awk
9237 \end{alltt}
9240 Or you can automatically generate them for all the language pairs,
9241 using:
9242 \begin{alltt}
9243 # ./private/gen\_paradig.sh
9244 \end{alltt}

One of the generated working files would be, for example,
\texttt{paradigverb\_ca}, which would contain the possible verb
paradigms for Catalan; one of its lines might be:

9250 \begin{center}
9251 \texttt{abra/çar /ço /ci 1 abalan/çar\_\_vblex}
9252 \end{center}
9254 that is generated from the \texttt{XML} entry:
9256 \begin{alltt}
9257 <entry PoS="verb">
9258 <endings>
9259 <stem>abra</stem>
9260 <ending>çar</ending>
9261 <ending>ço</ending>
9262 <ending>ci</ending>
9263 </endings>
9264 <paradigms howmany="1">
9265 <par n="abalan/çar\_\_vblex"/>
9266 </paradigms>
9267 </entry>
9268 \end{alltt}
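
The correspondence between the \texttt{XML} entry and the line of the
working file can be reproduced with the following Python sketch. It
is not the actual conversion, which is carried out by
\texttt{paradigmas.xsl} and \texttt{creaparadig.awk}; the function
name is hypothetical, and the way several paradigm names or the
optional number and gender fields are written is an assumption,
since the example above shows only one paradigm and no optional
fields.

\begin{verbatim}
import xml.etree.ElementTree as ET

VERB_ENTRY = """
<entry PoS="verb">
  <endings>
    <stem>abra</stem>
    <ending>çar</ending>
    <ending>ço</ending>
    <ending>ci</ending>
  </endings>
  <paradigms howmany="1">
    <par n="abalan/çar__vblex"/>
  </paradigms>
</entry>
"""

def entry_to_line(xml_text):
    """Build a tab-separated working-file line."""
    entry = ET.fromstring(xml_text)
    stem = entry.findtext("endings/stem", default="")
    endings = [e.text or ""
               for e in entry.findall("endings/ending")]
    # Stem, then each ending preceded by a slash.
    examples = stem + " ".join("/" + e for e in endings)
    paradigms = entry.find("paradigms")
    names = [p.get("n") for p in paradigms.findall("par")]
    fields = [examples, paradigms.get("howmany"),
              " ".join(names)]  # assumption: space-separated
    for attr in ("nbr", "gen"):  # assumption: appended last
        if entry.get(attr) is not None:
            fields.append(entry.get(attr))
    return "\t".join(fields)
\end{verbatim}

Applied to the verb entry above, the sketch produces the three fields
of the example line, separated by tabs.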
9272 \section{Using the forms}
9273 \label{ss:formus}
9274 \subsection{Introduction}
When a user wants to insert new entries in a
dictionary, he/she has to use a web browser to connect to the
address where the form server has been installed; for example:
\begin{center} \texttt{http://xixona.dlsi.ua.es/forms}
\end{center} A web page will be displayed with the access portal to
\texttt{Opentrad\- Apertium\- Insertion\- Form}. The left margin
contains links to get more \textit{information}, \textit{download}
the programs and \textit{contact} the administrator of the forms to
request registration as a system user. To register as a user you will
have to send an e-mail to the administrator.

\nota{Change \emph{registrar} to \emph{inscribir} everywhere.} To
insert new words, you will have to enter the required data in the
form and press the \textit{'Go On'} button; at this point you will
have to identify yourself as a registered user, or else you will not
be able to continue. There are two user registration types: you can be
registered as a \emph{professional} or as a \emph{non-professional}
user. Each mode offers different functionality, which is explained in
the following section.

9297 \subsection{Insertion of entries}
9298 \label{insertion}
9300 \subsubsection{Professional mode}
9302 If you want to add a new entry to the dictionaries, you have to go to
9303 the section of the language pair you want to improve. There, you have
9304 to enter the source language lemma and the target language lemma, and
9305 select their part of speech. Press the \textit{Go on} button to
9306 continue.
9308 A new window is displayed, with the lemmas and some parameters used to
9309 define the entries. If the entry already exists in one of the
9310 dictionaries, a warning message is displayed and the system automatically
9311 selects one-way translation (from left to right or vice versa). If
9312 none of the dictionaries contain the entry, the entry will be entered
9313 for both directions.

In this window you can perform three actions:

\notavisible{The first sentence of the following paragraph needs to be
reviewed; it seems that some material should be deleted. Also,
the forms apparently do not currently support multiple
translations. This should be stated
somewhere.}
9322 \begin{itemize}
\item Choose the paradigm for the SL and the TL lemmas (this is
mandatory; the remaining actions are optional).\footnote{Choosing the
9325 paradigm has to be done very carefully. You have to choose the
9326 paradigm that describes exactly the grammatical and inflection
9327 characteristics of the inserted word. In the case of adjectives,
9328 nouns and acronyms, you have to select a paradigm that fits the
9329 inflection of the word and the genders it may present. For example,
9330 in the case of acronyms you have to consider the gender and the
9331 number admitted by each possible paradigm; the paradigm BBC, for
9332 example, is for feminine singular acronyms, whereas SA is for
9333 feminine acronyms that may have plural form. In the case of proper
9334 nouns, you have to choose a different paradigm depending on whether
9335 the word is a proper noun of a thing (e.g. a newspaper), a person or
9336 a place.}
\item Select the translation direction of the entry if it is different
from the one suggested automatically.
\item Add comments to the entry, which will be included in the final
dictionary.
9341 \end{itemize}

Once the required actions have been completed, press
\textit{'Go on'} to confirm the entry or \textit{'Close'}
to cancel the insertion.

The following and last screen displays the three generated
\texttt{XML} entries for the SL monolingual, TL monolingual and
bilingual dictionaries. These entries are displayed in three text
boxes that can be edited if you want to make any changes. Once you
have checked the entries, press the \textit{'Insert'} button to
insert them in the corresponding dictionaries. You can also press the
\textit{'Go back'} button to return to the previous step.

9355 \subsubsection{Non-professional mode}
9357 When a user enters the insertion system as a non-professional user,
9358 the word insertion mechanism is the same as for the professional user,
9359 with the difference that the entries will not be saved in the
9360 dictionaries generated by the forms, but will be entered in a queue of
9361 entries pending validation. The words in this queue will not be
9362 inserted in the dictionaries until a professional user validates them.
9364 \subsection{Validating entries}
9366 Professional users have two additional links in the screen for
9367 paradigm selection:
9368 \begin{itemize}
9369 \item \textit{See pairs to be validated}: Selecting this option will
9370 open a screen that displays the content of the file of entries pending
9371 validation; these are the entries inserted by non-professional
9372 users. This is a merely informative screen, which can be closed
9373 pressing the \textit{'Close'} button.
9374 \item \textit{Validate pairs}: This option allows a professional user
9375 to validate one by one the entries waiting for validation. Selecting
9376 this button will open the screen for the selection of paradigms
9377 already described in section \ref{insertion}. This screen will show
9378 the data selected by the user for the added entry. Now, the
9379 professional user can modify the lemmas, delete the entry or continue
9380 with the insertion process. If the user decides to proceed with the
9381 insertion, the process is the same as for a normal insertion; only at
9382 the end, when the entry is finally added to the dictionaries of the
9383 form, the control returns to the following entry of the queue pending
9384 validation and displays it.
9386 This process is repeated until all the words of the queue are
9387 validated or until the process is finished by selecting
9388 \textit{'Close'}.
9390 \end{itemize}
9394 \newpage
9395 \appendix
9397 \chapter[XML DTDs]{Document Type Definitions (DTD) in XML}
9398 \label{DTDs}
9400 \section{DTD for the format of dictionaries}
9401 \label{ss:dtd_dics}

Document type definition for the format of morphological, bilingual
and post-generation dictionaries in XML; this definition is provided
with the \texttt{apertium} package (latest version), which can be
downloaded from \url{http://www.sourceforge.net}.

9409 The description of its elements can be found in Section
9410 \ref{formatodics}.
9414 \begin{small}
9415 \begin{alltt}
9416 <!\textsl{ELEMENT} \textbf{dictionary} (alphabet?, sdefs?,
9417 pardefs?, section+)>
9419 <!\textsl{ELEMENT} \textbf{alphabet} (\textsl{#PCDATA})>
9421 <!\textsl{ELEMENT} \textbf{sdefs} (sdef+)>
9423 <!\textsl{ELEMENT} \textbf{sdef} \textsl{EMPTY}>
9424 <!\textsl{ATTLIST} sdef n ID \textsl{#REQUIRED}>
9426 <!\textsl{ELEMENT} \textbf{pardefs} (pardef+)>
9428 <!\textsl{ELEMENT} \textbf{pardef} (e+)>
9429 <!\textsl{ATTLIST} pardef n CDATA \textsl{#REQUIRED}>
9431 <!\textsl{ELEMENT} \textbf{section} (e+)>
9433 <!\textsl{ATTLIST} section id ID \textsl{#REQUIRED}
9434 type (standard|inconditional|postblank) \textsl{#REQUIRED}>
9436 <!\textsl{ELEMENT} \textbf{e} (i | p | par | re)+>
<!\textsl{ATTLIST} e r  (LR|RL) \textsl{#IMPLIED}
            lm CDATA   \textsl{#IMPLIED}
            a  CDATA   \textsl{#IMPLIED}
            c  CDATA   \textsl{#IMPLIED}>

9442 <!\textsl{ELEMENT} \textbf{par} \textsl{EMPTY}>
9443 <!\textsl{ATTLIST} par n CDATA \textsl{#REQUIRED}>
9445 <!\textsl{ELEMENT} \textbf{i} (\textsl{#PCDATA} | b | s | g | j | a)*>
9447 <!\textsl{ELEMENT} \textbf{re} (\textsl{#PCDATA})>
9449 <!\textsl{ELEMENT} \textbf{p} (l, r)>
9451 <!\textsl{ELEMENT} \textbf{l} (\textsl{#PCDATA} | a | b | g | j | s)*>
9453 <!\textsl{ELEMENT} \textbf{r} (\textsl{#PCDATA} | a | b | g | j | s)*>
9455 <!\textsl{ELEMENT} \textbf{a} \textsl{EMPTY}>
9457 <!\textsl{ELEMENT} \textbf{b} \textsl{EMPTY}>
9459 <!\textsl{ELEMENT} \textbf{g} (\textsl{#PCDATA} | a | b | j | s)*>
9460 <!\textsl{ATTLIST} g i CDATA \textsl{#IMPLIED}>
9462 <!\textsl{ELEMENT} \textbf{j} \textsl{EMPTY}>
9464 <!\textsl{ELEMENT} \textbf{s} \textsl{EMPTY}>
9466 <!\textsl{ATTLIST} s n \textsl{IDREF} \textsl{#REQUIRED}>
9468 \end{alltt}
9469 \end{small}
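
To relate the DTD above to an actual entry, the following Python
sketch builds a minimal \texttt{<e>} element with the standard
\texttt{xml.etree.ElementTree} module. The helper name
\texttt{make\_entry} and the lemma, stem and paradigm name used in
the test call are illustrative assumptions; only the element and
attribute names come from the DTD.

\begin{verbatim}
import xml.etree.ElementTree as ET

def make_entry(lemma, stem, paradigm, restriction=None):
    """Return the XML text of an <e> element made of an
    <i> (identity) part followed by a <par> reference."""
    e = ET.Element("e", {"lm": lemma})
    if restriction is not None:
        # The DTD only allows LR or RL for attribute r.
        if restriction not in ("LR", "RL"):
            raise ValueError("r must be LR or RL")
        e.set("r", restriction)
    i = ET.SubElement(e, "i")
    i.text = stem
    ET.SubElement(e, "par", {"n": paradigm})
    return ET.tostring(e, encoding="unicode")

print(make_entry("amable", "amable", "amable__adj"))
\end{verbatim}

The content model of \texttt{<e>} also allows \texttt{<p>} and
\texttt{<re>} children, which this sketch does not cover.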
9472 \subsection{Modification of the DTD of dictionaries for lexical
9473 selection}
9474 \label{dixdtd}

The DTD for the format of dictionaries has been slightly modified so
that dictionaries can be used in a system that has a lexical selection
module. The change only affects the attributes of the \texttt{<e>}
element and is shown below.

9483 \begin{small}
9484 \begin{alltt}
<!\textsl{ATTLIST} e
          r   (LR|RL) \textsl{#IMPLIED}
          lm  \textsl{CDATA #IMPLIED}
          a   \textsl{CDATA #IMPLIED}
          c   \textsl{CDATA #IMPLIED}
          i   CDATA \textsl{#IMPLIED}
          slr CDATA \textsl{#IMPLIED}
          srl CDATA \textsl{#IMPLIED}>

9496 <!-- r: restriction LR: left-to-right,
9497 RL: right-to-left -->
9498 <!-- lm: lemma -->
9499 <!-- a: author -->
9500 <!-- c: comment -->
<!-- i: ignore ('yes' means ignore; otherwise it is not ignored) -->
9502 <!-- slr: translation sense when translating from left to right -->
9503 <!-- srl: translation sense when translating from right to left -->
9506 \end{alltt}
9507 \end{small}
9512 \section[DTD for the tagger file]{DTD for the format of the tagger
9513 file}
9514 \label{ss:DTD_desambiguador}

DTD that defines the format of the tagger specification file. This
definition is provided with the \texttt{apertium} package (latest
version), which can be downloaded from
\url{http://www.sourceforge.net}.

9521 The description of its elements can be found in
9522 Section~\ref{formatotagger}.
9524 \begin{small}
9525 \begin{alltt}
9526 <!\textsl{ELEMENT} \textbf{tagger} (tagset,forbid?,enforce-rules?,preferences?)>
9527 <!\textsl{ATTLIST} tagger name \textsl{CDATA} \textsl{#REQUIRED}>
9529 <!\textsl{ELEMENT} \textbf{tagset} (def-label+,def-mult*)>
9531 <!\textsl{ELEMENT} \textbf{def-label} (tags-item+)>
9532 <!\textsl{ATTLIST} def-label name \textsl{CDATA} \textsl{#REQUIRED}
9533 closed \textsl{CDATA} \textsl{#IMPLIED}>
<!\textsl{ELEMENT} \textbf{tags-item} \textsl{EMPTY}>
9536 <!\textsl{ATTLIST} tags-item tags \textsl{CDATA} \textsl{#REQUIRED}
9537 lemma \textsl{CDATA} \textsl{#IMPLIED}>
9539 <!\textsl{ELEMENT} \textbf{def-mult} (sequence+)>
9540 <!\textsl{ATTLIST} def-mult name \textsl{CDATA} \textsl{#REQUIRED}
9541 closed \textsl{CDATA} \textsl{#IMPLIED}>
9543 <!\textsl{ELEMENT} \textbf{sequence} ((tags-item|label-item)+)>
<!\textsl{ELEMENT} \textbf{label-item} \textsl{EMPTY}>
9546 <!\textsl{ATTLIST} label-item label \textsl{CDATA} \textsl{#REQUIRED}>
9548 <!\textsl{ELEMENT} \textbf{forbid} (label-sequence+)>
9550 <!\textsl{ELEMENT} \textbf{label-sequence} (label-item+)>
9552 <!\textsl{ELEMENT} \textbf{enforce-rules} (enforce-after+)>
9554 <!\textsl{ELEMENT} \textbf{enforce-after} (label-set)>
9555 <!\textsl{ATTLIST} enforce-after label \textsl{CDATA} \textsl{#REQUIRED}>
9557 <!\textsl{ELEMENT} \textbf{label-set} (label-item+)>
9559 <!\textsl{ELEMENT} \textbf{preferences} (prefer+)>
9561 <!\textsl{ELEMENT} \textbf{prefer} \textsl{EMPTY}>
9562 <!\textsl{ATTLIST} prefer tags \textsl{CDATA} \textsl{#REQUIRED}>
9563 \end{alltt}
9564 \end{small}
\section[DTD of the chunker module]{DTD of the structural transfer
  module (chunker)}
\label{ss:dtdtransfer}

This is the DTD for the format of the structural transfer rules in the
\texttt{chunker} module. It is provided with the
\texttt{apertium} package (version 2.0), which can be downloaded from
\url{http://www.sourceforge.net}.

Its elements are described in Section~\ref{formatotransfer}.

9580 \begin{small}
9581 \begin{alltt}
9582 <!\textsl{ENTITY} \% condition "(and|or|not|equal|begins-with|
9583 ends-with|contains-substring|in)">
9584 <!\textsl{ENTITY} \% container "(var|clip)">
9585 <!\textsl{ENTITY} \% sentence "(let|out|choose|modify-case|
9586 call-macro|append)">
9587 <!\textsl{ENTITY} \% value "(b|clip|lit|lit-tag|var|get-case-from|
9588 case-of|concat)">
9589 <!\textsl{ENTITY} \% stringvalue "(clip|lit|var|get-case-from|
9590 case-of)">
9592 <!\textsl{ELEMENT} \textbf{transfer} (section-def-cats,
9593 section-def-attrs,
9594 section-def-vars,
9595 section-def-lists?,
9596 section-def-macros?,
9597 section-rules)>
9599 <!\textsl{ATTLIST} transfer default (lu|chunk) \textsl{#IMPLIED}>
9601 <!\textsl{ELEMENT} \textbf{section-def-cats} (def-cat+)>
9603 <!\textsl{ELEMENT} \textbf{def-cat} (cat-item+)>
9604 <!\textsl{ATTLIST} def-cat n ID \textsl{#REQUIRED}>
9606 <!\textsl{ELEMENT} \textbf{cat-item} \textsl{EMPTY}>
9607 <!\textsl{ATTLIST} cat-item lemma CDATA \textsl{#IMPLIED}
9608 tags CDATA \textsl{#REQUIRED} >
9610 <!\textsl{ELEMENT} \textbf{section-def-attrs} (def-attr+)>
9612 <!\textsl{ELEMENT} \textbf{def-attr} (attr-item+)>
9613 <!\textsl{ATTLIST} def-attr n ID \textsl{#REQUIRED}>
9615 <!\textsl{ELEMENT} \textbf{attr-item} \textsl{EMPTY}>
9616 <!\textsl{ATTLIST} attr-item tags CDATA \textsl{#IMPLIED}>
9618 <!\textsl{ELEMENT} \textbf{section-def-vars} (def-var+)>
9620 <!\textsl{ELEMENT} \textbf{def-var} \textsl{EMPTY}>
9621 <!\textsl{ATTLIST} def-var n ID \textsl{#REQUIRED}>
9623 <!\textsl{ELEMENT} \textbf{section-def-lists} (def-list)+>
9625 <!\textsl{ELEMENT} \textbf{def-list} (list-item+)>
9626 <!\textsl{ATTLIST} def-list n ID \textsl{#REQUIRED}>
9628 <!\textsl{ELEMENT} \textbf{list-item} \textsl{EMPTY}>
9629 <!\textsl{ATTLIST} list-item v CDATA \textsl{#REQUIRED}>
9631 <!\textsl{ELEMENT} \textbf{section-def-macros} (def-macro)+>
9633 <!\textsl{ELEMENT} \textbf{def-macro} (\%sentence;)+>
9634 <!\textsl{ATTLIST} def-macro n ID \textsl{#REQUIRED}>
9635 <!\textsl{ATTLIST} def-macro npar CDATA \textsl{#REQUIRED}>
9637 <!\textsl{ELEMENT} \textbf{section-rules} (rule+)>
9639 <!\textsl{ELEMENT} \textbf{rule} (pattern, action)>
9640 <!\textsl{ATTLIST} rule comment CDATA \textsl{#IMPLIED}>
9642 <!\textsl{ELEMENT} \textbf{pattern} (pattern-item+)>
9644 <!\textsl{ELEMENT} \textbf{pattern-item} \textsl{EMPTY}>
9645 <!\textsl{ATTLIST} pattern-item n \textsl{IDREF} \textsl{#REQUIRED}>
9647 <!\textsl{ELEMENT} \textbf{action} (\%sentence;)*>
9649 <!\textsl{ELEMENT} \textbf{choose} (when+,otherwise?)>
9651 <!\textsl{ELEMENT} \textbf{when} (test,(\%sentence;)*)>
9653 <!\textsl{ELEMENT} \textbf{otherwise} (\%sentence;)+>
9655 <!\textsl{ELEMENT} \textbf{test} (\%condition;)+>
9657 <!\textsl{ELEMENT} \textbf{and} ((\%condition;),(\%condition;)+)>
9659 <!\textsl{ELEMENT} \textbf{or} ((\%condition;),(\%condition;)+)>
9661 <!\textsl{ELEMENT} \textbf{not} (\%condition;)>
9663 <!\textsl{ELEMENT} \textbf{equal} (\%value;,\%value;)>
9664 <!\textsl{ATTLIST} equal caseless (no|yes) \textsl{#IMPLIED}>
9666 <!\textsl{ELEMENT} \textbf{begins-with} (\%value;,\%value;)>
9667 <!\textsl{ATTLIST} begins-with caseless (no|yes) \textsl{#IMPLIED}>
9669 <!\textsl{ELEMENT} \textbf{ends-with} (\%value;,\%value;)>
9670 <!\textsl{ATTLIST} ends-with caseless (no|yes) \textsl{#IMPLIED}>
9672 <!\textsl{ELEMENT} \textbf{contains-substring} (\%value;,\%value;)>
9673 <!\textsl{ATTLIST} contains-substring caseless (no|yes) \textsl{#IMPLIED}>
9675 <!\textsl{ELEMENT} \textbf{in} (\%value;, list)>
9676 <!\textsl{ATTLIST} in caseless (no|yes) \textsl{#IMPLIED}>
9678 <!\textsl{ELEMENT} \textbf{list} \textsl{EMPTY}>
9679 <!\textsl{ATTLIST} list n \textsl{IDREF} \textsl{#REQUIRED}>
9681 <!\textsl{ELEMENT} \textbf{let} (\%container;, \%value;)>
9683 <!\textsl{ELEMENT} \textbf{append} (\%value;)+>
9684 <!\textsl{ATTLIST} append n \textsl{IDREF} \textsl{#REQUIRED}>
9686 <!\textsl{ELEMENT} \textbf{out} (mlu|lu|b|chunk)+>
9688 <!\textsl{ELEMENT} \textbf{modify-case} (\%container;, \%stringvalue;)>
9690 <!\textsl{ELEMENT} \textbf{call-macro} (with-param)*>
9691 <!\textsl{ATTLIST} call-macro n \textsl{IDREF} \textsl{#REQUIRED}>
9693 <!\textsl{ELEMENT} \textbf{with-param} \textsl{EMPTY}>
9694 <!\textsl{ATTLIST} with-param pos CDATA \textsl{#REQUIRED}>
9696 <!\textsl{ELEMENT} \textbf{clip} \textsl{EMPTY}>
9697 <!\textsl{ATTLIST} clip pos CDATA \textsl{#REQUIRED}
9698 side (sl|tl) \textsl{#REQUIRED}
9699 part CDATA \textsl{#REQUIRED}
9700 queue CDATA \textsl{#IMPLIED}
9701 link-to CDATA \textsl{#IMPLIED}>
9703 <!\textsl{ELEMENT} \textbf{lit} \textsl{EMPTY}>
9704 <!\textsl{ATTLIST} lit v CDATA \textsl{#REQUIRED}>
9706 <!\textsl{ELEMENT} \textbf{lit-tag} \textsl{EMPTY}>
9707 <!\textsl{ATTLIST} lit-tag v CDATA \textsl{#REQUIRED}>
9709 <!\textsl{ELEMENT} \textbf{var} \textsl{EMPTY}>
9710 <!\textsl{ATTLIST} var n \textsl{IDREF} \textsl{#REQUIRED}>
9712 <!\textsl{ELEMENT} \textbf{get-case-from} (clip|lit|var)>
9713 <!\textsl{ATTLIST} get-case-from pos CDATA \textsl{#REQUIRED}>
9715 <!\textsl{ELEMENT} \textbf{case-of} \textsl{EMPTY}>
9716 <!\textsl{ATTLIST} case-of pos CDATA \textsl{#REQUIRED}
9717 side (sl|tl) \textsl{#REQUIRED}
9718 part CDATA \textsl{#REQUIRED}>
9720 <!\textsl{ELEMENT} \textbf{concat} (\%value;)+>
9722 <!\textsl{ELEMENT} \textbf{mlu} (lu+)>
9724 <!\textsl{ELEMENT} \textbf{lu} (\%value;)+>
9726 <!\textsl{ELEMENT} \textbf{chunk} (tags,(mlu|lu|b)+)>
9727 <!\textsl{ATTLIST} chunk name CDATA \textsl{#IMPLIED}
9728 namefrom CDATA \textsl{#IMPLIED}
9729 case CDATA \textsl{#IMPLIED}>
9731 <!\textsl{ELEMENT} \textbf{tags} (tag+)>
9732 <!\textsl{ELEMENT} \textbf{tag} (\%value;)>
9734 <!\textsl{ELEMENT} \textbf{b} \textsl{EMPTY}>
9735 <!\textsl{ATTLIST} b pos CDATA \textsl{#IMPLIED}>
9737 \end{alltt}
9738 \end{small}
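
The following sketch shows how these declarations fit together in a small rule file intended to be valid against the DTD above: one rule that matches a determiner followed by a noun and outputs both words inside a chunk. The category, attribute and variable names, the chunk name and tag, and the \texttt{part} values are assumptions made for the example. Since the \texttt{n} attributes of \texttt{def-cat}, \texttt{def-attr} and \texttt{def-var} are all declared as \texttt{ID}, their values must be unique across the whole file.

\begin{small}
\begin{alltt}
<transfer default="chunk">
  <section-def-cats>
    <def-cat n="det">
      <cat-item tags="det.*"/>
    </def-cat>
    <def-cat n="nom">
      <cat-item tags="n.*"/>
    </def-cat>
  </section-def-cats>
  <section-def-attrs>
    <def-attr n="gen">
      <attr-item tags="m"/>
      <attr-item tags="f"/>
      <attr-item tags="mf"/>
    </def-attr>
    <def-attr n="nbr">
      <attr-item tags="sg"/>
      <attr-item tags="pl"/>
      <attr-item tags="sp"/>
    </def-attr>
  </section-def-attrs>
  <section-def-vars>
    <def-var n="number"/>
  </section-def-vars>
  <section-rules>
    <rule comment="det nom">
      <pattern>
        <pattern-item n="det"/>
        <pattern-item n="nom"/>
      </pattern>
      <action>
        <out>
          <chunk name="det_nom">
            <tags>
              <tag><lit-tag v="SN"/></tag>
              <tag><clip pos="2" side="tl" part="gen"/></tag>
              <tag><clip pos="2" side="tl" part="nbr"/></tag>
            </tags>
            <lu><clip pos="1" side="tl" part="whole"/></lu>
            <b pos="1"/>
            <lu><clip pos="2" side="tl" part="whole"/></lu>
          </chunk>
        </out>
      </action>
    </rule>
  </section-rules>
</transfer>
\end{alltt}
\end{small}

The three sections \texttt{section-def-cats}, \texttt{section-def-attrs} and \texttt{section-def-vars} are mandatory and each needs at least one child, which is why a variable is declared here even though the rule does not use it.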
\newpage
\section{DTD of the interchunk module}
\label{ss:dtdinterchunk}

This is the DTD for the format of the structural transfer rules in the
\texttt{interchunk} module. It is provided with the
\texttt{apertium} package (version 2.0), which can be downloaded from
\url{http://www.sourceforge.net}.

Its elements are described in Section~\ref{formatotransfer}.

9754 \begin{small}
9755 \begin{alltt}
9757 <!\textsl{ENTITY} \% condition "(and|or|not|equal|begins-with|
9758 ends-with|contains-substring|in)">
9759 <!\textsl{ENTITY} \% container "(var|clip)">
9760 <!\textsl{ENTITY} \% sentence "(let|out|choose|modify-case|
9761 call-macro|append)">
9762 <!\textsl{ENTITY} \% value "(b|clip|lit|lit-tag|var|get-case-from|
9763 case-of|concat)">
9764 <!\textsl{ENTITY} \% stringvalue "(clip|lit|var|get-case-from|
9765 case-of)">
9767 <!\textsl{ELEMENT} \textbf{interchunk} (section-def-cats,
9768 section-def-attrs,
9769 section-def-vars,
9770 section-def-lists?,
9771 section-def-macros?,
9772 section-rules)>
9774 <!\textsl{ELEMENT} \textbf{section-def-cats} (def-cat+)>
9776 <!\textsl{ELEMENT} \textbf{def-cat} (cat-item+)>
9777 <!\textsl{ATTLIST} def-cat n ID \textsl{#REQUIRED}>
9779 <!\textsl{ELEMENT} \textbf{cat-item} \textsl{EMPTY}>
9780 <!\textsl{ATTLIST} cat-item lemma CDATA \textsl{#IMPLIED}
9781 tags CDATA \textsl{#REQUIRED} >
9783 <!\textsl{ELEMENT} \textbf{section-def-attrs} (def-attr+)>
9785 <!\textsl{ELEMENT} \textbf{def-attr} (attr-item+)>
9786 <!\textsl{ATTLIST} def-attr n ID \textsl{#REQUIRED}>
9788 <!\textsl{ELEMENT} \textbf{attr-item} \textsl{EMPTY}>
9789 <!\textsl{ATTLIST} attr-item tags CDATA \textsl{#IMPLIED}>
9791 <!\textsl{ELEMENT} \textbf{section-def-vars} (def-var+)>
9793 <!\textsl{ELEMENT} \textbf{def-var} \textsl{EMPTY}>
9794 <!\textsl{ATTLIST} def-var n ID \textsl{#REQUIRED}>
9796 <!\textsl{ELEMENT} \textbf{section-def-lists} (def-list)+>
9798 <!\textsl{ELEMENT} \textbf{def-list} (list-item+)>
9799 <!\textsl{ATTLIST} def-list n ID \textsl{#REQUIRED}>
9801 <!\textsl{ELEMENT} \textbf{list-item} \textsl{EMPTY}>
9802 <!\textsl{ATTLIST} list-item v CDATA \textsl{#REQUIRED}>
9804 <!\textsl{ELEMENT} \textbf{section-def-macros} (def-macro)+>
9806 <!\textsl{ELEMENT} \textbf{def-macro} (\%sentence;)+>
9807 <!\textsl{ATTLIST} def-macro n ID \textsl{#REQUIRED}>
9808 <!\textsl{ATTLIST} def-macro npar CDATA \textsl{#REQUIRED}>
9810 <!\textsl{ELEMENT} \textbf{section-rules} (rule+)>
9812 <!\textsl{ELEMENT} \textbf{rule} (pattern, action)>
9813 <!\textsl{ATTLIST} rule comment CDATA \textsl{#IMPLIED}>
9815 <!\textsl{ELEMENT} \textbf{pattern} (pattern-item+)>
9817 <!\textsl{ELEMENT} \textbf{pattern-item} \textsl{EMPTY}>
9818 <!\textsl{ATTLIST} pattern-item n \textsl{IDREF} \textsl{#REQUIRED}>
9820 <!\textsl{ELEMENT} \textbf{action} (\%sentence;)*>
9822 <!\textsl{ELEMENT} \textbf{choose} (when+,otherwise?)>
9824 <!\textsl{ELEMENT} \textbf{when} (test,(\%sentence;)*)>
9826 <!\textsl{ELEMENT} \textbf{otherwise} (\%sentence;)+>
9828 <!\textsl{ELEMENT} \textbf{test} (\%condition;)+>
9830 <!\textsl{ELEMENT} \textbf{and} ((\%condition;),(\%condition;)+)>
9832 <!\textsl{ELEMENT} \textbf{or} ((\%condition;),(\%condition;)+)>
9834 <!\textsl{ELEMENT} \textbf{not} (\%condition;)>
9836 <!\textsl{ELEMENT} \textbf{equal} (\%value;,\%value;)>
9837 <!\textsl{ATTLIST} equal caseless (no|yes) \textsl{#IMPLIED}>
9839 <!\textsl{ELEMENT} \textbf{begins-with} (\%value;,\%value;)>
9840 <!\textsl{ATTLIST} begins-with caseless (no|yes) \textsl{#IMPLIED}>
9842 <!\textsl{ELEMENT} \textbf{ends-with} (\%value;,\%value;)>
9843 <!\textsl{ATTLIST} ends-with caseless (no|yes) \textsl{#IMPLIED}>
9845 <!\textsl{ELEMENT} \textbf{contains-substring} (\%value;,\%value;)>
9846 <!\textsl{ATTLIST} contains-substring caseless (no|yes) \textsl{#IMPLIED}>
9848 <!\textsl{ELEMENT} \textbf{in} (\%value;, list)>
9849 <!\textsl{ATTLIST} in caseless (no|yes) \textsl{#IMPLIED}>
9851 <!\textsl{ELEMENT} \textbf{list} \textsl{EMPTY}>
9852 <!\textsl{ATTLIST} list n \textsl{IDREF} \textsl{#REQUIRED}>
9854 <!\textsl{ELEMENT} \textbf{let} (\%container;, \%value;)>
9856 <!\textsl{ELEMENT} \textbf{append} (\%value;)+>
9857 <!\textsl{ATTLIST} append n \textsl{IDREF} \textsl{#REQUIRED}>
9859 <!\textsl{ELEMENT} \textbf{out} (b|chunk)+>
9861 <!\textsl{ELEMENT} \textbf{modify-case} (\%container;, \%stringvalue;)>
9863 <!\textsl{ELEMENT} \textbf{call-macro} (with-param)*>
9864 <!\textsl{ATTLIST} call-macro n \textsl{IDREF} \textsl{#REQUIRED}>
9866 <!\textsl{ELEMENT} \textbf{with-param} \textsl{EMPTY}>
9867 <!\textsl{ATTLIST} with-param pos CDATA \textsl{#REQUIRED}>
9869 <!\textsl{ELEMENT} \textbf{clip} \textsl{EMPTY}>
9870 <!\textsl{ATTLIST} clip pos CDATA \textsl{#REQUIRED}
9871 part CDATA \textsl{#REQUIRED}>
9873 <!\textsl{ELEMENT} \textbf{lit} \textsl{EMPTY}>
9874 <!\textsl{ATTLIST} lit v CDATA \textsl{#REQUIRED}>
9876 <!\textsl{ELEMENT} \textbf{lit-tag} \textsl{EMPTY}>
9877 <!\textsl{ATTLIST} lit-tag v CDATA \textsl{#REQUIRED}>
9879 <!\textsl{ELEMENT} \textbf{var} \textsl{EMPTY}>
9880 <!\textsl{ATTLIST} var n \textsl{IDREF} \textsl{#REQUIRED}>
9882 <!\textsl{ELEMENT} \textbf{get-case-from} (clip|lit|var)>
9883 <!\textsl{ATTLIST} get-case-from pos CDATA \textsl{#REQUIRED}>
9885 <!\textsl{ELEMENT} \textbf{case-of} \textsl{EMPTY}>
9886 <!\textsl{ATTLIST} case-of pos CDATA \textsl{#REQUIRED}
9887 part CDATA \textsl{#REQUIRED}>
9889 <!\textsl{ELEMENT} \textbf{concat} (\%value;)+>
9891 <!\textsl{ELEMENT} \textbf{chunk} (\%value;)+>
9893 <!\textsl{ELEMENT} \textbf{pseudolemma} (\%value;)>
9895 <!\textsl{ELEMENT} \textbf{b} \textsl{EMPTY}>
9896 <!\textsl{ATTLIST} b pos CDATA \textsl{#IMPLIED}>
9898 \end{alltt}
9899 \end{small}
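
As a sketch of a rule file intended to be valid against this DTD, the following example matches a single chunk, stores one of its attributes in a variable and outputs the chunk unchanged. Compared with the chunker DTD, \texttt{clip} has no \texttt{side} attribute here, \texttt{out} may only contain \texttt{b} and \texttt{chunk}, and \texttt{chunk} is built directly from values. The names \texttt{SN}, \texttt{nbr} and \texttt{last\_nbr}, and the \texttt{part} values \texttt{lem}, \texttt{tags} and \texttt{chcontent}, are assumptions made for the example; the DTD only requires \texttt{part} to be character data.

\begin{small}
\begin{alltt}
<interchunk>
  <section-def-cats>
    <def-cat n="SN">
      <cat-item tags="SN.*"/>
    </def-cat>
  </section-def-cats>
  <section-def-attrs>
    <def-attr n="nbr">
      <attr-item tags="sg"/>
      <attr-item tags="pl"/>
    </def-attr>
  </section-def-attrs>
  <section-def-vars>
    <def-var n="last_nbr"/>
  </section-def-vars>
  <section-rules>
    <rule comment="SN: remember its number">
      <pattern>
        <pattern-item n="SN"/>
      </pattern>
      <action>
        <let>
          <var n="last_nbr"/>
          <clip pos="1" part="nbr"/>
        </let>
        <out>
          <chunk>
            <clip pos="1" part="lem"/>
            <clip pos="1" part="tags"/>
            <clip pos="1" part="chcontent"/>
          </chunk>
        </out>
      </action>
    </rule>
  </section-rules>
</interchunk>
\end{alltt}
\end{small}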
\newpage

\section{DTD of the postchunk module}
\label{ss:dtdpostchunk}

This is the DTD for the format of the structural transfer rules in the
\texttt{postchunk} module. It is provided with the
\texttt{apertium} package (version 2.0), which can be downloaded from
\url{http://www.sourceforge.net}.

Its elements are described in Section~\ref{formatotransfer}.

9915 \begin{small}
9916 \begin{alltt}
9917 <!\textsl{ENTITY} \% condition "(and|or|not|equal|begins-with|
9918 ends-with|contains-substring|in)">
9919 <!\textsl{ENTITY} \% container "(var|clip)">
9920 <!\textsl{ENTITY} \% sentence "(let|out|choose|modify-case|
9921 call-macro|append)">
9922 <!\textsl{ENTITY} \% value "(b|clip|lit|lit-tag|var|get-case-from|
9923 case-of|concat)">
9924 <!\textsl{ENTITY} \% stringvalue "(clip|lit|var|get-case-from|
9925 case-of)">
9927 <!\textsl{ELEMENT} \textbf{postchunk} (section-def-cats,
9928 section-def-attrs,
9929 section-def-vars,
9930 section-def-lists?,
9931 section-def-macros?,
9932 section-rules)>
9934 <!\textsl{ELEMENT} \textbf{section-def-cats} (def-cat+)>
9936 <!\textsl{ELEMENT} \textbf{def-cat} (cat-item+)>
9937 <!\textsl{ATTLIST} def-cat n ID \textsl{#REQUIRED}>
9939 <!\textsl{ELEMENT} \textbf{cat-item} \textsl{EMPTY}>
9940 <!\textsl{ATTLIST} cat-item name CDATA \textsl{#REQUIRED}>
9942 <!\textsl{ELEMENT} \textbf{section-def-attrs} (def-attr+)>
9944 <!\textsl{ELEMENT} \textbf{def-attr} (attr-item+)>
9945 <!\textsl{ATTLIST} def-attr n ID \textsl{#REQUIRED}>
9947 <!\textsl{ELEMENT} \textbf{attr-item} \textsl{EMPTY}>
9948 <!\textsl{ATTLIST} attr-item tags CDATA \textsl{#IMPLIED}>
9950 <!\textsl{ELEMENT} \textbf{section-def-vars} (def-var+)>
9952 <!\textsl{ELEMENT} \textbf{def-var} \textsl{EMPTY}>
9953 <!\textsl{ATTLIST} def-var n ID \textsl{#REQUIRED}>
9955 <!\textsl{ELEMENT} \textbf{section-def-lists} (def-list)+>
9957 <!\textsl{ELEMENT} \textbf{def-list} (list-item+)>
9958 <!\textsl{ATTLIST} def-list n ID \textsl{#REQUIRED}>
9960 <!\textsl{ELEMENT} \textbf{list-item} \textsl{EMPTY}>
9961 <!\textsl{ATTLIST} list-item v CDATA \textsl{#REQUIRED}>
9963 <!\textsl{ELEMENT} \textbf{section-def-macros} (def-macro)+>
9965 <!\textsl{ELEMENT} \textbf{def-macro} (\%sentence;)+>
9966 <!\textsl{ATTLIST} def-macro n ID \textsl{#REQUIRED}>
9967 <!\textsl{ATTLIST} def-macro npar CDATA \textsl{#REQUIRED}>
9969 <!\textsl{ELEMENT} \textbf{section-rules} (rule+)>
9971 <!\textsl{ELEMENT} \textbf{rule} (pattern, action)>
9972 <!\textsl{ATTLIST} rule comment CDATA \textsl{#IMPLIED}>
9974 <!\textsl{ELEMENT} \textbf{pattern} (pattern-item+)>
9976 <!\textsl{ELEMENT} \textbf{pattern-item} \textsl{EMPTY}>
9977 <!\textsl{ATTLIST} pattern-item n \textsl{IDREF} \textsl{#REQUIRED}>
9979 <!\textsl{ELEMENT} \textbf{action} (\%sentence;)*>
9981 <!\textsl{ELEMENT} \textbf{choose} (when+,otherwise?)>
9983 <!\textsl{ELEMENT} \textbf{when} (test,(\%sentence;)*)>
9985 <!\textsl{ELEMENT} \textbf{otherwise} (\%sentence;)+>
9987 <!\textsl{ELEMENT} \textbf{test} (\%condition;)+>
9989 <!\textsl{ELEMENT} \textbf{and} ((\%condition;),(\%condition;)+)>
9991 <!\textsl{ELEMENT} \textbf{or} ((\%condition;),(\%condition;)+)>
9993 <!\textsl{ELEMENT} \textbf{not} (\%condition;)>
9995 <!\textsl{ELEMENT} \textbf{equal} (\%value;,\%value;)>
9996 <!\textsl{ATTLIST} equal caseless (no|yes) \textsl{#IMPLIED}>
9998 <!\textsl{ELEMENT} \textbf{begins-with} (\%value;,\%value;)>
9999 <!\textsl{ATTLIST} begins-with caseless (no|yes) \textsl{#IMPLIED}>
10001 <!\textsl{ELEMENT} \textbf{ends-with} (\%value;,\%value;)>
10002 <!\textsl{ATTLIST} ends-with caseless (no|yes) \textsl{#IMPLIED}>
10004 <!\textsl{ELEMENT} \textbf{contains-substring} (\%value;,\%value;)>
10005 <!\textsl{ATTLIST} contains-substring caseless (no|yes) \textsl{#IMPLIED}>
10007 <!\textsl{ELEMENT} \textbf{in} (\%value;, list)>
10008 <!\textsl{ATTLIST} in caseless (no|yes) \textsl{#IMPLIED}>
10010 <!\textsl{ELEMENT} \textbf{list} \textsl{EMPTY}>
10011 <!\textsl{ATTLIST} list n \textsl{IDREF} \textsl{#REQUIRED}>
10013 <!\textsl{ELEMENT} \textbf{let} (\%container;, \%value;)>
10015 <!\textsl{ELEMENT} \textbf{append} (\%value;)+>
10016 <!\textsl{ATTLIST} append n \textsl{IDREF} \textsl{#REQUIRED}>
10018 <!\textsl{ELEMENT} \textbf{out} (b|lu|mlu)+>
10020 <!\textsl{ELEMENT} \textbf{modify-case} (\%container;, \%stringvalue;)>
10022 <!\textsl{ELEMENT} \textbf{call-macro} (with-param)*>
10023 <!\textsl{ATTLIST} call-macro n \textsl{IDREF} \textsl{#REQUIRED}>
10025 <!\textsl{ELEMENT} \textbf{with-param} \textsl{EMPTY}>
10026 <!\textsl{ATTLIST} with-param pos CDATA \textsl{#REQUIRED}>
10028 <!\textsl{ELEMENT} \textbf{clip} \textsl{EMPTY}>
10029 <!\textsl{ATTLIST} clip pos CDATA \textsl{#REQUIRED}
10030 part CDATA \textsl{#REQUIRED}>
10032 <!\textsl{ELEMENT} \textbf{lit} \textsl{EMPTY}>
10033 <!\textsl{ATTLIST} lit v CDATA \textsl{#REQUIRED}>
10035 <!\textsl{ELEMENT} \textbf{lit-tag} \textsl{EMPTY}>
10036 <!\textsl{ATTLIST} lit-tag v CDATA \textsl{#REQUIRED}>
10038 <!\textsl{ELEMENT} \textbf{var} \textsl{EMPTY}>
10039 <!\textsl{ATTLIST} var n \textsl{IDREF} \textsl{#REQUIRED}>
10041 <!\textsl{ELEMENT} \textbf{get-case-from} (clip|lit|var)>
10042 <!\textsl{ATTLIST} get-case-from pos CDATA \textsl{#REQUIRED}>
10044 <!\textsl{ELEMENT} \textbf{case-of} \textsl{EMPTY}>
10045 <!\textsl{ATTLIST} case-of pos CDATA \textsl{#REQUIRED}
10046 part CDATA \textsl{#REQUIRED}>
10048 <!\textsl{ELEMENT} \textbf{concat} (\%value;)+>
10050 <!\textsl{ELEMENT} \textbf{mlu} (lu+)>
10052 <!\textsl{ELEMENT} \textbf{lu} (\%value;)+>
10054 <!\textsl{ELEMENT} \textbf{b} \textsl{EMPTY}>
10055 <!\textsl{ATTLIST} b pos CDATA \textsl{#IMPLIED}>
10057 \end{alltt}
10058 \end{small}
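
The following sketch shows a small rule file intended to be valid against this DTD. It differs from the two previous formats in that \texttt{cat-item} identifies a chunk by its \texttt{name} attribute instead of by tags, and \texttt{out} may only contain \texttt{b}, \texttt{lu} and \texttt{mlu}. The chunk name \texttt{det\_nom}, the attribute and variable names and the \texttt{part} value \texttt{whole} are assumptions made for the example.

\begin{small}
\begin{alltt}
<postchunk>
  <section-def-cats>
    <def-cat n="det_nom">
      <cat-item name="det_nom"/>
    </def-cat>
  </section-def-cats>
  <section-def-attrs>
    <def-attr n="nbr">
      <attr-item tags="sg"/>
      <attr-item tags="pl"/>
    </def-attr>
  </section-def-attrs>
  <section-def-vars>
    <def-var n="unused"/>
  </section-def-vars>
  <section-rules>
    <rule comment="det_nom: output both words">
      <pattern>
        <pattern-item n="det_nom"/>
      </pattern>
      <action>
        <out>
          <lu><clip pos="1" part="whole"/></lu>
          <b pos="1"/>
          <lu><clip pos="2" part="whole"/></lu>
        </out>
      </action>
    </rule>
  </section-rules>
</postchunk>
\end{alltt}
\end{small}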
\newpage

\section[DTD for the format rules]{DTD for the format specification
  rules}
\label{ss:dtd_formato}

This is the DTD for the format specification rules. It can be
downloaded from
\url{http://cvs.sourceforge.net/viewcvs.py/apertium/apertium/apertium/format.dtd}. \nota{needs
  updating}

Its elements are described in Section~\ref{ss:reglasformato}.

10075 \begin{small}
10076 \begin{alltt}
10077 <!\textsl{ELEMENT} \textbf{format} (options,rules)>
10078 <!\textsl{ATTLIST} format name \textsl{CDATA} \textsl{#REQUIRED}>
10080 <!\textsl{ELEMENT} \textbf{options} (largeblocks, input, output,
10081 escape-chars, space-chars, case-sensitive)>
10083 <!\textsl{ELEMENT} \textbf{largeblocks} \textsl{EMPTY}>
10084 <!\textsl{ATTLIST} largeblocks size \textsl{CDATA} \textsl{#REQUIRED}>
10086 <!\textsl{ELEMENT} \textbf{input} \textsl{EMPTY}>
10087 <!\textsl{ATTLIST} input zip-path \textsl{CDATA} \textsl{#IMPLIED}
10088 encoding \textsl{CDATA} \textsl{#REQUIRED}>
10090 <!\textsl{ELEMENT} \textbf{output} \textsl{EMPTY}>
10091 <!\textsl{ATTLIST} output zip-path \textsl{CDATA} \textsl{#IMPLIED}
10092 encoding \textsl{CDATA} \textsl{#REQUIRED}>
10094 <!\textsl{ELEMENT} \textbf{escape-chars} \textsl{EMPTY}>
10095 <!\textsl{ATTLIST} escape-chars regexp \textsl{CDATA} \textsl{#REQUIRED}>
10097 <!\textsl{ELEMENT} \textbf{space-chars} \textsl{EMPTY}>
10098 <!\textsl{ATTLIST} space-chars regexp \textsl{CDATA} \textsl{#REQUIRED}>
10100 <!\textsl{ELEMENT} \textbf{case-sensitive} \textsl{EMPTY}>
10101 <!\textsl{ATTLIST} case-sensitive value (yes|no) \textsl{#REQUIRED}>
10103 <!\textsl{ELEMENT} \textbf{rules} (format-rule|replacement-rule)+>
10105 <!\textsl{ELEMENT} \textbf{format-rule} (begin-end|(begin,end))>
10106 <!\textsl{ATTLIST} format-rule eos (yes|no) \textsl{#IMPLIED}
10107 priority \textsl{CDATA} \textsl{#REQUIRED}>
10109 <!\textsl{ELEMENT} \textbf{begin-end} \textsl{EMPTY}>
10110 <!\textsl{ATTLIST} begin-end regexp \textsl{CDATA} \textsl{#REQUIRED}>
10112 <!\textsl{ELEMENT} \textbf{begin} \textsl{EMPTY}>
10113 <!\textsl{ATTLIST} begin regexp \textsl{CDATA} \textsl{#REQUIRED}>
10115 <!\textsl{ELEMENT} \textbf{end} \textsl{EMPTY}>
10116 <!\textsl{ATTLIST} end regexp \textsl{CDATA} \textsl{#REQUIRED}>
10118 <!\textsl{ELEMENT} \textbf{replacement-rule} (replace+)>
10119 <!\textsl{ATTLIST} replacement-rule regexp \textsl{CDATA} \textsl{#REQUIRED}>
10121 <!\textsl{ELEMENT} \textbf{replace} \textsl{EMPTY}>
10122 <!\textsl{ATTLIST} replace source \textsl{CDATA} \textsl{#REQUIRED}
10123 target \textsl{CDATA} \textsl{#REQUIRED}
10124 prefer (yes|no) \textsl{#IMPLIED}>
10126 \end{alltt}
10127 \end{small}
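
As an illustration, the following sketch is a small format specification intended to be valid against this DTD. All six children of \texttt{options} are mandatory and must appear in the order shown; a \texttt{format-rule} contains either a single \texttt{begin-end} or a \texttt{begin} followed by an \texttt{end}. The format name, block size, encodings, priorities and regular expressions are placeholders chosen for the example, not values taken from a real format file. Inside attribute values the characters \texttt{<} and \texttt{\&} have to be written as XML entities.

\begin{small}
\begin{alltt}
<format name="example">
  <options>
    <largeblocks size="8192"/>
    <input encoding="UTF-8"/>
    <output encoding="UTF-8"/>
    <escape-chars regexp="[@^$]"/>
    <space-chars regexp="[ ]"/>
    <case-sensitive value="no"/>
  </options>
  <rules>
    <format-rule eos="yes" priority="1">
      <begin regexp="&lt;pre>"/>
      <end regexp="&lt;/pre>"/>
    </format-rule>
    <format-rule priority="2">
      <begin-end regexp="&lt;[^>]+>"/>
    </format-rule>
    <replacement-rule regexp="&amp;[a-z]+;">
      <replace source="&amp;amp;" target="&amp;"/>
    </replacement-rule>
  </rules>
</format>
\end{alltt}
\end{small}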
\newpage
\section{DTD for the form paradigms}
\label{ss:dtdparadigmes}

This is the DTD for the format of the paradigm files used in the forms. It
is included in the \texttt{apertium-lexical-webform} package.

10137 \begin{small}
10138 \begin{alltt}
10141 <!\textsl{ELEMENT} \textbf{form} (entry)+>
10143 <!\textsl{ATTLIST} \textbf{form}
10144 lang CDATA \textsl{#REQUIRED}
10145 langpair CDATA \textsl{#REQUIRED}>
10147 <!\textsl{ELEMENT} \textbf{entry} (endings, paradigms)+>
10149 <!\textsl{ATTLIST} \textbf{entry}
10150 PoS CDATA \textsl{#REQUIRED}
10151 nbr CDATA \textsl{#IMPLIED}
10152 gen CDATA \textsl{#IMPLIED}>
10154 <!\textsl{ELEMENT} \textbf{endings} (stem, ending+)>
10156 <!\textsl{ELEMENT} \textbf{stem} (\textsl{#PCDATA})>
10158 <!\textsl{ELEMENT} \textbf{ending} (\textsl{#PCDATA})>
10160 <!\textsl{ELEMENT} \textbf{paradigms} (par+)>
10162 <!\textsl{ATTLIST} \textbf{paradigms} howmany CDATA \textsl{#REQUIRED}>
10164 <!\textsl{ELEMENT} \textbf{par} \textsl{EMPTY}>
10166 <!\textsl{ATTLIST} \textbf{par} n CDATA \textsl{#REQUIRED}>
10169 \end{alltt}
10170 \end{small}
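
The following sketch shows the smallest kind of file that this DTD accepts: one \texttt{form} with one \texttt{entry}, which in turn pairs an \texttt{endings} element with a \texttt{paradigms} element. All attribute values and element contents below are placeholders chosen only to satisfy the content models; they are assumptions and are not taken from the \texttt{apertium-lexical-webform} package.

\begin{small}
\begin{alltt}
<form lang="es" langpair="es-ca">
  <entry PoS="n" gen="f">
    <endings>
      <stem>cas</stem>
      <ending>a</ending>
      <ending>as</ending>
    </endings>
    <paradigms howmany="1">
      <par n="example__n"/>
    </paradigms>
  </entry>
</form>
\end{alltt}
\end{small}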
10174 \chapter[Grammatical symbols]{Grammatical symbols used in the modules}
10175 \label{se:simbolosmorf}
10179 \section[Dictionary symbols]{Grammatical symbols used in dictionaries}
10181 \subsection{List of symbols}
10184 \begin{tabular}{ll}
10186 \textbf{aa} & adjective-adjective (function of relative pronoun) \\
10187 \textbf{acr} & acronym \\ \textbf{al} & others (for proper nouns) \\
10188 \textbf{an} & adjective-noun (function of relative pronoun) \\
\textbf{ant} & anthroponym \\ \textbf{cni} & conditional \\
10190 \textbf{cnjadv} & adverbial conjunction\\ \textbf{cnjcoo} &
10191 co-ordinating conjunction\\ \textbf{cnjsub} & subordinating
10192 conjunction\\ \textbf{def} & definite \\ \textbf{dem} & demonstrative
10193 \\ \textbf{det} & determiner\\ \textbf{detnt} & neuter determiner \\
10194 \textbf{enc} & enclitic\\ \textbf{f} & feminine\\ \textbf{fti} &
10195 future indicative\\ \textbf{fts} & future subjunctive\\ \textbf{ger}
10196 & gerund\\ \textbf{ifi} & perfect preterite\\ \textbf{ij} &
10197 interjection\\ \textbf{imp} & imperative\\ \textbf{ind} &
10198 indefinite\\ \textbf{inf} & infinitive\\
10200 \end{tabular} \newpage
10202 \begin{tabular}{ll} \textbf{itg} & interrogative\\
10203 \textbf{loc} & locative\\
10204 \textbf{lpar} & ([\\ \textbf{lquest} & ¿\\ \textbf{m} & masculine\\
10205 \textbf{mf} & masculine-feminine\\ \textbf{n} & noun\\ \textbf{nn} &
10206 noun-noun (function of relative pronoun)\\ \textbf{np} & proper noun\\
10207 \textbf{nt} & neuter\\ \textbf{num} & numeral - number\\ \textbf{p1} &
10208 first person\\ \textbf{p2} & second person\\ \textbf{p3} & third
10209 person\\ \textbf{pii} & imperfect preterite indicative \\ \textbf{pis}
10210 & imperfect preterite subjunctive \\ \textbf{pl} & plural \\
10211 \textbf{pos} & possessive\\ \textbf{pp} & participle\\ \textbf{pr} &
10212 preposition\\ \textbf{preadv} & preadverb\\ \textbf{predet} &
10213 predeterminer\\ \textbf{pri} & present indicative\\ \textbf{prn} &
10214 pronoun\\ \textbf{pro} & proclitic\\ \textbf{prs} & present
10215 subjunctive\\ \textbf{ref} & reflexive\\ \textbf{rel} & relative\\
10216 \textbf{rpar} & )]\\ \textbf{sent} & . ? ; : ! \\ \textbf{sg} &
10217 singular\\ \textbf{sp} & singular-plural\\ \textbf{sup} &
10218 superlative\\ \textbf{tn} & tonic\\ \textbf{vaux} & auxiliary verb\\
10219 \textbf{vbhaver}& verb \emph{to have}\\ \textbf{vblex} & lexical
verb\\ \textbf{vbmod} & modal verb\\ \textbf{vbser} & verb \emph{to be}\\
10223 \end{tabular}
10225 \newpage
10226 \subsection{Specification of lexical forms}
The following tables give the order in which grammatical symbols are
placed in the morphological dictionaries of this system (from left to
right in the table). The examples in parentheses are from Spanish. \\ \\
10232 \begin{footnotesize}
10233 \begin{tabular}{|l|llllll|}
10235 \hline \textbf{Common adjectives} & \textbf{PoS} & \textbf{Gender} &
10236 \textbf{Number} &&& \\\cline{2-7}
10237 (difícil, rojo) & adj & m & sg &&& \\ & & f & pl &&& \\ & & mf & sp
10238 &&& \\ \hline \textbf{Interrogative, possessive,} &
10239 \textbf{PoS} & \textbf{Type} & \textbf{Gender}
10240 &\textbf{Number}&& \\\cline{2-7}
\textbf{indefinite and superlative} & adj & itg & m & sg &&\\
10242 \textbf{adjectives} & & pos & f & pl &&\\ (qué, tus, otra, buenísimo)
10243 & & ind & mf & sp &&\\ & & sup & & &&\\\hline
10246 \textbf{Adverbs} & \textbf{PoS} &&&&&\\\cline{2-7} (siempre, mañana)&
10247 adv &&&&&\\\hline
10249 \textbf{Preadverbs} & \textbf{PoS} &&&&&\\\cline{2-7} (muy, tan)&
10250 preadv &&&&&\\\hline
10252 \textbf{Interrogative adverbs} & \textbf{PoS}
10253 &\textbf{Type}&&&&\\\cline{2-7} (dónde) & adv & itg &&&&\\\hline
10255 \textbf{Adverbial conjunctions} & \textbf{PoS} &&&&&\\\cline{2-7}
10256 (que, así como) & cnjadv &&&&&\\ & cnjcoo &&&&&\\ & cnjsub
10257 &&&&&\\\hline
10260 \textbf{Determiners} & \textbf{PoS} & \textbf{Type} & \textbf{Gender}
10261 &\textbf{Number}&& \\\cline{2-7} (el, uno, este, mi) & det & def & m &
10262 sg &&\\ & & ind & f & pl &&\\ & & dem & mf & sp &&\\ & & pos & &
10263 &&\\\hline
10265 \textbf{Neuter determiners} & \textbf{PoS} &&&&&\\\cline{2-7} (lo)&
10266 detnt &&&&&\\\hline
10268 \textbf{Predeterminers} & \textbf{PoS} & \textbf{Gender} &
10269 \textbf{Number} &&& \\\cline{2-7} (todos) & predet & m & sg &&&\\ & &
10270 f & pl &&&\\ & & nt & sp &&&\\\hline \textbf{Interjections} &
10271 \textbf{PoS} &&&&&\\\cline{2-7} (hola) & ij &&&&&\\\hline
10276 \textbf{Common nouns}& \textbf{PoS} & \textbf{Gender} &
10277 \textbf{Number} &&& \\\cline{2-7} (casa, perro) & n & m & sg &&&\\ & n
10278 & f & pl &&&\\ & n & mf & sp &&&\\\hline
10280 \textbf{Proper nouns}& \textbf{PoS} &\textbf{Type}&&&&\\\cline{2-7}
10281 (Pedro, Londres) & np & ant &&&&\\ & & loc &&&&\\ & & al &&&&\\\hline
10283 \end{tabular} \newpage
10284 \begin{tabular}{|l|llllll|} \hline
10286 \textbf{Acronyms} & \textbf{PoS} & \textbf{Type} & \textbf{Gender} &
10287 \textbf{Number} && \\\cline{2-7} (IRPF, INEM) & n & acr & m & sg &&\\
10288 & & & f & pl &&\\ & & & mf & sp &&\\\hline
10291 \textbf{Numerals} & \textbf{PoS} & \textbf{Gender} & \textbf{Number}
10292 &&& \\\cline{2-7} (tres) & num & m & sg &&& \\ & & f & pl &&& \\ & &
10293 mf & sp &&& \\\hline
10295 \textbf{Prepositions} & \textbf{PoS} &&&&&\\\cline{2-7} (de, por) & pr
10296 &&&&&\\\hline
10298 \textbf{Interrogative pronouns} & \textbf{PoS} & \textbf{Type} &
10299 \textbf{Gender} &\textbf{Number}&& \\\cline{2-7} (quién, qué) & prn &
10300 itg & m & sg &&\\ & & & f & pl &&\\\hline
10303 \textbf{Enclitic, proclitic and} & \textbf{PoS} & \textbf{Type} &
10304 \textbf{Person}& \textbf{Gender} &\textbf{Number}& \\\cline{2-7}
10305 \textbf{tonic personal} & prn & enc & p1 & m & sg &\\
10306 \textbf{pronouns} & & pro & p2 & f & pl &\\ (yo, vosotros,
10307 ayudar\textbf{te}, & & tn & p3 & mf & sp & \\ \textbf{te} ayudo) &
10308 & & & nt && \\ & & & & & & \\\cline{2-7}
10309 \textbf{Procl. reflexive pron.} (se): & prn & pro & ref & p3 & mf &
sp\\\cline{2-7} \textbf{Tonic reflex. pron.} (sí): & prn & tn & ref &
10311 p3 & mf & sp\\\hline
10315 \textbf{Tonic possessive pron.} & \textbf{PoS} & \textbf{Type} &
10316 \textbf{Subtype}& \textbf{Gender} &\textbf{Number}& \\\cline{2-7}
10317 (mío, suyo) & prn & tn & pos & m & sg &\\ & & & & f & pl &\\\hline
10320 \textbf{Other tonic pronouns} & \textbf{PoS} & \textbf{Type} &
10321 \textbf{Gender} &\textbf{Number}&& \\\cline{2-7} (aquella, nadie,
10322 otro) & prn & tn & m & sg &&\\ & & & f & pl &&\\ & & & mf & sp && \\ &
10323 & & nt &&& \\\hline
10326 \textbf{Pronominal and adjectival} & \textbf{PoS} & \textbf{Type} &
10327 \textbf{Gender} & \textbf{Number} && \\\cline{2-7} \textbf{relatives}
10328 & rel & nn & m & sg &&\\ (que, cuyo) & & an & f & pl &&\\ & & aa & f &
10329 pl &&\\\hline
10331 \textbf{Adverbial relatives} & \textbf{PoS} & \textbf{Type} & & &&
10332 \\\cline{2-7} (como, donde) & rel & adv & & &&\\\hline
10335 \textbf{Verbs} & \textbf{Type} & \textbf{Tense} & \textbf{Person}
&\textbf{Number}&& \\ \textbf{(personal forms)} & & \textbf{and mood}
10337 & & && \\\cline{2-7} (subo, vamos) & vblex & cni & p1 & sg &&\\ &
10338 vbser & fti & p2 & pl &&\\ & vbhaver & fts & p3 & &&\\ & vbmod & ifi &
10339 & &&\\ & & imp & & &&\\ & & pii & & &&\\ & & pis & & &&\\ & & pri & &
10340 &&\\ & & prs & & &&\\\hline
10343 \textbf{Verbs} & \textbf{Type} & \textbf{Form} & & && \\\cline{2-7}
10344 \textbf{(infinitive and gerund)} & vblex & inf & & &&\\ (cantar,
10345 buscando) & vbser & ger & & &&\\ & vbhaver & & & &&\\ & vbmod & & &
10346 &&\\\hline
10350 \textbf{Verbs} & \textbf{Type} & \textbf{Form} &\textbf{Gender}
10351 &\textbf{Number} && \\\cline{2-7} \textbf{(participle)} & vblex & pp &
10352 m & sg &&\\ (dormido, cansadas) & vbser & & f & pl &&\\ & vbhaver & &
10353 & &&\\ & vbmod & & & &&\\\hline
10357 \end{tabular}
10358 \end{footnotesize}
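
As a sketch of how this ordering reads in practice, the following lines write four of the Spanish examples above as lexical forms. They assume the usual Apertium notation of a lemma followed by its symbols, each enclosed in angle brackets; that notation is an assumption here, since it is not defined in this appendix, and the lemmas given for the inflected forms \emph{subo} and \emph{dormido} are likewise illustrative.

\begin{small}
\begin{alltt}
casa<n><f><sg>
el<det><def><m><sg>
subir<vblex><pri><p1><sg>
dormir<vblex><pp><m><sg>
\end{alltt}
\end{small}

In each line the symbols follow the columns of the corresponding table row from left to right: part of speech first, then type or tense where applicable, then person, gender and number.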
10361 \newpage
10362 \section{Categories used in the part-of-speech tagger}
10363 \subsection{Spanish tagger}
10365 These are the categories or coarse tags used by the Spanish
10366 part-of-speech tagger.
10369 \begin{footnotesize}
\begin{longtable}{l|l|c|l} \hline \textbf{Tag} & \textbf{Description} &
\textbf{Closed} & \textbf{Examples} \\ \hline \hline
\endhead \multicolumn{4}{c}{\textbf{Simple tags}} \\ \hline \hline PARAPR
10373 & Lexicalization of \emph{para} as a preposition & Yes & \\ \hline
10374 PARAVBPRI & Lexicalization of \emph{para} as a lexical verb & & \\ &
10375 in present indicative & Yes& \\ \hline PARAVBIMP & Lexicalization of
10376 \emph{para} as a lexical verb & & \\ & in imperative & Yes& \\ \hline
10377 QUECNJ & Lexicalization of \emph{que} as a conjunction & Yes& \\
10378 \hline QUEREL & Lexicalization of \emph{que} as a relative pronoun &
10379 Yes& \\ \hline COMOPR\footnote{The morphological analyser considers
10380 that \emph{como} can be a preposition since it can be replaced with
\emph{en calidad de} in some contexts (e.g.\ \emph{'Os hablo como
10382 director de la película'}).} & Lexicalization of \emph{como} as a
10383 preposition& Yes& \\
%!!! Explain this treatment as a preposition, because it is not exactly standard
10385 \hline COMOREL & Lexicalization of \emph{como} as a
10386 relative pronoun & Yes& \\ \hline COMOVB & Lexicalization of
10387 \emph{como} as a lexical verb & & \\ & in present indicative & Yes& \\
10388 \hline MASADV & Lexicalization of \emph{más}/\emph{menos} as an adverb
10389 & Yes& \\ \hline MASADJ & Lexicalization of \emph{más}/\emph{menos} as
10390 an adjective & Yes& \\ \hline MASNP & Lexicalization of \emph{Más} as
10391 a proper noun & Yes& \\ \hline ALGOADV & Lexicalization of \emph{algo}
10392 as an adverb & Yes& \\ \hline ACRONIMOM & Acronym & No& BCH\\ \hline
10393 ACRONIMOF & Acronym & No& ONU\\ \hline ACRONIMOMF & Acronym & No&
10394 ATS\\ \hline INTNOM & Interrogative pronoun & Yes& quién, cuál\\
10395 \hline ADJINT & Interrogative adjective & Yes& cuánto, qué\\ \hline
10396 INTADV & Interrogative adverb & Yes& cuándo, dónde\\ \hline PREADV &
10397 Adverb that can precede another & &\\& adverb or an adjective & Yes&
10398 muy, bien, mal\\ \hline ADV & Adverb & No& nunca, ahí\\ \hline CNJSUBS
10399 & Subordinating conjunction & Yes& que\\
%!!! Are there no subordinating conjunctions other than 'que'?
\hline CNJCOORD &
Co-ordinating conjunction & Yes& y, pero\\ \hline CNJADV & Adverbial
conjunction & No& si\\ \hline DETNT & Neuter determiner & Yes& lo\\
\hline DETM & Determiner & Yes& el, un\\ \hline DETF & Determiner &
Yes& la, una\\ \hline DETMF & Determiner & Yes& cada\\ \hline INTERJ &
Interjection & No& ojalá, hola\\ \hline NOM & Noun & No& casa, coche\\
\hline ANTROPONIM & Proper noun for person & No& Fernando\\ \hline
TOPONIM & Proper noun for place & No& Alicante\\ \hline NPALTRES &
Other proper nouns & No& Linux, Seat\\ \hline NUM & Numeral & Yes&
tres, cuatro\\ \hline PREDETNT & Neuter predeterminer & Yes& todo\\
\hline PREDET & Predeterminer & Yes& toda\\ \hline PREP & Preposition
& Yes& ante, desde\\ \hline PRNTNNT & Neuter tonic pronoun & Yes&
algo, esto\\ \hline PRNTN & Tonic pronoun & Yes& ambos, nadie\\ \hline
PRNENCREF & Reflexive enclitic pronoun & Yes& se \\ \hline PRNPROREF &
Reflexive proclitic pronoun & Yes& se \\ \hline PRNENC & Enclitic
pronoun & Yes& me, nos\\ \hline PRNPRO & Proclitic pronoun & Yes& le,
te\\ \hline VLEXINF & Lexical verb in infinitive & No& cantar, reír\\
\hline VLEXGER & Lexical verb in gerund & No& hablando\\ \hline
VLEXPARTPI & Lexical verb in participle & No& dicho, cantado\\ \hline
VLEXPFCI & Lexical verb in present, future or & & \\ & conditional
indicative & No& digo, diré, diría\\ \hline VLEXIPI & Lexical verb in
imperfect preterite or & & \\ & perfect preterite indicative & No&
cantaba, dijo\\ \hline VLEXSUBJ & Lexical verb in subjunctive & No&
hablase, dijéramos\\ \hline VLEXIMP & Lexical verb in imperative & No&
canta, comed\\ \hline VSERINF & Verb \emph{to be} in infinitive & Yes&
ser\\ \hline VSERGER & Verb \emph{to be} in gerund & Yes& siendo\\
\hline VSERPARTPI & Verb \emph{to be} in participle & Yes& sido\\
\hline VSERPFCI & Verb \emph{to be} in present, future or & & \\ &
conditional indicative & Yes& soy, seré, sería\\ \hline VSERIPI & Verb
\emph{to be} in imperfect preterite or & & \\ & perfect preterite
indicative & Yes& era, fui\\ \hline VSERSUBJ & Verb \emph{to be} in
subjunctive & Yes& fueras\\ \hline VSERIMP & Verb \emph{to be} in
imperative & Yes& sé\\ \hline VHABERINF & Verb \emph{to have} in
infinitive & Yes& haber\\ \hline VHABERGER & Verb \emph{to have} in
gerund & Yes& habiendo\\ \hline VHABERPARTPI & Verb \emph{to have} in
participle & Yes& habido\\ \hline VHABERPFCI & Verb \emph{to have} in
present, future or & & \\ & conditional indicative & Yes& hay, habrán,
habría\\ \hline VHABERIPI & Verb \emph{to have} in imperfect preterite
or & & \\ & perfect preterite indicative & Yes& había, hubo\\ \hline
VHABERSUBJ & Verb \emph{to have} in subjunctive & Yes& hubieran\\
\hline VMODALINF & Modal verb in infinitive & Yes& deber, poder\\
\hline VMODALGER & Modal verb in gerund & Yes& debiendo\\ \hline
VMODALPARTPI & Modal verb in participle & Yes& podido\\ \hline
VMODALPFCI & Modal verb in present, future or & & \\ & conditional
indicative & Yes& puede, deberá, podría\\ \hline VMODALIPI & Modal
verb in imperfect preterite or & & \\ & perfect preterite indicative &
Yes& podía, debió\\ \hline VMODALSUBJ & Modal verb in subjunctive &
Yes& pudiese, debiéramos\\ \hline VMODALIMP & Modal verb in imperative
& Yes& poded, debed\\ \hline ADJM & Adjective & No& gracioso\\ \hline
ADJF & Adjective & No& graciosa\\ \hline ADJMF & Adjective & No&
inteligente\\ \hline ADJPOS & Possessive adjective & Yes& mío\\ \hline
REL & Relative pronoun & Yes& quien, cuya\\ \hline RELADV & Adverbial
relative & Yes& cuando, donde\\ \hline \hline
\multicolumn{4}{c}{\bf{Compound tags}} \\ \hline \hline PREPDET &
Contraction of preposition and determiner & Yes& del, al\\ \hline
PRCNJ & Multiword made of preposition and & & \\ &conjunction & Yes& a
que\\ \hline PRREL & Multiword made of preposition and & & \\
&relative & Yes& en que\\ \hline INFLEXPRNENC & Lexical verb in
infinitive with enclitics & No& dármelo, cantarlo\\ \hline
GERLEXPRNENC & Lexical verb in gerund with enclitics & No&
cantándosela\\ \hline IMPLEXPRNENC & Lexical verb in imperative with
enclitics & No& dímelo\\ \hline INFSERPRNENC & Verb \emph{to be} in
infinitive with enclitics & Yes& serlo\\ \hline GERSERPRNENC & Verb
\emph{to be} in gerund with enclitics & Yes& siéndolo\\ \hline
IMPSERPRNENC & Verb \emph{to be} in imperative with enclitics & Yes&
sedlo\\ \hline INFHABPRNENC & Verb \emph{to have} in infinitive with
enclitics & Yes& habérsela\\ \hline GERHABPRNENC & Verb \emph{to have}
in gerund with enclitics & Yes& habiéndole\\ \hline INFMODPRNENC &
Modal verb in infinitive with enclitics & Yes& poderla, deberlo\\
\hline GERMODPRNENC & Modal verb in gerund with enclitics& Yes&
debiéndosela\\ \hline IMPMODPRNENC & Modal verb in imperative with
enclitics& Yes& debédmela\\ \hline \hline \multicolumn{4}{c}{\bf{Other
tags}} \\ \hline \hline LQUEST & Opening question mark & & ¿ \\ \hline
LPAR & Opening parenthesis or square bracket & & (, [ \\ \hline RPAR &
Closing parenthesis or square bracket & & ), ] \\ \hline CM & Comma &
& , \\ \hline SENT & Sentence end character & & ., :, ;, ?, !\\ \hline
\hline \multicolumn{4}{l}{}\\ %p{0.50\textwidth}

\end{longtable}
\end{footnotesize}

\subsection{Catalan tagger}

Because the Catalan tagger categories closely match the Spanish ones,
only the tags that are new or different in the Catalan tagger are
listed here.

\begin{footnotesize}
\begin{longtable}{l|l|c|l} \hline \bf{Tag} & \bf{Description} &
\bf{Closed} & \bf{Examples} \\ \hline \hline
\endhead \multicolumn{4}{c}{\bf{Simple tags}} \\ \hline \hline MOLTADV
& Lexicalization of \emph{molt}/\emph{gaire} as an adverb & Yes & \\
\hline MOLTPREADV & Lexicalization of \emph{molt}/\emph{gaire} as a
preadverb & Yes& \\ \hline VOLERMOD & Lexicalization of \emph{voler} as a
modal verb & Yes& \\ \hline VOLERLEX & Lexicalization of \emph{voler}
as a lexical verb & Yes& \\ \hline VA & Lexicalization of \emph{va} as
a form of the verb \emph{anar} & Yes& \\ \hline \multicolumn{4}{l}{}\\
%p{0.50\textwidth}
\end{longtable}
\end{footnotesize}

\subsection{Galician tagger}

Because the Galician tagger categories closely match the Spanish ones,
only the tags that are new or different in the Galician tagger are
listed here.

\begin{footnotesize}
\begin{longtable}{l|l|c|l} \hline \bf{Tag} & \bf{Description} &
\bf{Closed} & \bf{Examples} \\ \hline \hline
\endhead \multicolumn{4}{c}{\bf{Simple tags}} \\ \hline \hline VBIRNPS
& Lexicalization of \emph{to go} in infinitive & & \\ & and gerund &
Yes & \\ \hline VBIRPARTPI & Lexicalization of \emph{to go} in
participle & Yes& \\ \hline VBIRPS & Lexicalization of \emph{to go} in
the personal forms & & \\ & of indicative & & \\ & and subjunctive &
Yes& \\ \hline VBIRIMP & Lexicalization of \emph{to go} in imperative
& Yes& \\ \hline VHABERNPS & Lexicalization of \emph{to have} in
infinitive & & \\ & and gerund & Yes & \\ \hline VHABERPARTPI &
Lexicalization of \emph{to have} in participle & Yes& \\ \hline
VHABERPS & Lexicalization of \emph{to have} in the personal forms & &
\\ & of indicative & & \\ & and subjunctive & Yes& \\ \hline VHABERIMP
& Lexicalization of \emph{to have} in imperative & Yes& \\ \hline
APREP & Lexicalization of \emph{a} as a preposition & Yes& \\ \hline
VLEXNPS & Lexical verb: infinitive and gerund & No& achegar,
achegándomos\\ \hline VLEXPS & Lexical verb: personal forms & & \\ &
in indicative & No& achegue, achegaré\\ \hline VSERNPS & Verb \emph{to
be}: infinitive and gerund & Yes& ser, seres\\ \hline VSERPS & Verb
\emph{to be}: personal forms & & \\ & in indicative& Yes& fosen, es\\
\hline \hline \multicolumn{4}{c}{\bf{Compound tags}} \\ \hline \hline
PREPDETM &
Contraction of preposition and & & \\ & masculine determiner & Yes&
do, ao\\ \hline PREPDETF & Contraction of preposition and & & \\ &
feminine determiner & Yes& da, ás\\ \hline PREPDETN & Contraction of
preposition and & & \\ & neuter determiner & Yes& do\\ \hline
PREPDETDET & Contraction of preposition and & & \\ & two determiners &
Yes& destoutro\\ \hline PREPPRTNNT & Contraction of preposition and &
& \\ &neuter tonic pronoun & Yes& daquilo\\ \hline PREPPRNTN &
Contraction of preposition and & & \\ &tonic pronoun & Yes&
daqueloutra\\ \hline PREPTNTN & Contraction of preposition and & & \\
& two tonic pronouns & Yes& nestoutra\\ \hline PREPNUM & Contraction
of preposition and & & \\ & numeral & Yes& dunha\\ \hline PREDETDET &
Contraction of predeterminer and & & \\ & determiner & Yes& tódalas\\
\hline INTADVDET & Contraction of adverbial interrogative and & & \\
& determiner & Yes& u-la\\ \hline DETDETM & Contraction of two masculine
determiners & Yes& ámbolos\\ \hline DETDETF & Contraction of two
feminine determiners& Yes& ámbalas\\ \hline PRNPRN & Contraction of
two tonic pronouns & Yes& esoutra\\ \hline PRNPRN & Contraction of two
proclitic pronouns & Yes& chas\\ \hline CNJCDET & Contraction of
co-ordinating conjunction and & & \\ & determiner & Yes& maila\\
\hline CNJSUB & Contraction of subordinating conjunction and & &
\\ & determiner & Yes& cás\\ \hline \hline \multicolumn{4}{l}{}\\
%p{0.50\textwidth}
\end{longtable}
\end{footnotesize}

\newpage
\chapter{Abbreviations used in the text}
\label{se:apendiceabrev}
\begin{description}
\item[ANSI] American National Standards Institute; when used
  informally in the expression \emph{ANSI text}, it refers to text
  encoded in any of the one-byte-per-character encodings defined in
  the ISO-8859 standard \cite{Unicode}.
\item[ca] ISO 639 two-letter code\footnote{See
    \texttt{\url{http://www.w3.org/WAI/ER/IG/ert/iso639.htm}}} for
  Catalan
\item[DTD] Document type definition in XML
\item[es] ISO 639 two-letter code for Spanish
\item[eu] ISO 639 two-letter code for Basque
\item[gl] ISO 639 two-letter code for Galician
\item[HTML] Hypertext markup language
\item[LF] Lexical form (see page~\pageref{pg:FSFL})
\item[MT] Machine translation
\item[POS] Part of speech
\item[pt] ISO 639 two-letter code for Portuguese
\item[RTF] Rich text format
\item[SF] Surface form (see page~\pageref{pg:FSFL})
\item[SL] Source language
\item[SLLF] Source language lexical form
\item[TL] Target language
\item[TLLF] Target language lexical form
\item[XML] Extensible markup language
\end{description}

\newpage \nota{Add the EAMT 2005 article and cite it}
\bibliography{documentation} \bibliographystyle{plain}
\addcontentsline{toc}{chapter}{Bibliography}
\end{document}