apertium2-documentation-en/documentation.tex

   1 \documentclass [12pt,a4paper]{book}
   2 %\ifx\pdfoutput\undefined
   3 %\usepackage[dvips]{graphicx}
   4 %\else
   5 %\usepackage[pdftex]{graphicx}
   6 %\usepackage{type1cm}
   7 %\fi
   8 \usepackage[dvips]{graphicx}
   9 \usepackage{rotating}
  10 \usepackage{palatino,helvet}
  11 \usepackage[english]{babel}
  12 \usepackage[latin1]{inputenc}
  13 \usepackage{sectsty}
  14 \usepackage{alltt}
  15 \usepackage[small,bf]{caption}
  16 \usepackage{url}
  17 \usepackage{rotating}
  18 \usepackage{longtable}
  19 % \usepackage{tocvsec2}
  20
  21 % \allsectionsfont{\sffamily}
  22 %
  23
  24
  25 \setcounter{secnumdepth}{3}
  26 %\setcounter{tocdepth}{3} %%(so that index reaches the third level, more specific)
  27
  28 % Line break after \paragraph
  29 \makeatletter % so that '@' is recognized as a normal character
  30 \renewcommand{\paragraph}{\@startsection{paragraph}{4}{\z@}{-3.25ex \@plus
  31 -1ex \@minus -.2ex}{1.5ex \@plus .2ex}{\normalfont\normalsize\bfseries}}
  32 \makeatother % so that '@' is again a special character
  33
  34
  35
  36 %  \newcommand{\nota}[1]{ \begin{small}
  37 %   \begin{quote}
  38 %   \begin{sf}
  39 %   [Nota: #1]
  40 %   \end{sf}
  41 %   \end{quote}
  42 %   \end{small}
  43 % }
  44
  45 \newcommand{\nota}[1]{}
  46
  47
  48 \newcommand{\notavisible}[1]{
  49   \begin{small}
  50     \begin{quote}
  51       \begin{sf}
  52         [#1]
  53       \end{sf}
  54     \end{quote}
  55   \end{small}
  56 }
  57
  58
  59 %% Project ``Open Source Machine Translation for the languages of Spain (FIT-340101-2004-3) \\[.5ex]
  60 \frontmatter
  61
  62 \title{\sffamily\bfseries Documentation of the Open-Source
  63 Shallow-Transfer Machine Translation Platform \emph{Apertium}}
  64 %%\date{28 June 2005}
  65
  66
  67
  68  \author{\textbf{AUTHORS}:\\Mikel L. Forcada\\Boyan Ivanov
  69 Bonev\\Sergio Ortiz Rojas\\ Juan Antonio Pérez Ortiz \\
  70 Gema Ramírez Sánchez\\Felipe Sánchez
  71  Martínez\\ Carme Armentano-Oller\\ Marco A.\ Montava \\ Francis M.\ Tyers\\\\\textbf{EDITOR}:\\Mireia Ginestí
  72  Rosell\\[0.8cm]\\Departament de Llenguatges i Sistemes
  73  Informàtics\\Universitat d'Alacant}
  74 %% \textit{Eleka Ingeniaritza Linguistikoa} \\
  75 %% \textit{Zelai Haundi Kalea, 3} \\
  76 %% \textit{Osinalde Industrialdea} \\
  77 %% \textit{20170 Usurbil}}
  78
  79
  80
  81 \begin{document}
  82 \pagestyle{headings}
  83 %\maxtocdepth{subsubsection}
  84 %\maxtocdepth{paragraph}
  85
  86 %\settocdepth{subsubsection}
  87
  88 \maketitle
  89
  90
  91 \newpage \thispagestyle{empty}
  92
  93 \bigskip
  94 \begin{quote} Copyright \copyright 2007 Grup Transducens, Universitat
  95   d'Alacant.  Permission is granted to copy, distribute and/or modify
  96   this document under the terms of the GNU Free Documentation License,
  97   Version 1.2 or any later version published by the Free Software
  98   Foundation; with no Invariant Sections, no Front-Cover Texts, and no
  99   Back-Cover Texts. A copy of the license can be found in
 100   \url{http://www.gnu.org/copyleft/fdl.html}.
 101
 102   % The unofficial
 103 %   translation of the license to Spanish can be found at
 104 %   \url{http://gugs.sindominio.net/licencias/gfdl-1.2-es.html}, the
 105 %   unofficial translation to Catalan can be found at
 106 %   \url{http://www.softcatala.org/llicencies/fdl-ca.html}, and the
 107 %   unofficial translation to Galician can be found at
 108 %   \url{http://members.tripod.com.br/RamonFlores/GNU/gpl.html}.
 109
 110
 111 \notavisible{Shouldn't we license this under GPL or another license that is free in Debian terms?
 112
 113 Make sure 1.2 is the right license.
 114
 115 Perhaps we don't want ``or any later version''.
 116
 117 Check the author list. We might have forgotten someone.}
 118
 119 \end{quote}
 120
 121
 122 \bigskip
 123
 124
 125 \tableofcontents
 126
 127 \newpage
 128
 129 \mainmatter
 130 \chapter*{Introduction}\addcontentsline{toc}{chapter}{Introduction}
 131
 132
 133 This documentation describes the Apertium platform, one of the
 134 open-source machine translation systems that originated within the
 135 project ``Open-Source Machine Translation for the Languages of Spain''
 136 (``Traducción automática de código abierto para las lenguas del estado
 137 español'')\nota{Should we put a summary of the project data in an appendix
 138   (code, duration, funding, participants, etc.) and refer to it
 139   here? - Mikel}. It is a shallow-transfer machine translation system,
 140 initially designed for translation between related language pairs,
 141 although some of its components have also been used in the
 142 deep-transfer architecture (\emph{Matxin}) that has been developed in
 143 the same project for the pair Spanish--Basque. \emph{Apertium} can
 144 currently translate between the pairs Spanish--Galician,
 145 Spanish--Catalan\footnote{With the name \emph{Catalan} we also refer
 146   to the Valencian dialectal variant of this language.},
 147 Catalan--Occitan and Catalan--French, and can be used to build translators
 148 between other related language pairs, such as
 149 Danish--Swedish, Czech--Slovak, etc.  \notavisible{Update the
 150   language-pair list!}  \notavisible{I think it is very important to say in this paragraph that the system has been extended}
 151
 152
 153 \notavisible{The next paragraph needs updating or generalizing:}
 154 Existing machine translation systems available at present for the
 155 pairs \texttt{es}--\texttt{ca} and \texttt{es}--\texttt{gl} are mostly
 156 commercial or use proprietary technologies, which makes them very hard
 157 to adapt to new usages; furthermore, they use different technologies
 158 across language pairs, which makes it very difficult to integrate them
 159 in a single multilingual content management system.
 160
 161 One of the main novelties of the architecture described here is that
 162 it has been released under open-source licenses (in most cases, GNU
 163 GPL; some data still have a Creative Commons license) and is
 164 distributed free of charge. This means that anyone having the
 165 necessary computational and linguistic skills will be able to adapt or
 166 enhance the platform or the language-pair data to create a new machine
 167 translation system, even for other pairs of related languages. The
 168 licenses chosen make these improvements immediately available to
 169 everyone.  We therefore expect that the introduction of this
 170 open-source machine translation architecture will solve some of the
 171 mentioned problems (having different technologies for different pairs,
 172 closed-source architectures being hard to adapt to new uses, etc.) and
 173 promote the exchange of existing linguistic data through the use of
 174 the XML-based formats defined in this documentation. On the other
 175 hand, we think that it will help shift the current business model from
 176 a license-centred one to a services-centred one.
 177
 178 It is worth mentioning that ``Open-Source Machine Translation for the
 179 Languages of Spain'' was the first large open-source machine
 180 translation project funded by the central Spanish Government, although
 181 the adoption of open-source software by the Spanish governments is not
 182 new.
 183
 184 \notavisible{Don't forget about the other funding agencies supporting open source MT; this needs some contextualization, relating to funding, etc. Mention later funding and refer to the appropriate section.}
 185
 186 This documentation describes in detail the characteristics of the
 187 Apertium platform, and is organized as follows:
 188
 189
 190 \begin{itemize}
 191 \item Chapter \ref{ss:descrarq}: \textbf{general description} of the
 192 shallow-transfer machine translation system and of the modules that
 193 make it up.
 194
 195 \item Chapter \ref{se:flujodatos}: description of the \textbf{format
 196 of the data stream} that circulates from one module to the next one.
 197
 198 \item Chapter \ref{se:especificmodulos}: \textbf{specification of the
 199 modules} of the system. For each module there is a description of: the
 200 \textit{program} and its characteristics,  the \textit{format of the data}
 201 that the module uses, and the \textit{compilers}  used for it.
 202 This chapter is divided into the following sections:
 203   \begin{itemize}
 204   \item [-]Section \ref{ss:modproclex}: \emph{Lexical processing
 205     modules}, where the morphological analyser, the lexical transfer
 206     module, the morphological generator and the post-generator are
 207     described (Section \ref{ss:funcproclex}), along with the format of
 208     the dictionaries used by these modules (Section
 209     \ref{ss:diccionarios}) and their compilers (Section
 210     \ref{se:compiladoresdic}).
 211   \item [-]Section \ref{ss:tagger}: \emph{Part-of-speech tagger},
 212     which describes the tagger (Section \ref{functagger}) and the
 213     format of the linguistic data used by the tagger (Section
 214     \ref{datostagger}).
 215 % MLF 20060328 removed % and the corresponding compiler (section
 216 %\ref{ss:gentagger})
 217
 218 \nota{the lextor still needs to be described, and added wherever
 219 the modules of the system are mentioned}
 220
 221   \item [-]Section \ref{se:pretransfer}: \emph{Pre-transfer module},
 222     which describes the module that runs before the structural
 223     transfer module to perform some operations on multiword units.
 224   \item [-]Section \ref{ss:transfer}: \emph{Structural transfer
 225     module}, where there is a description of the program (Section
 226     \ref{functransfer}) and of the format of the structural transfer
 227     rules (Section \ref{formatotransfer}).
 228 % MLF 20060328 % and the % corresponding compiler (section
 229 % \ref{gentransfer})
 230   \item [-]Section \ref{se:desformat}: \emph{De-formatter and
 231     Re-formatter}, which describes these modules (Section
 232     \ref{ss:formato}), the rules for format processing (Section
 233     \ref{ss:reglasformato}) and how these modules are generated
 234     (Section \ref{se:gendeformat}).
 235
 236   \end{itemize}
 237
 238
 239
 240 \item Chapter \ref{se:instalacion}: description of how to
 241 \textbf{install the system} and to \textbf{run the translator}.
 242
 243 \item Chapter \ref{se:datosling}: explanation of
 244   how to \textbf{modify the linguistic data} used by the translator,
 245   that is, the dictionaries, the part-of-speech disambiguation data
 246   and the structural transfer rules created in this project for
 247   Spanish, Catalan and Galician. It also contains a brief
 248   description of the characteristics of the
 249 available data for these three language pairs.
 250 \notavisible{I would try to be more general, and perhaps remove this section or update with some other pairs. Any ideas on how to do this?}
 251
 252
 253 \nota{Are the program names, and the package each one belongs to,
 254 given everywhere?}
 255
 256
 257 \end{itemize}
 258
 259
 260 The files that this documentation refers to can be downloaded
 261 from the project web page on SourceForge:
 262 \url{http://apertium.sourceforge.net/}.  From this page you can
 263 download the packages needed for installation, as well as view the
 264 individual files in the SVN (main) and CVS (residual) repositories of
 265 the project.  The machine translation systems for the different
 266 language pairs can also be tested on the Internet at
 267 \url{http://xixona.dlsi.ua.es/apertium/}.
 268
 269 \notavisible{Shouldn't we mention the debugging interfaces?}
 270 \notavisible{Should we define SVN and CVS?}
 271
 272 %The present document has some sections that are incomplete or
 273 %have not been written yet.
 274
 275
 276 \paragraph*{Acknowledgements:} The present work has benefited from the
 277 contribution of many people and institutions:
 278 \begin{itemize}
 279 \item The Spanish Ministry of Industry, Commerce and Tourism has
 280   funded the development of this toolbox through the projects
 281   ``Open-Source Machine Translation for the Languages of Spain'', code
 282   FIT-340101-2004-3, and its extension FIT-340001-2005-2, and
 283   ``EurOpenTrad: Open-Source Advanced Machine Translation for the
 284   European Integration of the Languages of Spain'', code
 285   FIT-350101-2006-5, all of them belonging to the PROFIT program.
 286
 287
 288
 289 \item Workers and scholars from other machine translation projects at
 290 the Universitat d'Alacant: Míriam Antunes Scalco, Carme Armentano i
 291 Oller, Raül Canals i Marote, Alicia Garrido Alenda, Patrícia Gilabert
 292 i Zarco, Maribel Guardiola i Savall, Javier Herrero Vicente, Amaia
 293 Iturraspe Bellver, Sandra Montserrat i Buendia, Hermínia Pastor Pina,
 294 Antonio Pertusa Ibáñez, Francisco Javier Ramos Salas, Marcial Samper
 295 Asensio and Miguel Sánchez Molina.
 296 \item The companies and institutions that have funded these other
 297 machine translation projects: Spanish Ministry of Science and
 298 Technology, Caja de Ahorros del Mediterráneo, Universitat d'Alacant
 299 and Portal Universia, S.A.
 300 \item Iñaki Alegria, from the Ixa group of the Euskal Herriko
 301 Unibertsitatea (University of the Basque Country), for his close
 302 reading of previous versions of this document.
 303 \end{itemize}
 304
 305 \vspace{12cm}
 306
 307
 308
 309
 310 \chapter[The translation engine]{The shallow-transfer machine
 311 translation engine }
 312 \label{ss:descrarq}
 313
 314
 315 This chapter describes briefly the structure of the shallow-transfer
 316 machine translation engine, which is largely based on that of the
 317 existing systems for Spanish--Catalan \textsf{interNOSTRUM}
 318 \cite{canals01b,garridoalenda01p,garrido99j} and for
 319 Spanish--Portuguese \textsf{Traductor Universia} \cite{garrido03p,
 320 gilabert03j}, both developed by the Transducens group of the
 321 Universitat d'Alacant.  It is a classical indirect translation system
 322 that uses a partial syntactic transfer strategy similar to the one
 323 used by some commercial MT systems for personal computers.
 324
 325
 326 The design of the system makes it possible to produce MT systems that
 327 are \emph{fast} (translating tens of thousands of words per second on
 328 ordinary desktop computers) and that achieve results that are, in spite of
 329 the errors, reasonably intelligible and easily correctable. In the
 330 case of related languages such as the ones involved in the project
 331 (Spanish, Galician, Catalan), a mechanical word-for-word translation
 332 (with a fixed equivalent) would produce errors that, in most
 333 cases, can be solved with a fairly rudimentary analysis (a
 334 morphological analysis followed by a superficial, local and partial
 335 syntactic analysis) and with an appropriate treatment of lexical
 336 ambiguities (mainly due to homography). The design of our system
 337 follows this approach with very interesting results. The Apertium
 338 architecture uses finite-state transducers for lexical processing,
 339 hidden Markov models for part-of-speech tagging and finite-state-based
 340 chunking for structural transfer.
 341
 342
 343 The translation engine consists of an 8-module \emph{assembly line},
 344 which is represented in Figure \ref{fg:modules}.  To ease diagnosis
 345 and independent testing, modules communicate with each other using text
 346 streams.  This way, the input and output of the modules can be checked
 347 at any moment and, when an error in the translation process is
 348 detected, it is easy to test the output of each module separately to
 349 track down the origin of the error. At the same time, communication
 350 via text allows some of the modules to be used in isolation,
 351 independently from the rest of the MT system, for other
 352 natural-language processing tasks, and enables the construction of
 353 prototypes with modified or additional modules.
 354
 355 We decided to encode linguistic data files in
 356 XML\footnote{\url{http://www.w3.org/XML/}}-based formats because of their
 357 interoperability, their independence of the character set and the
 358 availability of many tools and libraries that make it easy to analyse
 359 data in this format. As stated in \cite{ide00}, XML is the
 360 emerging standard for data representation and exchange on
 361 the Internet. Technologies around XML include very powerful mechanisms for
 362 accessing and editing XML documents, which will probably have a
 363 significant impact on the development of tools for natural language
 364 processing and annotated corpora.
 365
 366
 367 The modules Apertium consists of are the following:
 368
 369 \begin{figure*} {\footnotesize \setlength{\tabcolsep}{0.5mm}
 370 \begin{center}
 371 \begin{tabular}{cccccccc}
 372 \\
 373 \parbox{0.7cm}{SL text} \\
 374 $\downarrow$ \\
 375 \framebox{\parbox{1.4cm}{de\-formatter}} $\rightarrow$ &
 376 \framebox{\parbox{0.8cm}{morph. anal.}}  $\rightarrow$ &
 377 \framebox{\parbox{1.2cm}{PoS tagger}} $\rightarrow$ &
 378 \framebox{\parbox{1.1cm}{struct.\ transf.}} $\rightarrow$ &
 379 \framebox{\parbox{0.8cm}{morph. gen.}}  $\rightarrow$ &
 380 \framebox{\parbox{1.0cm}{post\-genera\-tor}} $\rightarrow$ &
 381 \framebox{\parbox{1.2cm}{re-format\-ter}} \\ & & & $\updownarrow$ & &
 382 & $\downarrow$ \\ & & & \framebox{\parbox{1.0cm}{lex.\ transfer}} & &
 383 & \parbox{0.7cm}{TL text}\\\\
 384 \end{tabular}
 385 \end{center} }
 386 \caption{The eight modules that build the assembly line of the
 387 shallow-transfer machine translation system.}
 388 \label{fg:modules}
 389 \label{pg:modules}
 390 \end{figure*}
 391
 392
 393
 394 \begin{itemize}
 395 \item The \emph{de-formatter}, which separates the text to be
 396 translated from the format information (RTF, HTML, etc.); its
 397 specification can be found in Section \ref{ss:formato}. Format
 398 information is encapsulated so that the rest of the modules treat it
 399 as blanks between words. For example, for the HTML text in Spanish:
 400 \begin{alltt}
 401  es <em>una señal</em>
 402 \end{alltt}
 403 (``it is a sign'') the de-formatter encapsulates the HTML tags
 404 in brackets and gives the output:
 405 \begin{alltt}
 406 es [<em>]una señal[</em>]
 407 \end{alltt}
 408 The character sequences in brackets are treated by the
 409 rest of the modules as simple blanks between words.
 410 \item \label{pg:FSFL} The \emph{morphological analyser}, which
 411   tokenizes the text into \emph{surface forms} (SF) (lexical units as
 412   they appear in texts) and delivers, for each SF, one or more
 413   \emph{lexical forms} (LF) consisting of a \emph{lemma} (the base form
 414   commonly used in classic dictionary entries), the \emph{lexical
 415   category} (noun, verb, preposition, etc.) and morphological
 416   inflection information (number, gender, person, tense,
 417   etc.). Tokenization of a text into SFs is not straightforward due to
 418   the existence, on the one hand, of contractions (in Spanish,
 419   \emph{del}, \emph{teniéndolo}, \emph{vámonos}; in English,
 420   \emph{didn't}, \emph{can't}) and, on the other hand, of lexical
 421   units made of more than one word (in Spanish, \emph{a pesar de},
 422   \emph{echó de menos}; in English, \emph{in front of}, \emph{taken
 423   into account}). The morphological analyser is able to analyse these
 424   complex SFs and treat them appropriately so that they can be
 425   processed by the next modules. In the case of contractions, the
 426   system reads a single surface form and gives as output a sequence of
 427   two or more lexical forms (for instance, the Spanish
 428   preposition-article contraction \emph{del} would be analysed into
 429   two lexical forms, one for the preposition \emph{de} and another one
 430   for the article \emph{el}). Lexical units made of more than one word
 431   (multiwords) are treated as single lexical forms and processed
 432   specifically according to their type.\footnote{For more information
 433   about the treatment of multiwords, please refer to
 434   page~\pageref{ss:multipalabras}.}
 435
 436 Upon receiving as input the example text from the previous module, the
 437 morphological analyser would deliver:
 438 \begin{alltt}
 439 ^es/ser<vbser><pri><p3><sg>\$[ <em>]
 440 ^una/un<det><ind><f><sg>/unir<vblex><prs><1><sg>/unir
 441 <vblex><prs><3><sg>\$
 442 ^señal/señal<n><f><sg>\$[</em>]
 443 \end{alltt}
 444
 445 where each surface form has been analysed into one or more lexical
 446 forms: \emph{es} has been analysed as a single LF with lemma \emph{ser}
 447 (``to be''), whereas \emph{una} receives three analyses: lemma \emph{un}
 448 (``one''), determiner, indefinite, feminine, singular; lemma \emph{unir}
 449 (``to join''), verb in present subjunctive, 1st person singular, and
 450 lemma \emph{unir}, verb in present subjunctive, 3rd person singular.
 451
 452 This module is generated from a source language (SL) morphological
 453 dictionary, the format of which is specified in Section
 454 \ref{ss:diccionarios}.
 455 \item The \emph{part-of-speech tagger} chooses, using a statistical
 456 model (hidden Markov model), one of the analyses of an ambiguous word
 457 according to its context; in the previous example, the ambiguous word
 458 would be the surface form \emph{una}, which can have three different
 459 analyses. A sizeable fraction of surface forms (in Romance languages,
 460 for instance, around one out of every three words) are ambiguous, that
 461 is, they can be analysed into more than one lemma, more than one
 462 part-of-speech or have more than one inflection analysis, and are
 463 therefore an important source of translation errors when the wrong
 464 equivalent is chosen. The statistical model is trained on
 465 representative source-language text corpora.
 466
 467   The result of processing the example text delivered by the
 468   morphological analyser with the part-of-speech tagger would be:
 469
 470 \begin{alltt}
 471 ^ser<vbser><pri><p3><sg>\$[ <em>]^un<det><ind><f><sg>\$
 472 ^señal<n><f><sg>\$[</em>]
 473 \end{alltt}
 474
 475 where the correct lexical form (determiner) has been selected for the
 476 word \emph{una}.
 477
 478
 479   The specification of the part-of-speech tagger is in Section
 480   \ref{ss:tagger}.
 481
 482
 483 \item The \emph{lexical transfer module}, which uses a bilingual
 484 dictionary and is called by the structural transfer module, reads each
 485 LF of the SL and delivers the corresponding target language (TL)
 486 lexical form. The dictionary contains a single equivalent for each SL
 487 lexical form; that is, no word-sense disambiguation is performed
 488 \nota{now not true: lextor}. Multiwords are translated as a single unit.
 489 The lexical forms in the running example would be translated into
 490 Catalan as follows:
 491
 492 \begin{alltt}
 493 ser<vbser> \(\longrightarrow\) ser<vbser>
 494 un<det> \(\longrightarrow\) un<det>
 495 señal<n><f> \(\longrightarrow\) senyal<n><m>
 496 \end{alltt}
 497
 498 This module is generated from a bilingual dictionary, which is
 499 described in Section \ref{ss:diccionarios}.
 500
 501 \item The \emph{structural transfer module}, which detects and
 502 processes patterns of words (chunks or phrases) that need special
 503 processing due to grammatical divergences between the two languages
 504 (gender and number changes, word reorderings, changes in prepositions,
 505 etc.). This module is generated from a file containing rules which
 506 describe the action to be taken for each pattern.  In the running
 507 example, the pattern formed by
 508 \verb!^!\texttt{un<det><ind><f><sg>}\verb!$!
 509 \verb!^!\texttt{señal<n><f><sg>}\verb!$! would be detected by a
 510 determiner--noun rule, which in this case would change the gender of
 511 the determiner so that it agrees with the noun; the result would be:
 512
 513 \begin{alltt}
 514 ^ser<vbser><pri><p3><sg>\$[ <em>]^un<det><ind><m><sg>\$
 515 ^senyal<n><m><sg>\$[</em>]
 516 \end{alltt}
 517
 518  The format of the structural transfer rules file, inspired by the one
 519  described in \cite{garridoalenda01p}, is specified in Section
 520  \ref{ss:transfer}.
 521 \item The \emph{morphological generator}, which, from a lexical form in
 522 the target language, generates a suitably inflected surface form. The
 523 result for the example phrase would be:
 524 \begin{alltt}
 525 és[ <em>]un senyal[</em>]
 526 \end{alltt}
 527
 528 This module is generated from a morphological dictionary, which is
 529 described in detail in Section \ref{ss:diccionarios}.
 530 \item The \emph{post-generator}, which performs some orthographic
 531 operations in the TL such as contractions and apostrophations. It
 532 is generated from a transformation rules file whose format
 533 is very similar to that of the dictionaries mentioned above and
 534 is specified in Section \ref{ss:diccionarios}. In the example
 535 text there is no need to perform any contraction or apostrophation.
 536 \item The \emph{re-formatter}, which restores the original format
 537 information into the translated text; the result for the running
 538 example would be the correct conversion of the text into HTML format:
 539 \begin{alltt}
 540 és <em>un senyal</em>
 541 \end{alltt}
 542
 543
 544 The re-formatter is specified in Section
 545 \ref{ss:formato}.
 546 \end{itemize}
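
As a rough illustration of the text stream exchanged by the modules
listed above, the following Python sketch splits the analyser output
of the running example into lexical units. It is not part of Apertium:
the name \texttt{parse\_units} is hypothetical, and the sketch assumes
that each unit is delimited by \verb!^! and \verb!$!, that the surface
form and its analyses are separated by \texttt{/}, and that no escaped
special characters occur in the input.

```python
import re

# One lexical unit: everything between '^' and the next '$'.
UNIT = re.compile(r"\^(.*?)\$", re.DOTALL)


def parse_units(stream):
    """Return (surface form, [lexical forms]) pairs found in the stream."""
    units = []
    for match in UNIT.finditer(stream):
        surface, *analyses = match.group(1).split("/")
        units.append((surface, analyses))
    return units


# Analyser output for the running example "es <em>una señal</em>".
stream = (
    "^es/ser<vbser><pri><p3><sg>$[ <em>]"
    "^una/un<det><ind><f><sg>/unir<vblex><prs><1><sg>/unir<vblex><prs><3><sg>$ "
    "^señal/señal<n><f><sg>$[</em>]"
)

units = parse_units(stream)
# Units with more than one analysis are the ones the tagger must resolve.
ambiguous = [surface for surface, analyses in units if len(analyses) > 1]
print(ambiguous)  # prints ['una']
```

Listing the ambiguous units in this way shows which words the
part-of-speech tagger has to disambiguate (here only \emph{una}).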
 547
 548 The four lexical processing modules (morphological analyser, lexical
 549 transfer module, morphological generator and post-generator) use a
 550 single compiler, based on a class of \emph{finite-state transducers}
 551 \cite{garrido99j}, in particular, letter transducers
 552 \cite{roche97,ortiz05j}; its characteristics are described in Section
 553 \ref{se:compiladoresdic}.
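
To give an intuition of letter-by-letter lexical processing, the
sketch below builds a trie-shaped finite-state lookup from (surface
form, lexical form) pairs and walks it one letter at a time. It is a
deliberately simplified stand-in, not the compiler described in
Section \ref{se:compiladoresdic}: a real letter transducer labels each
transition with an input--output symbol pair, whereas this toy version
(the class name \texttt{ToyLetterTransducer} is hypothetical) stores
the whole lexical form at the final state.

```python
from collections import defaultdict


class ToyLetterTransducer:
    """Trie-shaped lookup that consumes one input letter per transition."""

    def __init__(self):
        # transitions[state][letter] -> next state; state 0 is initial.
        self.transitions = defaultdict(dict)
        # finals[state] -> lexical forms emitted when the input ends here.
        self.finals = defaultdict(list)
        self.next_state = 1

    def add_entry(self, surface, lexical):
        """Add one dictionary entry, sharing common prefixes."""
        state = 0
        for letter in surface:
            if letter not in self.transitions[state]:
                self.transitions[state][letter] = self.next_state
                self.next_state += 1
            state = self.transitions[state][letter]
        self.finals[state].append(lexical)

    def analyse(self, surface):
        """Return all lexical forms of a surface form ([] if unknown)."""
        state = 0
        for letter in surface:
            state = self.transitions.get(state, {}).get(letter)
            if state is None:
                return []
        return list(self.finals.get(state, []))


fst = ToyLetterTransducer()
fst.add_entry("una", "un<det><ind><f><sg>")
fst.add_entry("una", "unir<vblex><prs><1><sg>")
fst.add_entry("una", "unir<vblex><prs><3><sg>")
fst.add_entry("un", "un<det><ind><m><sg>")
print(fst.analyse("una"))
```

Because entries with a common prefix share states, lookup time depends
on the length of the word rather than on the size of the dictionary.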
 554
 555
 556
 557 \chapter[Stream format specification]{Format specification of the
 558 data stream between modules}
 559 \label{se:flujodatos} \nota{Material duplicated in ``de-formatters and
 560 re-formatters'': declare it, remove it? - task for Gema}
 561
 562 \section{Introduction}
 563
 564 The format of the data that circulate between the engine's modules has
 565 to be specified so that document processing is more effective and
 566 transparent. The proposed system design (see Chapter
 567 \ref{ss:descrarq}) imposes the need to use three different data stream
 568 types, as shown in Figure \ref{fig:fdatos}.
 569
 570 The stream format is text-based to facilitate, among other things, the
 571 diagnosis of possible system errors, since it is easy to manipulate
 572 the stream in order to reproduce the phenomena that are to be tested,
 573 and change it to see the result. Other benefits of using text streams
 574 are that it is possible to test independently the output of each
 575 module, and that it allows for fast building of prototypes to test the
 576 system's global performance, the validity of linguistic data, etc.
 577
 578
 579
 580 \begin{figure}[h]
 581 \begin{center}
 582 \includegraphics[width=14cm]{fdatos}
 583 \end{center}
 584 \caption{The different data stream types in the machine translation
 585 system. See the text for a description of each type.}
 586 \label{fig:fdatos}
 587 \end{figure}
 588
 589 The data stream types are:
 590
 591 \begin{itemize}
 592 \item \textit{Data stream with format:} It is the text in its original
 593 format, with no further marks: XML, ANSI text, RTF, HTML, etc. Since
 594 it is the original format of the documents, nothing needs to be
 595 specified about it except the name of the format.
 596 \item \textit{Data stream without format:} It is the text with
 597 \textit{superblanks}, that is, with special characters that
 598 encapsulate the format (see Section \ref{ss:formato}); superblanks are
 599 treated by the linguistic modules as blanks between words (with some
 600 exceptions).  This is the format generated by the de-formatter and
 601 used by the re-formatter when generating the final translated
 602 document.
 603 \item \textit{Segmented data stream:} In this format, apart from
 604 superblanks, lexical units that are to be translated are also
 605 delimited with special characters. These characters are inserted by the
 606 morphological analyser and deleted by the generator, which delivers
 607 the final surface forms.
 608 \end{itemize}
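
The relation between the first two stream types can be sketched in
Python as follows. The sketch assumes an HTML-like input in which
every tag is short enough to become a plain superblank in square
brackets, as in the example of Chapter \ref{ss:descrarq}. The function
names \texttt{deformat} and \texttt{reformat} are hypothetical, and
extensive superblanks, artificial sentence endings and protected
characters are ignored.

```python
import re

TAG = re.compile(r"<[^>]+>")              # an HTML-like tag
SUPERBLANK = re.compile(r"\[([^\]]*)\]")  # a bracketed superblank


def deformat(text):
    """Wrap every tag in brackets so later modules treat it as a blank."""
    return TAG.sub(lambda m: "[" + m.group(0) + "]", text)


def reformat(stream):
    """Drop the brackets, restoring the encapsulated format information."""
    return SUPERBLANK.sub(lambda m: m.group(1), stream)


print(deformat("es <em>una señal</em>"))      # es [<em>]una señal[</em>]
print(reformat("és[ <em>]un senyal[</em>]"))  # és <em>un senyal</em>
```

Since the linguistic modules only ever see bracketed blocks, the same
pair of functions could in principle be swapped for another format
without touching the rest of the pipeline.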
 609
 610
 611 We describe next the characteristics of the data stream used between
 612 the modules of the translator, that is, the second and the third
 613 stream types. In general terms, it is a plain text format marked with
 614 characters that have a special meaning. This format is intended for
 615 processing on servers that translate large volumes of text.
 616
 617 Some of the formats that the engine can process may contain extensive
 618 blocks of information in binary format (RTF, for instance, may
 619 include bitmap images).  To process this type of document
 620 efficiently, we designed a way to extract this information and
 621 restore it after translation has been performed; see Section
 622 \ref{ss:formato} for a complete description.
 623
 624 \section{Data stream without format}
 625
 626 Data stream without format is output by the de-formatter and by the
 627 generator \nota{not quite: post-generator}, and is used as input by
 628 the morphological analyser, the post-generator and the re-formatter.
 629
 630 This section describes
 631 the method used to
 632 delimit \textit{superblanks} and \textit{extensive superblanks}. As an example
 633 we will use the HTML document in
 634 Figure~\ref{fg:docorig}.
 635
 636 \begin{figure}[htbp]
 637 \begin{small}
 638 \begin{alltt}
 639 <\textbf{html}>
 640   <\textbf{head}>
 641     <\textbf{title}>Title</\textbf{title}>
 642   </\textbf{head}>
 643   <\textbf{body}>
 644     <\textbf{p}>Divided
 645        sentence</\textbf{p}>
 646   </\textbf{body}>
 647 </\textbf{html}>
 648 \end{alltt}
 649 \end{small}
 650 \caption{Example of HTML document}
 651 \label{fg:docorig}
 652 \end{figure}
 653
 654 The structural elements that this data stream type must include are
 655 the following:
 656
 657 \begin{itemize}
 658 \item \textit{Superblanks}.  Blocks that contain segments of format
 659 information included in the documents, when these are short.
 660 \item \textit{Extensive superblanks}.  Marks that are used to specify
 661 external documents that include segments of format information for the
 662 document being processed, when these segments are long.
 663 \item \textit{Text}. The document text that can be translated.
 664 \item \textit{Artificial sentence endings}. \label{finfrase} When the
 665 format in the document suggests a sentence separation that is not
 666 signalled by any punctuation mark (for instance, titles with no full
 667 stop at the end, or the content of cells in a table), the format
 668 processing must have a mechanism (invisible to the user) that enables
 669 the marking of these sentence endings.
 670 \item \textit{Special characters protection (for non-XML stream)}.
 671   Characters that must be protected to avoid conflict with the ones
 672   used in the data stream format.
 673 \end{itemize}
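
Protection of special characters can be sketched as a pair of escaping
functions. Both the escape character (a backslash) and the set of
protected characters used below are assumptions made for the sake of
the example, not a specification of the actual stream format; the
function names \texttt{protect} and \texttt{unprotect} are
hypothetical as well.

```python
import re

# Assumed set of characters with a special meaning in the stream.
SPECIAL = re.compile(r"([\[\]^$/<>{}@\\])")
# A backslash followed by the character it protects.
ESCAPED = re.compile(r"\\(.)", re.DOTALL)


def protect(text):
    """Prefix every special character with a backslash."""
    return SPECIAL.sub(lambda m: "\\" + m.group(1), text)


def unprotect(text):
    """Remove the protecting backslashes again."""
    return ESCAPED.sub(lambda m: m.group(1), text)


sample = "3/4 [approx.] costs $5"
print(protect(sample))
# Protection must be reversible so the re-formatter can undo it.
assert unprotect(protect(sample)) == sample
```

Escaping the escape character itself keeps the transformation
reversible even when the original text already contains backslashes.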
 674
 675 % \subsection{XML format}
 676
 677 % In this type of stream the element \texttt{<\textbf{b}>} is used to define
 678 % superblanks and extensive superblanks.  For
 679 % \textbf{superblanks} the syntax is as follows:
 680
 681 % \begin{small}
 682 % \begin{alltt} % <\textbf{b}>\textit{content of the format block}</\textbf{b}>
 683 % \end{alltt}
 684 % \end{small}
 685
 686 % Note that for SGML-based formats the format must be
 687 % enclosed in \texttt{<![CDATA[\ldots]]>} blocks inside
 688 % the tags indicated. \nota{better to say what they look like: take text from EAMT
 689 % '05 - Gema} \textit{Extensive superblanks}, in turn, must be
 690 % expressed, as attributes, in the following way:
 691
 692 % \begin{small}
 693 % \begin{alltt} % <\textbf{b} \textsl{filename}="\textit{nombre de fichero}"/>
 694 % \end{alltt}
 695 % \end{small}
 696
 697 % El \emph{texto} estará incluido entre los elementos \textbf{b} que se
 698 % acaban de explicar sin ninguna marca de estructura particular.
 699
 700 % Los \emph{finales de frase artificiales} se expresan mediante un punto y un
 701 % superblanco vacío inmediatamente a continuación.
 702
 703 % \begin{small}
 704 % \begin{alltt} % .<\textbf{b}/>
 705 % \end{alltt}
 706 % \end{small}
 707
 708 % Resumiendo, el flujo de datos de un documento en cualquier formato de los que
 709 % trata el traductor se reduce a otro documento XML que debe cumplir la
 710 % siguiente DTD:
 711
 712 % \begin{small}
 713 % \begin{alltt} % <!\textsl{ELEMENT} \textbf{document} (b|\textsl{#PCDATA})*>
 714 % <!\textsl{ELEMENT} \textbf{b} (\textsl{#PCDATA}?)>
 715 % <!\textsl{ATTLIST} b filename \textsl{CDATA} \textsl{#IMPLIED}>
 716 % \end{alltt}
 717 % \end{small}
 718
 719 % El resultado de encapsular el formato del fichero de la
 720 % figura~\ref{fg:docorig} en el flujo con formato XML se ve en la
 721 % figura~\ref{fg:docorigXML}.  Si hubiese algún superblanco que por su longitud
 722  % se convirtiese en un superblanco extenso, la forma de especificarlo sería como sigue:
 723 % \begin{small}
 724 % \begin{alltt} % <\textbf{b} \textsl{filename}="/tmp/ficherotemporal"/>
 725 % \end{alltt}
 726 % \end{small}donde \texttt{"/tmp/ficherotemporal"} es un fichero que
 727 % contiene el superblanco extenso para que pueda ser recuperado por el reformateador.
 728
 729 % \begin{figure}
 730 % \begin{small}
 731 % \begin{alltt} % <?\textbf{xml} \textsl{version}="1.0" \textsl{encoding}="iso-8859-15"?>
 732 % <\textbf{document}>
 733 % <\textbf{b}><![CDATA[<html> % <head>
 734  % <title>]]></\textbf{b}>Título.<\textbf{b}/><\textbf{b}><![CDATA[</title>
 735 % </head> % <body> % <p>]]></\textbf{b}>Frase<\textbf{b}><![CDATA[
 736 % ]]></\textbf{b}>dividida.<\textbf{b}/><\textbf{b}><![CDATA[ % </body>
 737 % </html>]]></\textbf{b}> % </\textbf{document}>
 738 % \end{alltt}
 739 % \end{small}
 740 % \caption{El documento de la figura \protect\ref{fg:docorig} con el
 741 % formato encapsulado usando marcas en XML y segmentos
 742 %\texttt{<![CDATA[\ldots]]>}}
 743 % \label{fg:docorigXML}
 744 % \end{figure}
 745
 746 %\subsection{Formato no XML}
 747 \subsection{Stream format}
 748 \label{se:noxml1} This format is based on the one used in the machine
 749 translation systems \textsf{interNOSTRUM}
 750 \cite{canals01b,garridoalenda01p,garrido99j} and \textsf{Traductor
 751 Universia} \cite{garrido03p, gilabert03j}.
 752
 753 In this stream type, the characters \texttt{[} and \texttt{]} are used
 754 to indicate \emph{superblanks}, as shown in the following example:
 755
 756 \begin{small}
 757 \begin{alltt}
 758 [\textit{superblank content}]
 759 \end{alltt}
 760 \end{small}
 761
 762 In the case of \emph{extensive superblanks}, the file name is
 763 specified using the at sign \texttt{@}:
 764
 765 \begin{small}
 766 \begin{alltt}
 767 [@\textit{file name}]
 768 \end{alltt}
 769 \end{small}
 770
 771 The \emph{text} is outside the superblank marks.
 772
 773 \emph{Artificial sentence endings}
 774 are expressed by a full stop and an empty superblank right after it.
 775
 776 \begin{small}
 777 \begin{alltt}
 778 .[]
 779 \end{alltt}
 780 \end{small}
 781
 782 The following table shows the \textbf{protected characters}:
 783
 784 \begin{center}
\begin{tabular}{|l|c|c|l|} \hline Name & Character & Protected form&
Meaning \\
\hline
At & \texttt{@} & \verb!\@! & Extensive superblank\\
Slash & \texttt{/} & \verb!\/! & Separator of surface and lexical forms\\
Backslash & \verb!\! & \verb!\\! & Protection character \\
Caret & \verb!^! & \verb!\^! & Beginning of word\\
Opening square bracket & \texttt{[} & \verb!\[! & Beginning of superblank\\
Closing square bracket & \texttt{]} & \verb!\]! & End of superblank \\
Dollar & \verb!$! & \verb!\$! & End of word\\
Less than & \texttt{<} & \verb!\<! & Beginning of morph. symbol\\
Greater than & \texttt{>} & \verb!\>! & End of morph. symbol \\
\hline
\end{tabular}
 799 \end{center}
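
The protection rules in the table can be sketched in a few lines of
Python.  This is only an illustration and not the code used by the
Apertium format processing programs: the names \texttt{RESERVED},
\texttt{protect} and \texttt{unprotect} are hypothetical.

\begin{small}
\begin{verbatim}
# Reserved characters of the stream format, taken from the table above.
RESERVED = set("@/\\^[]$<>")


def protect(text):
    """Prefix every reserved character with a backslash."""
    return "".join("\\" + ch if ch in RESERVED else ch for ch in text)


def unprotect(text):
    """Undo protect(): drop each protecting backslash, keep the next char."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            i += 1  # skip the protecting backslash itself
        out.append(text[i])
        i += 1
    return "".join(out)
\end{verbatim}
\end{small}

Under these assumptions, protecting a string and then unprotecting it
gives back the original string, which is the property that the format
processing relies on.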
 800
 801
Figure~\ref{fg:docorigtext} shows the document in
Figure~\ref{fg:docorig} after encapsulation.
 804
\begin{figure}[htbp]
\begin{small}
\begin{alltt}
[<html>
  <head>
    <title>]Title.[][</title>
  </head>
  <body>
    <p>]Divided[
       ]sentence.[][</p>
  </body>
</html>]
\end{alltt}
\end{small}
\caption{The document in Figure \protect\ref{fg:docorig} with format
encapsulated using square brackets}
\label{fg:docorigtext}
\end{figure}
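
As an illustration of how a module might read this stream, the
following Python sketch splits it into superblanks and text.  It is a
minimal example under stated assumptions, not the project's
implementation: the function name \texttt{split\_stream} and the chunk
labels are hypothetical, and protected characters are simply copied
through.

\begin{small}
\begin{verbatim}
def split_stream(stream):
    """Split a stream into ('blank', content) and ('text', content) chunks.

    A backslash protects the character that follows it, so protected
    brackets never open or close a superblank.
    """
    chunks, buf, in_blank = [], [], False
    i = 0
    while i < len(stream):
        ch = stream[i]
        if ch == "\\" and i + 1 < len(stream):
            buf.append(stream[i:i + 2])  # keep the protected pair untouched
            i += 2
            continue
        if ch == "[" and not in_blank:
            if buf:
                chunks.append(("text", "".join(buf)))
            buf, in_blank = [], True
        elif ch == "]" and in_blank:
            chunks.append(("blank", "".join(buf)))
            buf, in_blank = [], False
        else:
            buf.append(ch)
        i += 1
    if buf:
        chunks.append(("text", "".join(buf)))
    return chunks
\end{verbatim}
\end{small}

In this sketch an empty superblank directly after a full stop
corresponds to an artificial sentence ending, and a superblank whose
content starts with \texttt{@} corresponds to an extensive superblank.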
 823
 824
 825 \section{Segmented data stream}
 826
The segmented data stream is the stream that circulates between the
modules that handle linguistic information in the translation engine.
In this stream, words are delimited and labelled.  There are two types
of segmented stream:
 831
 832 \begin{itemize}
\item \textit{Ambiguous segmented stream}. Its main characteristic is
that words have a surface form and potentially more than one lexical
form (lexical multiform).  This stream type is the format in which
the morphological analyser provides the input data for the
part-of-speech tagger (see diagram \ref{eq:formaanalizada} on
page~\pageref{formaanalizada} for a detailed description of the
ambiguous segmented stream).

\item \textit{Unambiguous segmented stream}.  It has only one lexical
form for each word and it does not include the surface form.  This is
the format in which data circulate from the part-of-speech tagger to
the transfer module, and from this module to the generator (see
diagram \ref{eq:formaanalizada2} on page~\pageref{formaanalizada2} for
a detailed description of the format of the unambiguous segmented
stream).
 847 \end{itemize}
 848
In addition to the information already marked in the stream described
in the previous section, the segmented stream must allow the following
information to be marked:
 852
 853 \begin{itemize}
\item \textit{Lexical units}.  A lexical unit is made of a surface
form (in the case of the ambiguous segmented stream) plus one or more
lexical forms (the different possible analyses of the surface form)
with their grammatical symbols.
 858 \item \textit{Surface forms (ambiguous segmented stream)}.  The word
 859 as it appears in the original text.
 860 \item \textit{Lexical forms}.  The lemma of the word and its
 861 grammatical symbols.
 862 \item \textit{Grammatical symbols}.  They describe the morphological
 863 and grammatical attributes of a surface form.
 864 \end{itemize}
 865
 866 % \subsection{XML format}
 867
 868 % Las \textit{palabras} se etiquetan de la forma que se muestra a
 869 % continuación:
 870
 871 % \begin{small}
 872 % \begin{alltt} % <\textbf{w}>\textit{información de la palabra}</\textbf{w}>
 873 % \end{alltt}
 874 % \end{small}
 875
 876 % Para el caso del \textit{flujo de datos segmentado ambiguo}, la
 877 % \textit{forma superficial} se indica en el interior de un elemento
 878 % \texttt{<\textbf{w}>} mediante el contenido de un único elemento
 879 %\texttt{<\textbf{sf}>}.  A continuación, se sitúan la forma o
 880 %\textit{formas léxicas} que sean necesarias:
 881
 882 % \begin{small}
 883 % \begin{alltt} % <\textbf{w}> % <\textbf{sf}>\textit{forma superficial}</\textbf{sf}>
 884 % <\textbf{lf}>\textit{forma léxica 1}</\textbf{lf}>
 885 % <\textbf{lf}>\textit{forma léxica 2 (opcional)}</\textbf{lf}>
 886 % ...  % </\textbf{w}>
 887 % \end{alltt}
 888 % \end{small}
 889
 890 % Para el caso del flujo no ambiguo, sólo se especifica una única forma léxica.
 891
 892
 893 % \begin{small}
 894 % \begin{alltt} % <\textbf{w}> % <\textbf{lf}>\textit{forma léxica}</\textbf{lf}> % </\textbf{w}>
 895 % \end{alltt}
 896 % \end{small}
 897
 898 % %% \pagebreak
 899
 900 % La DTD de este flujo de datos para textos \textit{sin desambiguar} es la % que se muestra en la figura~\ref{fg:ambdtd} a continuación.
 901
 902 % \begin{figure}[here]
 903 % \begin{small}
 904 % \begin{alltt}
 905 %   <!\textsl{ELEMENT} \textbf{document} (b|w|\textsl{#PCDATA})*>
 906 %   <!-- atención, el #PCDATA anterior sigue siendo necesario para los
 907 %        carácteres no etiquetados y que no forman parte del formato -->
 908 %   <!\textsl{ELEMENT} \textbf{b} (\textsl{#PCDATA}?)>
 909 %   <!\textsl{ATTLIST} b filename \textsl{CDATA} \textsl{#IMPLIED}>
 910 %   <!\textsl{ELEMENT} \textbf{w} (sf,lf+)>
 911 %   <!\textsl{ELEMENT} \textbf{sf} (\textsl{#PCDATA})>
 912 %   <!\textsl{ELEMENT} \textbf{lf} (\textsl{#PCDATA}|s)+>
 913 %   <!\textsl{ELEMENT} \textbf{s} \textsl{EMPTY}>
 914 %   <!\textsl{ATTLIST} s n \textsl{IDREF #REQUIRED}>
 915 % \end{alltt}
 916 % \end{small}
 917 % \caption{DTD para textos no desambiguados con formato XML}
 918 % \label{fg:ambdtd}
 919 % \end{figure}
 920
 921
 922
 923
 924 % Para los ya \textit{ desambiguados}, los textos deben cumplir la DTD de la figura~\ref{fg:desambdtd}.
 925
 926 % \begin{alltt}
 927 %   <!\textsl{ELEMENT} \textbf{document} (b|w|\textsl{#PCDATA})*>
 928 %   <!-- atención, el #PCDATA anterior sigue siendo necesario para los
 929 %        carácteres no etiquetados y que no forman parte del formato -->
 930 %   <!\textsl{ELEMENT} \textbf{b} (\textsl{#PCDATA}?)>
 931 %   <!\textsl{ATTLIST} b filename \textsl{CDATA} \textsl{#IMPLIED}>
 932 %   <!\textsl{ELEMENT} \textbf{w} (lf)>
 933 %   <!\textsl{ELEMENT} \textbf{lf} (\textsl{#PCDATA}|s)+>
 934 %   <!\textsl{ELEMENT} \textbf{s} \textsl{EMPTY}>
 935 %   <!\textsl{ATTLIST} s n \textsl{IDREF #REQUIRED}>
 936 % \end{alltt}
 937 % \end{small}
 938 % \caption{DTD para textos desambiguados con formato XML}
 939 % \label{fg:desambdtd}
 940 % \end{figure}
 941
 942 % La figura~\ref{fg:docorigXML2} muestra un ejemplo de segmentación del flujo
 943 % que incluye la forma de encapsular el formato y la información léxica.  Este
 944 % ejemplo es para el caso de flujo segmentado ambiguo y corresponde al texto
 945 % HTML original de la figura~\ref{fg:docorig}.
 946
 947 % \begin{figure}[htbp]
 948 % \begin{small}
 949 % \begin{alltt}
 950 % <?\textbf{xml} \textsl{version}="1.0" \textsl{encoding}="iso-8859-15"?>
 951 % <document>
 952 % <\textbf{b}><![CDATA[<html>
 953 %   <head>
 954 %     <title>]]></\textbf{b}>
 955 % <\textbf{w}>
 956 %   <\textbf{sf}>Título<\textbf{sf}>
 957 %   <\textbf{lf}>Título<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="m"/><\textbf{s} \textsl{n}="sg"/></\textbf{lf}>
 958 % </\textbf{w}>
 959 % <\textbf{w}>
 960 %   <\textbf{sf}>.</\textbf{sf}>
 961 %   <\textbf{lf}>.<s n="sent"/></\textbf{lf}>
 962 % </\textbf{w}><\textbf{b}/>
 963 % <\textbf{b}><![CDATA[</title>
 964 %   </head>
 965 %   <body>
 966 %     <p>]]></\textbf{b}>
 967 % <\textbf{w}>
 968 %   <\textbf{sf}>Frase</\textbf{sf}>
 969 %   <\textbf{lf}>Frase<s n="n"/><s n="f"/><s n="sg"/></\textbf{lf}>
 970 % </\textbf{w}>
 971 % <\textbf{b}><![CDATA[
 972 %        ]]></\textbf{b}>
 973 % <\textbf{w}>
 974 %   <\textbf{sf}>dividida</\textbf{sf}>
 975 %   <\textbf{lf}>dividir<s n="vblex"/><s n="pp"/><s n="f"/><s n="sg"/></\textbf{lf}>
 976 % </\textbf{w}>
 977 % <\textbf{w}>
 978 %   <\textbf{sf}>.</\textbf{sf}>
 979 %   <\textbf{lf}>.<s n="sent"/></\textbf{lf}>
 980 % </\textbf{w}><\textbf{b}/>
 981 % <\textbf{b}><![CDATA[
 982 %   </body>
 983 % <html>]]></\textbf{b}>
 984 % </document>
 985 % \end{alltt}
 986 % \end{small}
 987 % \caption{Ejemplo de flujo segmentado con el formato encapsulado en XML,
 988 %   correspondiente al documento HTML de la figura~\ref{fg:docorig}.}
 989 % \label{fg:docorigXML2}
 990 % \end{figure}
 991 %\subsection{Formato no XML}
 992 %\subsubsection{Formato de flujo}
 993 \label{se:noxml2} The symbols '\verb!^!' for word beginning and
 994 '\verb!$!' for word end are used to delimit \textit{words}, as shown
 995 in this example:
 996
 997 \begin{small}
 998 \begin{alltt}
 999   \verb!^!\textit{word}\verb!$!
1000 \end{alltt}
1001 \end{small}
1002
To separate the \textit{surface form} from the \textit{lexical forms}
that follow it, the symbol \texttt{/} is used.  This separator only
makes sense in the ambiguous segmented stream, since the unambiguous
stream contains only the lexical form.  It is used as follows:
1008
1009 \begin{small}
1010 \begin{alltt}
1011   \verb!^!\textit{surface form}/\textit{lexical form 1}/...\verb!$!
1012 \end{alltt}
1013 \end{small}
1014
1015 Lexical forms can include symbols (generally located at the end), as
1016 shown in the example of Figure \ref{fg:docorigtext2}.
1017
1018
\begin{figure}
\begin{small}
\begin{alltt}
[<html>
   <head>
     <title>]^Title/Title<n><m><sg>\$^./.<sent>\$[][</title>
   </head>
   <body>
     <p>]^Divided/Divide<vblex><pp>/Divide<vblex><past>\$[
        ]^sentence/sentence<n><sg>/sentence<vblex><inf>\$^./.<sent>\$[][</p>
   </body>
</html>]
\end{alltt}
\end{small}
\caption{Example of segmented stream with format encapsulated in
non-XML format, corresponding to the HTML document in
Figure~\ref{fg:docorig}.}
\label{fg:docorigtext2}
\end{figure}
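
The structure of the ambiguous segmented stream can be made concrete
with a short Python sketch that extracts lexical units from it.  It is
a minimal illustration and not the parser used by the Apertium modules:
the names \texttt{parse\_units} and \texttt{split\_unprotected} are
hypothetical, and the sketch assumes that superblanks contain no
unprotected \verb!^! or \verb!$! characters.

\begin{small}
\begin{verbatim}
import re

# ^ ... $ delimit a lexical unit; a backslash protects the next character.
UNIT = re.compile(r"\^((?:\\.|[^\\$])*)\$")
TAG = re.compile(r"<([^<>]+)>")


def split_unprotected(text, sep="/"):
    """Split on sep, ignoring occurrences protected by a backslash."""
    parts, buf, i = [], [], 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            buf.append(text[i:i + 2])
            i += 2
        elif text[i] == sep:
            parts.append("".join(buf))
            buf = []
            i += 1
        else:
            buf.append(text[i])
            i += 1
    parts.append("".join(buf))
    return parts


def parse_units(stream):
    """Return (surface form, [(lemma, [symbols]), ...]) for each unit."""
    units = []
    for match in UNIT.finditer(stream):
        surface, *analyses = split_unprotected(match.group(1))
        forms = []
        for lexical_form in analyses:
            lemma = lexical_form.split("<", 1)[0]
            forms.append((lemma, TAG.findall(lexical_form)))
        units.append((surface, forms))
    return units
\end{verbatim}
\end{small}

For a unit with two lexical forms, the function returns the surface
form together with two (lemma, symbols) pairs; for the unambiguous
stream the same code would treat the single lexical form as the first
field, so a real implementation would need a separate entry point.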
1038
1039
1040
1041
1042
1043 \chapter{Modules specification}
1044 \label{se:especificmodulos}
1045
1046
1047
1048 \section{Lexical processing modules}
1049 \label{ss:modproclex}
1050
1051 \subsection{Module description }
1052 \label{ss:funcproclex}
1053
One of the most efficient approaches to lexical processing is based on
the use of finite-state transducers (FSTs)
\cite{mohri97a,roche97b}. FSTs are a type of finite-state automaton
that may be used as one-pass morphological analysers and generators
and may be implemented very efficiently. In this project, we have used
a class of FSTs called letter transducers
\cite{roche97b,garrido02a,garrido99j}; in fact, any finite-state
transducer may always be turned into a letter transducer. Garrido and
collaborators \cite{garrido99j,garrido02a} give a formal definition of
the letter transducers used in this project; informally, a letter
transducer is an idealised machine consisting of:
1065 \begin{enumerate}
1066
1067 \item A (finite) set of states, that is, of situations in which the
1068 transducer can be while it is reading, from left to right, the input
1069 letters or symbols. Among the states of the set, we can distinguish:
1070
1071 \begin{enumerate}
1072 \item A single initial state: this is the state in which the
1073 transducer is before processing the first letter or the first symbol
1074 of the input.
1075 \item One or more acceptance states, which are only reached after
1076 having completely read a valid entry and, therefore, are used to
1077 detect valid words.
1078 \end{enumerate}
1079 \item A set (also finite) of state transitions consisting of:
1080 \begin{enumerate}
1081 \item the origin state
1082 \item the destination state
1083 \item the input letter or symbol
1084 \item the output letter or symbol
\end{enumerate} To allow the input and the output to have different
lengths at any time, a transition may lack an input symbol, an output
symbol, or both. A missing symbol is generally represented using a
special symbol (the empty symbol).
1090 \end{enumerate}
1091
Every time the transducer reads an input symbol, it creates a list of
\emph{live} or \emph{active} states, each of which has an associated
output (a sequence of symbols). The way the letter transducer works is
different for each type of lexical processing operation. For example,
in morphological analysis, the transducer tries to read the longest
input recognised by the dictionary (``left-to-right, longest-match''
mode), as follows:
1099 \begin{enumerate}
1100 \item Beginning: the set of live states is given a single live state:
1101 the initial state, with the empty word ("") as output associated to
1102 the state.
1103 \item When from one of the states in the current set of live states it
1104 is possible to reach other states through transitions that do not have
1105 input symbol, these states are added to the set of live states, and
1106 are associated to the output obtained when extending the associated
1107 outputs with the output symbol found in the corresponding
1108 transitions. This expansion operation of the set of live states
1109 continues until it is not possible to add more states.
1110 \item A symbol from the input word is read.
\item A new set of live states is created, made up of the states
reached through transitions that have that symbol as input; these
states are associated with the outputs extended by adding the
corresponding output symbols found in the transitions.
\item If the current set has any live state, the process returns to
step 2.
1117 \item The sets of live states are read backwards until a set is found
1118 which contains acceptance states. The morphological analyses will be
1119 the outputs associated to these states, and the reading position is
1120 set to the position immediately after this set (so that it can be
1121 processed again by the transducer in the next pass).
1122 \end{enumerate} Not all acceptance states have the same
1123 characteristics, and this fact adds more conditions to the acceptance
1124 process, in order to be able to deal with unknown words or with words
1125 that are joined to other words, as will be explained later.
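
The steps above can be sketched in Python as follows.  This is a
simplified illustration, not the Apertium implementation: the class
\texttt{LetterTransducer}, the tuple \texttt{Transition} and the toy
transducer \texttt{TOY} are hypothetical, transitions without an input
symbol are assumed not to form cycles, and the additional conditions on
acceptance states mentioned above are not modelled.

\begin{small}
\begin{verbatim}
from collections import namedtuple

# origin state, destination state, input symbol, output symbol;
# the empty string "" plays the role of the empty symbol.
Transition = namedtuple("Transition", "src dst inp out")


class LetterTransducer:
    def __init__(self, initial, finals, transitions):
        self.initial = initial
        self.finals = set(finals)
        self.transitions = list(transitions)

    def _closure(self, live):
        """Step 2: follow transitions that have no input symbol."""
        live = set(live)
        pending = list(live)
        while pending:
            state, output = pending.pop()
            for t in self.transitions:
                if t.src == state and t.inp == "":
                    item = (t.dst, output + t.out)
                    if item not in live:
                        live.add(item)
                        pending.append(item)
        return live

    def _step(self, live, symbol):
        """Step 4: states reached by consuming one input symbol."""
        return {(t.dst, output + t.out)
                for state, output in live
                for t in self.transitions
                if t.src == state and t.inp == symbol}

    def longest_match(self, text, start=0):
        """Return (end position, outputs) of the longest accepted prefix."""
        live = self._closure({(self.initial, "")})    # steps 1 and 2
        history = [(start, live)]
        pos = start
        while live and pos < len(text):               # steps 3 to 5
            live = self._closure(self._step(live, text[pos]))
            pos += 1
            history.append((pos, live))
        for end, states in reversed(history):         # step 6
            outputs = sorted(o for s, o in states if s in self.finals)
            if outputs and end > start:
                return end, outputs
        return None


# Toy transducer accepting "cat" and "cats" (illustrative data only).
TOY = LetterTransducer(
    initial=0, finals={5, 6},
    transitions=[
        Transition(0, 1, "c", "c"), Transition(1, 2, "a", "a"),
        Transition(2, 3, "t", "t"), Transition(3, 4, "", "<n>"),
        Transition(4, 5, "", "<sg>"), Transition(4, 6, "s", "<pl>"),
    ])
\end{verbatim}
\end{small}

With this toy transducer, the input \texttt{cats} is consumed
completely, whereas for the input \texttt{cat.} the sets of live states
are read backwards to the position after \texttt{cat}, so that the full
stop is left for the next pass.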
1126
The transducer reads the input word only once on average, from left
to right and symbol by symbol, and keeps a tentative list of possible
1129 partial outputs that is updated and pruned as the input is being
1130 read. When letter transducers are used as morphological analysers or
1131 as lemmatizers, they read a surface form and write the resulting
1132 lexical form(s). In this case, input symbols are the letters of the
1133 surface form, and output symbols are the letters needed to write the
1134 lemmas, as well as the letters and special symbols needed to represent
1135 the morphological analysis, such as in \texttt{<n>}, \texttt{<f>},
1136 \texttt{<2p>}, etc.
1137
1138 The transducers work in a similar way for other lexical processing
1139 tasks.
1140
\nota{The notion of LRLM (left-to-right, longest-match; ODSCML in
Spanish) must be made clear in the description of how the
morphological analyser and the structural transfer work. Add material
from the EAMT 2005 paper.}
1145
1146 \subsubsection{Letter case handling in dictionaries}
1147 \label{mayusc}
1148
1149 The same input word in a lexical processing module can be written
1150 differently regarding letter case.  The most frequent cases are:
1151
1152 \begin{itemize}
1153 \item The whole word is in lower case.
1154 \item The whole word is in upper case.
1155 \item The first letter is capitalised and the rest is in lower case
1156 (typical case for proper nouns).
1157 \end{itemize}
1158
Dictionary entries can also appear in these three forms.  The way in
which a word is written in the dictionary is used to discard possible
analyses of the word, according to the following rules:

\begin{itemize}
\item If the input letter is upper case and the current analysis
state has matching transitions in lower case, these transductions are
made.
\item If the input letter is lower case and the current state has no
matching transitions in lower case, no transduction is made.
\end{itemize}
1172
Thanks to this policy, a surface form that is not capitalised cannot
be analysed as a proper noun.
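
The effect of these two rules can be illustrated with a small Python
sketch that compares a whole surface form with the way an entry is
written in the dictionary.  It is a simplification under stated
assumptions: the function name \texttt{may\_match} is hypothetical, and
the comparison is made letter by letter on complete words rather than
on transducer states.

\begin{small}
\begin{verbatim}
def may_match(surface, entry):
    """True if `surface` may be analysed with the dictionary form `entry`.

    An upper-case input letter may follow an upper-case or a lower-case
    transition; a lower-case input letter may only follow a lower-case one.
    """
    return len(surface) == len(entry) and all(
        e in (s, s.lower()) for s, e in zip(surface, entry))
\end{verbatim}
\end{small}

For instance, \texttt{ALACANT} and \texttt{Alacant} may both be
analysed with an entry written \texttt{Alacant}, but \texttt{alacant}
may not.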
1175
1176 The case of an input word will be maintained in the output of the
1177 translator unless it is decided not to do so. The case can be changed
1178 in the structural transfer module; this option is useful, for example,
1179 when there is a reordering of words or when a word is added before a
1180 capitalised word at the beginning of a sentence, such as in the
1181 translation of the Catalan phrase \emph{Vindran} into
1182 English: \emph{They will come}.
1183
1184
1185 \subsection{Data format: the dictionaries}
1186 \label{ss:diccionarios}
1187 \subsubsection{General criteria for dictionary design}
1188
1189 The experience of the Transducens group at the Universitat d'Alacant
1190 in the creation of machine translation systems between Romance
1191 languages (\texttt{es}, \texttt{ca} and \texttt{pt}) already operative
1192 and publicly accessible has inspired the main characteristics of the
1193 whole shallow-transfer machine translation system described in this
1194 document, as well as its application to the Romance languages of Spain
1195 (\texttt{es}, \texttt{ca} and \texttt{gl}). In some sense, it could be
1196 stated that in the present project the only work was to adapt (rewrite
1197 in a standardised and interoperable format) the specifications and
1198 programs used in already operative projects.
1199
In particular, the design of the dictionaries is based on an
architecture that aims to separate, as far as possible, the source
language from the target language, even though these dictionaries are
translation-oriented and, therefore, it is not advisable to build
them completely separately.  The chosen format
1205 is used for the specification of both morphological dictionaries
1206 (monolingual) and bilingual dictionaries.
1207
The format for dictionaries, as well as for the rest of the linguistic
data (definition file for the part-of-speech tagger and structural
transfer rules), is XML\footnote{\url{http://www.w3.org/XML/}}, an
international standard used in numerous natural language processing
projects which, thanks to the availability of many utilities and
libraries, is becoming a very powerful tool for linguistic data
representation and exchange (see \cite{ide00}).
1215
1216
1217
Dictionaries are designed so that they can be compiled into
\textit{letter transducers}, for efficiency reasons. For more
1220 information on letter transducers as a particular case of finite-state
1221 transducers, see Section \ref{ss:funcproclex} or the article
1222 \cite{garrido02a}.
1223
1224 The letter transducers that are generated from the system dictionaries
1225 (morphological, bilingual and post-generation dictionaries) process
1226 input character strings to produce output strings. According to this,
1227 dictionaries are made of entries consisting of string pairs that
1228 correspond to the inputs and outputs of the transducer.
1229
1230
1231 The most powerful tool in these dictionaries is the definition and use
1232 of \emph{paradigms}. Since in Romance languages a lot of lemmas share
1233 the same inflection pattern (there are regularities in their
1234 inflection), it is useful and straightforward to group these
1235 regularities in inflection paradigms to avoid having to write all the
1236 forms of every word. Paradigms allow the representation of dictionary
1237 entries compactly and help optimise the speed for building a
1238 dictionary. Once the most frequent paradigms in a dictionary are
1239 defined, the linguist does not need to bother, in most of the cases,
1240 with the whole inflection of a new term, since entering an inflective
1241 word is generally limited to writing the lemma and choosing one
1242 inflection pattern among the previously defined paradigms.
Furthermore, the use of paradigms reduces memory requirements,
facilitates the construction of efficient letter transducers and
speeds up the compilation process \cite{ortiz05j}. We did not use
paradigms in bilingual dictionaries (although it is possible to do so)
because most of the inflection information is processed implicitly in
these dictionaries, as explained on page~\pageref{ss:bil}.
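
The idea behind paradigms can be illustrated with a short Python
sketch in which a paradigm is a list of pairs (surface suffix, lexical
suffix) and an entry only names a stem and a paradigm.  The paradigm
name \texttt{casa\_\_n}, the data and the function \texttt{expand} are
hypothetical and serve only as an example; the actual XML format is
described later in this chapter.

\begin{small}
\begin{verbatim}
# Hypothetical paradigm: (surface suffix, lexical suffix) pairs.
PARADIGMS = {
    "casa__n": [("", "<n><f><sg>"), ("s", "<n><f><pl>")],
}

# Each entry is just a stem plus the name of its paradigm.
ENTRIES = [("casa", "casa__n"), ("mesa", "casa__n")]


def expand(stem, paradigm_name, paradigms=PARADIGMS):
    """Return the (surface form, lexical form) pairs of one entry."""
    return [(stem + surface, stem + lexical)
            for surface, lexical in paradigms[paradigm_name]]


PAIRS = [pair for stem, name in ENTRIES for pair in expand(stem, name)]
\end{verbatim}
\end{small}

Two one-line entries thus stand for four string pairs, and adding a new
noun with the same inflection requires only one more entry.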
1249
1250
1251
1252 \subsubsection{Dictionary types}
1253
1254 In our system there are three types of dictionaries: morphological
1255 (monolingual) dictionaries for each of the languages involved
1256 (Spanish, Catalan and Galician); bilingual dictionaries for the
1257 different translation pairs (Spanish--Catalan and Spanish--Galician),
and post-generation dictionaries for each of the languages (a
post-generation dictionary is not a typical dictionary, with lemmas
and morphological information, but rather a small dictionary of the
orthographic transformations that words may undergo when they come
into contact).  The structure of the three dictionary types is specified
1263 by the same DTD (\emph{Document Type Definition}), which can be found
1264 in Appendix \ref{ss:dtd_dics}.
1265
1266
1267 \textbf{Morphological dictionaries} are used both for building
1268 morphological analysers ---the translation system module used to
1269 obtain all the possible lexical forms for a certain surface form in
1270 the source language --- and morphological generators
1271 ---the module that generates the surface form in the target language
1272 from the lexical form of each word---.  These two modules are obtained
1273 from a single morphological dictionary, depending on the direction in
1274 which it is read by the system: read from left to right, we obtain the
1275 analyser, and read from right to left, the generator.
1276
The typical block structure of these dictionaries is the following:
1278
1279 \begin{itemize}
\item \textit{An alphabet definition}.  This definition is used
exclusively for building the morphological analyser; specifically, it
enables the morphological analyser to appropriately tokenize unknown
words and the words in the conditional sections (see the description
of the element \texttt{<section>} on page \pageref{ss:section}); the
morphological generator does not need this definition.
1286
1287 \item \textit{A definition of symbols}.  It consists of a declaration
1288 of the grammatical symbols that will be used in dictionary entries
1289 (you can find in Appendix \ref{se:simbolosmorf} a list with the
1290 grammatical symbols used in this project).
1291 \item \textit{A definition of paradigms}.  Paradigms need to be
1292 defined here in order to be used in the dictionary sections or in other
1293 paradigms.
\item \textit{One or more dictionary sections with conditional
  tokenization}, of type \texttt{standard}. These contain most of the
  words of the dictionary.
\item \textit{One or more dictionary sections with unconditional
  tokenization}.  These contain certain words that follow a regular
  pattern or that are tokenized regardless of the text directly after
  them (see the description of the element \texttt{<section>} on page
  \pageref{ss:section}). In the Catalan morphological dictionaries,
1302   words requiring an unconditional tokenization are distributed in two
1303   sections: one for the forms that require the introduction of a blank
1304   immediately after (due to processing requirements of the lexical
1305   forms), like the apostrophized forms \emph{l'} or \emph{d'}, and
1306   another one for punctuation marks, numbers and other signs.
1307
1308 \end{itemize}
1309
1310 \textbf{Bilingual dictionaries} represent in the system the lexical
1311 transfer process, that is, the assignment of the TL lexical form that
1312 corresponds to each SL lexical form. Two \emph{products} are obtained
1313 from each bilingual dictionary, depending on the direction in which it
1314 is read by the system: when the dictionary is read from left to right,
1315 we obtain the lexical transfer module in one translation direction,
1316 and when it is read from right to left, in the other direction. For
1317 the bilingual dictionaries of our project, it has been established
1318 that Spanish will be put always on the left side of the entries, and
1319 the rest of the languages (Catalan and Galician), on the right
1320 side. Thus, for example, the bilingual Spanish--Galician dictionary
1321 will be read from left to right for the translation
1322 \texttt{es}--\texttt{gl} and from right to left for the translation
1323 \texttt{gl}--\texttt{es}.  In applications like the ones in this
project, these dictionaries do not have paradigms: they are built with
1325 generic entries which almost always have no more information than
1326 lemma and part of speech, and there is no inflection information.
1327
1328 The block structure used in the bilingual dictionaries of this project
1329 is the following:
1330
1331 \begin{itemize}
1332 \item \textit{A definition of symbols}.  It consists of a declaration
1333 of the grammatical symbols that will be used in dictionary entries.
1334 \item \textit{A single dictionary section}.  Where bilingual
1335 correspondences are specified.
1336 \end{itemize}
1337
1338 Since 2007, bilingual dictionaries allow the specification of more
1339 than one TL translation, so that a lexical selection module (see
1340 Section \ref{se:seleccio_lex}) can choose the most suitable equivalent
1341 according to the context. To that end, an attribute has been added to
1342 bilingual dictionaries. You can find its description in section
1343 \ref{dic_lextor}.
1344
1345
1346 \textbf{Post-generation dictionaries} are used to perform some
1347 transformations (orthographic changes, contractions, apostrophation,
1348 etc.) required after surface forms in the target language have been
generated and come into contact with each other.  Since operations of
this kind can be expressed as a translation of character strings, the
same type of dictionary is used for them. It is
1352 implicitly assumed that the parts of the text whose processing has not
1353 been specified are copied just as they arrive. In these dictionaries,
1354 the definition of paradigms is useful to express systematic changes in
1355 the word contact phenomena. Unlike the other dictionary types, these
1356 do not include grammatical symbols, since they process surface forms.
1357
1358 The block structure of post-generation dictionaries is the following:
1359 \begin{itemize}
1360
1361 \item \textit{A definition of paradigms}. To use in entries.
1362 \item \textit{A dictionary section}.  Where the patterns for
1363 post-generation operations are specified.
1364 \end{itemize}
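
The behaviour of a post-generation dictionary can be approximated with
the following Python sketch.  It deliberately swaps the letter
transducer for a regular-expression substitution, so it is only an
illustration of the input and output strings involved: the table
\texttt{POSTGEN} holds two well-known Spanish contractions chosen as
examples, not data from the project, and the function name
\texttt{postgenerate} is hypothetical.

\begin{small}
\begin{verbatim}
import re

# Illustrative string pairs (Spanish contractions), not project data.
POSTGEN = {"de el": "del", "a el": "al"}

_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, POSTGEN)) + r")\b")


def postgenerate(text):
    """Replace every listed sequence; all other text is copied as is."""
    return _PATTERN.sub(lambda match: POSTGEN[match.group(1)], text)
\end{verbatim}
\end{small}

As in the dictionaries described above, text for which no pattern has
been specified is copied to the output unchanged.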
1365
1366
1367 The following table contains an overview of the possible reading
1368 directions of dictionaries and their application to the Romance
1369 languages in this project:
1370
1371 \begin{center}
1372  \begin{tabular}{|l|l|l|}
1373 \hline
1374 Dictionary & Reading direction & Function \\
1375 \hline
1376 Morphological & left--right & analysis for \texttt{es}, \texttt{ca} and \texttt{gl}\\
1377               & right--left & generation for \texttt{es}, \texttt{ca} and \texttt{gl}\\\hline
1378 Bilingual     & left--right & translation for \texttt{es-ca} and \texttt{es-gl}\\
1379               & right--left & translation for \texttt{ca-es} and \texttt{gl-es}\\\hline
1380 Post-generation & left--right & post-generation for \texttt{ca}, \texttt{es} and \texttt{gl}\\\hline
1381
1382 \end{tabular}
1383 \end{center}
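
The two reading directions can be illustrated with a small Python
sketch that builds a lookup table from the same list of string pairs,
read either left to right or right to left.  The function
\texttt{compile\_direction}, the direction codes \texttt{lr} and
\texttt{rl} and the sample pairs are hypothetical; the real system
compiles the dictionaries into letter transducers rather than tables.

\begin{small}
\begin{verbatim}
from collections import defaultdict

# Illustrative (surface form, lexical form) pairs.
PAIRS = [
    ("casa", "casa<n><f><sg>"),
    ("casas", "casa<n><f><pl>"),
    ("casa", "casar<vblex><pri><p3><sg>"),
]


def compile_direction(pairs, direction="lr"):
    """Map each input string to the sorted list of its outputs.

    'lr' reads the pairs left to right, 'rl' right to left.
    """
    if direction not in ("lr", "rl"):
        raise ValueError("direction must be 'lr' or 'rl'")
    table = defaultdict(list)
    for left, right in pairs:
        source, target = (left, right) if direction == "lr" else (right, left)
        table[source].append(target)
    return {key: sorted(values) for key, values in table.items()}


ANALYSER = compile_direction(PAIRS, "lr")
GENERATOR = compile_direction(PAIRS, "rl")
\end{verbatim}
\end{small}

Read left to right, the ambiguous surface form \texttt{casa} receives
two lexical forms; read right to left, each lexical form yields a
single surface form.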
1384
1385
1386
1387 \subsubsection{Description of the dictionary format}
\label{formatodics} This section presents the main elements of the
format in which dictionaries are built. The formal definition (a DTD)
can be found in Appendix~\ref{ss:dtd_dics}.  Section \ref{dic_lextor}
describes the characteristics of a bilingual dictionary that works in
an Apertium system with a lexical selection module.  Finally, pages
\pageref{ss:morfgen} to \pageref{ss:postgen} describe the
particularities of the entries for the three dictionary types
(morphological, bilingual and post-generation).
1397
1398
1399
1400 \paragraph{Element for dictionary \texttt{<dictionary>}}
1401
1402 This is the root element and includes the whole dictionary.  It
1403 contains an alphabetic character definition, a definition of symbols
1404 (which are the morphological tags for the words), a definition of
1405 inflection paradigms and one or more dictionary sections, which
1406 contain the entries for the lexical forms (consisting of pairs made of
1407 surface form--lexical form). Figure \ref{fig:dictionary} shows the
1408 basic block structure of a generic dictionary.
1409
1410 \begin{figure}
1411 \begin{small}
1412 \begin{alltt}
1413 <?\textbf{xml} \textsl{version}="1.0" \textsl{encoding}="iso-8859-15"?>
1414 <\textbf{dictionary}>
1415   <\textbf{alphabet}>abcdefghijk ... ABCDEFGH ... çñáéíóú</\textbf{alphabet}>
1416   <\textbf{sdefs}>
1417     <!-- ... -->
1418   </\textbf{sdefs}>
1419   <\textbf{pardefs}>
1420     <!-- ... -->
1421   </\textbf{pardefs}>
1422   <\textbf{section} ...>
1423     <!-- ... -->
1424   </\textbf{section}>
1425   <!-- ... -->
1426 </\textbf{dictionary}>
1427 \end{alltt}
1428 \end{small}
1429 \caption{Use of the elements \texttt{<\textbf{dictionary}>} and
1430 \texttt{<\textbf{alphabet}>}}
1431 \label{fig:dictionary}
1432 \end{figure}
1433
1434
1435 \paragraph{Element for alphabet \texttt{<alphabet>}}
1436
1437 It is used to specify a definition of alphabetic characters.  The
1438 purpose of this specification is to enable the modules that process the
1439 input by means of letter transducers to tokenize it into individual
1440 words.\nota{Parlar dels mots desconeguts. Cita \ref{ss:section} -
1441 Mikel?}
1442
1443 In the present design, the definition of an alphabet only makes sense in
1444 morphological dictionaries, since it is needed for the
1445 analysis. Figure \ref{fig:dictionary} shows an example of the use of this
1446 element.
1447
1448
1449 \paragraph{Element for symbol definition section \texttt{<sdefs>}} It
1450 groups all the symbol definitions in a dictionary
1451 (\texttt{<\textbf{sdef}>}).  There is an example of its use in Figure
1452 \ref{fig:sdefs}.
1453
1454 \paragraph{Element for symbol definition \texttt{<sdef>}}
1455
1456 It is an empty element (it does not delimit any content): it is used
1457 to specify, through the values of the attribute \texttt{\textsl{n}},
1458 the names of the grammatical symbols that are used in the dictionary
1459 to morphologically label lexical forms. In Figure \ref{fig:sdefs} you
1460 can find a use example for this element. Refer to Appendix
1461 \ref{se:simbolosmorf} if you need a list with all the grammatical
1462 symbols used in the dictionaries of this project.
1463
1464 \begin{figure}
1465 \begin{small}
1466 \begin{alltt}
1467 <\textbf{sdefs}>
1468   <\textbf{sdef} \textsl{n}="n"/>
1469   <\textbf{sdef} \textsl{n}="det"/>
1470   <\textbf{sdef} \textsl{n}="sg"/>
1471   <\textbf{sdef} \textsl{n}="pl"/>
1472   <!-- ... -->
1473 </\textbf{sdefs}>
1474 \end{alltt}
1475 \end{small}
1476 \caption{Use of the element \texttt{<\textbf{sdefs}>}}
1477 \label{fig:sdefs}
1478 \end{figure}
1479
1480 \paragraph{Element for dictionary section \texttt{<section>}}
1481 \label{ss:section}
1482
1483 It contains the words that will be recognised by the dictionary.  The
1484 reason for dividing a dictionary into sections is that some forms (for
1485 example, the ones coming from the identification of certain regular
1486 patterns, or some forms that pertain to a specific dialect) may need
1487 different processing.
1488
1489 One of the problems that the definition of sections in a dictionary
1490 helps to solve is tokenization during morphological
1491 analysis.  Most forms are tokenized following a conditional
1492 criterion: a form is accepted only if it is followed by
1493 a non-alphabetic character (that is, one not defined in
1494 \texttt{<\textbf{alphabet}>}). However, there are other forms, like
1495 the Catalan apostrophized words \emph{l'} or \emph{d'}, that need an
1496 unconditional tokenization model: there is no need to analyse what
1497 comes after them, since, if it is an alphabetic character, it will
1498 belong to the \textit{next} word. The forms that require unconditional
1499 tokenization are included in a specific section of the
1500 dictionary. Other kinds of processing can also be solved through these
1501 divisions.
1502
1503
1504
1505 The value of the attribute \texttt{\textsl{type}} is used to express
1506 the kind of string tokenization applied in each dictionary section.
1507 The possible values of this attribute are \texttt{standard}, for
1508 almost all the forms of the dictionary (conditional mode);
1509 \texttt{postblank}, for the forms that require unconditional
1510 tokenization and the placing of a blank; and \texttt{inconditional},
1511 for the remaining forms that require unconditional tokenization.
1512
1513 The attribute \texttt{\textsl{id}} is used to assign an identifier (a
1514 name) to the dictionary sections.
1515
1516 \begin{figure}
1517 \begin{small}
1518 \begin{alltt}
1519 <\textbf{section} \textsl{id}="principal" \textsl{type}="standard">
1520 <!-- ... -->
1521 </\textbf{section}>
1522 <\textbf{section} \textsl{id}="patterns" \textsl{type}="inconditional">
1523 <!-- ... -->
1524 </\textbf{section}>
1525 \end{alltt}
1526 \end{small}
1527 \caption{Use of the element \texttt{<\textbf{section}>}}
1528 \label{fig:section}
1529 \end{figure}
1530
1531 \paragraph{Element for entries \texttt{<e>}}
1532
1533 An entry is the basic unit of a dictionary or of a paradigm
1534 definition.  Entries consist of a concatenation in any order of string
1535 pairs \texttt{<\textbf{p}>}, identity transductions
1536 \texttt{<\textbf{i}>}, references to paradigms \texttt{<\textbf{par}>}
1537 or regular expressions \texttt{<\textbf{re}>}.  The structure and
1538 meaning of these elements are explained later in this section (on
1539 pages~\pageref{ss:p}, \pageref{ss:i}, \pageref{ss:par} and \pageref{ss:re},
1540 respectively).
1541
1542 \label{restric}Two optional attributes can be used with this element.  The
1543 first one is \texttt{\textsl{r}} (for \textit{restriction}), which
1544 specifies whether the entry has to be considered only when reading the
1545 dictionary from left to right (\texttt{LR}) or only when reading it from
1546 right to left (\texttt{RL}).  If nothing is specified, it is assumed
1547 that the entry must be considered in both directions.
1548
1549 In morphological dictionaries, the restriction \texttt{LR} means that
1550 an LF is analysed but not generated (for example, when the LF belongs
1551 to a dialectal variant that we wish to recognise but not to generate)
1552 and the restriction \texttt{RL} means that a word is generated but not
1553 analysed (needed, for example, for forms with a post-generator
1554 activation mark; see page \pageref{ss:a} for more details).
1555
1556 In bilingual dictionaries, the restrictions \texttt{LR} and
1557 \texttt{RL} mean that the translation is done in one direction only:
1558 for example, in a bilingual \texttt{es}--\texttt{ca} dictionary,
1559 \texttt{LR} indicates that the LF is only translated from Spanish to
1560 Catalan, and \texttt{RL} only from Catalan to Spanish. To
1561 illustrate this with an example: the Spanish adverbs \emph{aún} and
1562 \emph{todavía} ("still") are both translated into Catalan as the same word,
1563 \emph{encara}. The Catalan adverb \emph{encara} can be translated
1564 into Spanish as only one of the two words (there is no difference in meaning);
1565 we decide to translate it as \emph{todavía}. In this case, we have to
1566 write two entries in the bilingual dictionary: the entry that matches
1567 \emph{aún} with \emph{encara} needs to have the restriction
1568 \texttt{LR} (translation only from \texttt{es} to \texttt{ca}) and the
1569 one that matches \emph{todavía} with \emph{encara} does not need to
1570 have any restriction (translation in both directions).
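
The two entries just described could be written roughly as follows. This
is only an illustrative sketch: the use of the symbol \texttt{adv} and
the exact content of the pairs are assumptions made for this
explanation, not a copy of the entries in the project's
\texttt{es}--\texttt{ca} dictionary.

\begin{small}
\begin{alltt}
<!-- Spanish "aún" to Catalan "encara": only from es to ca -->
<\textbf{e} \textsl{r}="LR">
  <\textbf{p}>
    <\textbf{l}>aún<\textbf{s} \textsl{n}="adv"/></\textbf{l}>
    <\textbf{r}>encara<\textbf{s} \textsl{n}="adv"/></\textbf{r}>
  </\textbf{p}>
</\textbf{e}>
<!-- "todavía" and "encara": no restriction, both directions -->
<\textbf{e}>
  <\textbf{p}>
    <\textbf{l}>todavía<\textbf{s} \textsl{n}="adv"/></\textbf{l}>
    <\textbf{r}>encara<\textbf{s} \textsl{n}="adv"/></\textbf{r}>
  </\textbf{p}>
</\textbf{e}>
\end{alltt}
\end{small}

With entries of this kind, both Spanish adverbs would be translated as
\emph{encara}, whereas \emph{encara} would be translated only as
\emph{todavía}, because the first entry is ignored when the dictionary
is read from right to left.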
1571
1572 Direction restrictions are also necessary in bilingual dictionaries
1573 when we have words with gender to be determined ("GD") or number to be
1574 determined ("ND") (see page~\pageref{ss:bil} for more
1575 information).
1576
1577 The other optional attribute in entries is the lemma name
1578 \texttt{\textsl{lm}}. Because paradigms are used to represent
1579 the inflection regularities of lexical units, an entry in
1580 morphological dictionaries contains the part of the lemma that is
1581 common to all the inflected forms, that is, it contains the lemma cut
1582 at the point at which the paradigm regularity begins (for example, the
1583 Spanish adjectives \emph{distinto}, \emph{absoluto} and \emph{marino}
1584 appear in entries as \emph{distint}, \emph{absolut} and \emph{marin},
1585 since the rest of the inflected forms is common to all of them and
1586 specified in a paradigm).  This fact can make the dictionary difficult
1587 to understand.  Therefore entries have this attribute, which contains
1588 the whole lemma of the lexical form, so that the dictionary becomes
1589 more understandable and linguists can solve problems quickly.  In
1590 bilingual dictionaries, which normally do not have references to
1591 paradigms,\footnote{They could have references to paradigms, but we
1592 did not judge it necessary for the languages involved \nota{atenció:
1593 ex--, vice--?}.} this attribute is not used.
1594
1595
1596 \paragraph{Element for string pair \texttt{<p>}}
1597 \label{ss:p}
1598
1599 This basic element of dictionaries is used in any kind of entry to
1600 indicate the correspondence between two strings; this
1601 correspondence specifies a lexical transformation that will be carried
1602 out by a state path in the resulting finite-state transducer
1603 \cite{garrido99j}.
1604
1605 It is defined by a pair of internal elements: the left element
1606 (\texttt{<\textbf{l}>}) and the right element (\texttt{<\textbf{r}>}).
1607 Its structure is shown in Figure \ref{fig:p}.
1608
1609 \begin{figure}
1610 \begin{small}
1611 \begin{alltt}
1612 <\textbf{p}>
1613   <\textbf{l}><!-- ... --></\textbf{l}>
1614   <\textbf{r}><!-- ... --></\textbf{r}>
1615 </\textbf{p}>
1616 \end{alltt}
1617 \end{small}
1618 \caption{Use of the element \texttt{<\textbf{p}>}}
1619 \label{fig:p}
1620 \end{figure}
1621
1622 A pair \texttt{<\textbf{p}>} must include these two parts, although one
1623 can be empty, which means deleting (or inserting) a string. The
1624 elements \texttt{<\textbf{l}>} and \texttt{<\textbf{r}>} have the same
1625 internal structure and the same requirements.  They can contain text and
1626 references to grammatical symbols (which, for the languages of the
1627 present project, inflected by suffixation, are usually placed at the
1628 end, in any number).  Inside a string pair, nothing may appear outside
1629 the elements \texttt{<\textbf{l}>} and \texttt{<\textbf{r}>}.
1630
1631
1632 \paragraph{Element for reference to symbol \texttt{<s>}}
1633
1634 References to symbols (or tags) are used to specify the morphological
1635 information of an LF and can be used anywhere inside a string pair,
1636 that is, inside the elements \texttt{<\textbf{l}>} and
1637 \texttt{<\textbf{r}>}, as if they were individual characters; for the
1638 languages of our project, however, they are put at the end of the
1639 pairs and always in the same order for the same word type.  This order
1640 is decided by the linguist according to how he/she wishes to
1641 characterise the LF morphologically in the dictionaries, and must be
1642 the same in all the dictionaries of a system for the
1643 lexical and structural transfer operations to work correctly. So, for
1644 example, in the Romance language dictionaries of this project, a noun
1645 has in the first place the symbol for part of speech (\textit{n},
1646 noun), then for gender (\textit{m}, masculine, \textit{f}, feminine,
1647 \textit{mf}, masculine--feminine), and finally for number
1648 (\textit{sg}, singular, \textit{pl}, plural, \textit{sp},
1649 singular--plural).  The list in Appendix \ref{se:simbolosmorf}
1650 contains all the grammatical symbols used in the dictionaries of this
1651 project and shows the order which has been established for each type
1652 of word.
1653
1654 In morphological dictionaries, references to symbols are used in
1655 paradigms as well as in entries which do not include any reference to
1656 a paradigm. In bilingual dictionaries, usually only the first symbol
1657 of each LF is specified, since the rest is automatically copied from
1658 the source language LF to the target language LF (provided that they are
1659 identical in both languages).
1660
1661 To specify which symbol we are referring to, we use the (mandatory)
1662 attribute \texttt{\textsl{n}}.  The symbol must be defined in the
1663 symbol definition section (\texttt{<\textbf{sdefs}>}).
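
As a sketch of the symbol order described above for nouns (part of
speech, then gender, then number), an entry that does not refer to any
paradigm might look as follows. The choice of the Spanish noun
\emph{crisis} (feminine, with the same form in singular and plural) is
an assumption made for illustration; the entry is not taken from the
project's dictionaries.

\begin{small}
\begin{alltt}
<\textbf{e} \textsl{lm}="crisis">
  <\textbf{p}>
    <\textbf{l}>crisis</\textbf{l}>
    <\textbf{r}>crisis<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="f"/><\textbf{s} \textsl{n}="sp"/></\textbf{r}>
  </\textbf{p}>
</\textbf{e}>
\end{alltt}
\end{small}

For such an entry to be valid, the symbols \texttt{n}, \texttt{f} and
\texttt{sp} would each need a corresponding \texttt{<\textbf{sdef}>} in
the symbol definition section.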
1664
1665
1666
1667
1668 \paragraph{Element for identity transduction \texttt{<i>}}
1669 \label{ss:i}
1670
1671 It is a way to write a string pair in which left side and right side
1672 are identical.  For example, the two entries shown in Figure
1673 \ref{fig:i} are completely equivalent. The advantage of writing
1674 entries with this element is that the result is more compact and more
1675 readable.
1676
1677 \begin{figure}
1678 \begin{small}
1679 \begin{alltt}
1680 [1]
1681
1682 <\textbf{e} \textsl{lm}="perro">
1683   <\textbf{p}>
1684     <\textbf{l}>perr</\textbf{l}><\textbf{r}>perr</\textbf{r}>
1685   </\textbf{p}>
1686   <\textbf{par} \textsl{n}="abuel_o__n"/>
1687 </\textbf{e}>
1688
1689 [2]
1690
1691 <\textbf{e} \textsl{lm}="perro">
1692   <\textbf{i}>perr</\textbf{i}>
1693   <\textbf{par} \textsl{n}="abuel_o__n"/>
1694 </\textbf{e}>
1695 \end{alltt}
1696 \end{small}
1697 \caption{Use of the element \texttt{<\textbf{i}>}: entries [1] and [2]
1698 are equivalent}
1699 \label{fig:i}
1700 \end{figure}
1701
1702
1703
1704 \paragraph{Element for paradigm definition section \texttt{<pardefs>}}
1705
1706 This element includes all the paradigm definitions of a dictionary,
1707 each definition in an element \texttt{<\textbf{pardef}>}, as shown in
1708 Figure \ref{fig:pardefs}.
1709
1710 \begin{figure}
1711 \begin{small}
1712 \begin{alltt}
1713 <\textbf{pardefs}>
1714   <\textbf{pardef} \textsl{n}="abuel_o__n">
1715     <!-- ... -->
1716   </\textbf{pardef}>
1717   <!-- ... -->
1718 </\textbf{pardefs}>
1719 \end{alltt}
1720 \end{small}
1721 \caption{Use of the element \texttt{<\textbf{pardefs}>}}
1722 \label{fig:pardefs}
1723 \end{figure}
1724
1725
1726
1727 \paragraph{Element for paradigm definition \texttt{<pardef>}}
1728
1729
1730 It defines an inflection paradigm in the dictionary.  A paradigm can
1731 be understood as a small dictionary of alternative transformations
1732 that can be concatenated to parts of words (or to entries of another
1733 paradigm) to specify regularities in the lexical processing of the
1734 dictionary entries, such as inflection regularities. To specify these
1735 regularities, each paradigm is a list of entries \texttt{<\textbf{e}>}
1736 like the ones in the dictionary, that is, it has the same structure as
1737 a dictionary section \texttt{<\textbf{section}>}; therefore, paradigm
1738 entries consist of a pair (\texttt{<\textbf{p}>}) with left side
1739 (\texttt{<\textbf{l}>}) and right side (\texttt{<\textbf{r}>}). These
1740 elements can contain text or grammatical symbols
1741 \texttt{<\textbf{s}>}.
1742
1743
1744 As in symbol definitions, paradigm definitions have an attribute
1745 \texttt{\textsl{n}} which specifies the paradigm name, so that it can
1746 be referred to inside dictionary entries. In a dictionary entry,
1747 therefore, one only needs to indicate the corresponding paradigm name
1748 for all its possible forms to be specified.
1749
1750 The paradigm definition outlined in Figure
1751 \ref{fig:pardefs} is shown in full in Figure \ref{fig:pardef}.  The
1752 following table shows the information expressed by the paradigm:
1753
1754 \begin{center}
1755  \begin{tabular}{|l|c|l|}
1756 \hline
1757 Root (SF and LF) & Ending (SF) & Analysis (LF) \\
1758 \hline
1759 \texttt{abuel} & \texttt{o} &\texttt{o<n><m><sg>}\\
1760 \texttt{abuel} & \texttt{a} &\texttt{o<n><f><sg>}\\
1761 \texttt{abuel} & \texttt{os} &\texttt{o<n><m><pl>}\\
1762 \texttt{abuel} & \texttt{as} &\texttt{o<n><f><pl>}\\
1763 \hline
1764 \end{tabular}
1765 \end{center}
1766
1767
1768 \begin{figure}
1769 \begin{small}
1770 \begin{alltt}
1771 <\textbf{pardef} \textsl{n}="abuel_o__n">
1772   <\textbf{e}>
1773     <\textbf{p}>
1774       <\textbf{l}>o</\textbf{l}>
1775       <\textbf{r}>o<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="m"/><\textbf{s} \textsl{n}="sg"/></\textbf{r}>
1776     </\textbf{p}>
1777   </\textbf{e}>
1778   <\textbf{e}>
1779     <\textbf{p}>
1780       <\textbf{l}>a</\textbf{l}>
1781       <\textbf{r}>o<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="f"/><\textbf{s} \textsl{n}="sg"/></\textbf{r}>
1782     </\textbf{p}>
1783   </\textbf{e}>
1784   <\textbf{e}>
1785     <\textbf{p}>
1786       <\textbf{l}>os</\textbf{l}>
1787       <\textbf{r}>o<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="m"/><\textbf{s} \textsl{n}="pl"/></\textbf{r}>
1788     </\textbf{p}>
1789   </\textbf{e}>
1790   <\textbf{e}>
1791     <\textbf{p}>
1792       <\textbf{l}>as</\textbf{l}>
1793       <\textbf{r}>o<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="f"/><\textbf{s} \textsl{n}="pl"/></\textbf{r}>
1794     </\textbf{p}>
1795   </\textbf{e}>
1796 </\textbf{pardef}>
1797 \end{alltt}
1798 \end{small}
1799 \caption{Use of the element \texttt{<\textbf{pardef}>} to define the
1800   inflective morphology of Spanish nouns with four endings, such as
1801   \emph{abuelo, -a, -os, -as} ("grandfather, grandmother") }
1802 \label{fig:pardef}
1803 \end{figure}
1804
1805 This paradigm is assigned to all Spanish nouns (\texttt{n}) that
1806 inflect like \emph{abuelo}, such as \emph{alumno}, \emph{amigo} or
1807 \emph{gato}, and is designed to be used as a \textit{suffix} in
1808 dictionary entries.  In general, paradigms can be applied at any
1809 position of a dictionary entry (if it makes sense, of course).  We can
1810 think of paradigms as transducers that are inserted at the point where
1811 they are specified.  Figure \ref{fig:pardef2} shows an example of a paradigm
1812 defined to be used as a prefix. It is the paradigm used to analyse and
1813 generate Spanish words beginning with \emph{ex}, \emph{ex-}, etc.,
1814 like \emph{ex-presidente}, \emph{exministro}, \emph{ex director},
1815 etc., with all the orthographic variations (\emph{ex} with a hyphen,
1816 joined without a hyphen, or followed by a blank
1817 \texttt{<\textbf{b}/>}, see page~\pageref{s3:b}); the output lemma simply
1818 adds \emph{ex}, with neither hyphen nor blank, to the accompanying lemma. The
1819 direction restrictions (\texttt{"LR"}) that appear in the example are
1820 used to determine which form the translator will generate. The empty
1821 identity transduction (\texttt{<\textbf{i}/>}) is necessary in this
1822 case to analyse and generate the word without the prefix \emph{ex}.
1823
1824 \begin{figure}
1825 \begin{small}
1826 \begin{alltt}
1827 <\textbf{pardef} \textsl{n}="ex">
1828   <\textbf{e} \textsl{r}="LR"><\textbf{p}><\textbf{l}>ex<\textbf{b}/></\textbf{l}><\textbf{r}>ex</\textbf{r}></\textbf{p}></\textbf{e}>
1829   <\textbf{e}><\textbf{i}>ex</\textbf{i}></\textbf{e}>
1830   <\textbf{e} \textsl{r}="LR"><\textbf{p}><\textbf{l}>ex-</\textbf{l}><\textbf{r}>ex</\textbf{r}></\textbf{p}></\textbf{e}>
1831   <\textbf{e}><\textbf{i}/></\textbf{e}>
1832 </\textbf{pardef}>
1833 \end{alltt}
1834 \end{small}
1835 \caption{Use of the element \texttt{<\textbf{pardef}>} in the paradigm
1836   for the prefix \emph{ex}.}
1837 \label{fig:pardef2}
1838 \end{figure}
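
As a hypothetical illustration of a paradigm used in prefix position, an
entry could combine the paradigm of Figure \ref{fig:pardef2} with the
suffix paradigm of Figure \ref{fig:pardef}. The entry below is only a
sketch written for this explanation (it is an assumption that
\emph{ministro} is handled this way, and the entry is not taken from the
project's dictionaries):

\begin{small}
\begin{alltt}
<\textbf{e} \textsl{lm}="ministro">
  <\textbf{par} \textsl{n}="ex"/>
  <\textbf{i}>ministr</\textbf{i}>
  <\textbf{par} \textsl{n}="abuel_o__n"/>
</\textbf{e}>
\end{alltt}
\end{small}

Under these assumptions, the surface forms \emph{ex ministro},
\emph{ex-ministro} and \emph{exministro} would all be analysed as
\texttt{exministro<n><m><sg>}, while only \emph{exministro} would be
generated from that lexical form; \emph{ministro} on its own would still
be analysed and generated through the empty identity transduction.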
1839
1840
1841 Entries in a paradigm can contain references to other paradigms,
1842 provided that these have been defined earlier in the file.  On the other
1843 hand, for the moment a paradigm definition cannot include itself,
1844 either directly or indirectly.
1845
1846 Paradigms are used in morphological dictionaries for the analysis and
1847 generation of lexical forms. For the language pairs of this project,
1848 there is no need to define paradigms in bilingual dictionaries (see
1849 page~\pageref{ss:bil}).
1850
1851 From Apertium 2 on, there is a new type of paradigm, called
1852 metaparadigm, that allows the definition of paradigms with variations
1853 according to the value of an attribute specified in each entry that
1854 refers to that paradigm. Section \ref{ss:metaparadigmas} describes the
1855 characteristics and use of metaparadigms.
1856
1857
1858
1859 \paragraph{Element for reference to a paradigm \texttt{<par>}}
1860 \label{ss:par}
1861
1862 It is used inside an entry to indicate which inflection paradigm,
1863 among the ones defined in \texttt{<\textbf{pardefs}>}, the entry
1864 follows. Thanks to the references to paradigms there is no need to write
1865 all the inflected forms of a lemma in a morphological dictionary
1866 entry.  The attribute \texttt{\textsl{n}} is used to specify the name
1867 of the paradigm we want to refer to.
1868
1869 The result of inserting a reference to a paradigm in an entry is the
1870 creation of as many string pairs as there are cases specified in the
1871 paradigm. For example, the entry in Figure \ref{fig:par}, with a
1872 reference to the paradigm "\texttt{abuel\_o\_\_n}" (defined in Figure
1873 \ref{fig:pardef}), is equivalent to an entry where each string pair of
1874 the paradigm is concatenated to the lemma (that is, an entry with
1875 every inflected form of the lemma), as shown in Figure
1876 \ref{fig:lema_par}. In this figure, you can see that the paradigm
1877 always delivers in the right string (\texttt{<\textbf{r}>}) the lemma
1878 (\emph{perro}) with the grammatical symbols that apply to the surface
1879 form, since it is from the lemma that transfer operations are carried
1880 out.
1881
1882
1883 The appropriate use of paradigms, besides enabling the creation of
1884 compact dictionaries, improves compilation speed and reduces memory
1885 requirements during this process, since in compilation it is possible,
1886 in most cases, to create a single data structure for each paradigm
1887 \cite{ortiz05j}.
1888
1889 \begin{figure}
1890 \begin{small}
1891 \begin{alltt}
1892 <\textbf{e} \textsl{lm}="perro">
1893   <\textbf{i}>perr</\textbf{i}>
1894   <\textbf{par} \textsl{n}="abuel_o__n"/>
1895 </\textbf{e}>
1896 \end{alltt}
1897 \end{small}
1898 \caption{Use of the element \texttt{<\textbf{par}>}}
1899 \label{fig:par}
1900 \end{figure}
1901
1902 \begin{figure}
1903 \begin{small}
1904 \begin{alltt}
1905  <\textbf{e}>
1906    <\textbf{p}>
1907      <\textbf{l}>perro</\textbf{l}>
1908      <\textbf{r}>perro<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="m"/><\textbf{s} \textsl{n}="sg"/></\textbf{r}>
1909    </\textbf{p}>
1910  </\textbf{e}>
1911  <\textbf{e}>
1912    <\textbf{p}>
1913      <\textbf{l}>perra</\textbf{l}>
1914      <\textbf{r}>perro<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="f"/><\textbf{s} \textsl{n}="sg"/></\textbf{r}>
1915    </\textbf{p}>
1916  </\textbf{e}>
1917  <\textbf{e}>
1918    <\textbf{p}>
1919      <\textbf{l}>perros</\textbf{l}>
1920      <\textbf{r}>perro<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="m"/><\textbf{s} \textsl{n}="pl"/></\textbf{r}>
1921    </\textbf{p}>
1922  </\textbf{e}>
1923  <\textbf{e}>
1924    <\textbf{p}>
1925      <\textbf{l}>perras</\textbf{l}>
1926      <\textbf{r}>perro<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="f"/><\textbf{s} \textsl{n}="pl"/></\textbf{r}>
1927    </\textbf{p}>
1928  </\textbf{e}>
1929 \end{alltt}
1930 \end{small}
1931 \caption{Entry equivalent to the one in Figure \ref{fig:par}, that
1932   shows the result of inserting the reference to paradigm
1933   \texttt{<\textbf{par}>} with the paradigm defined in Figure
1934   \ref{fig:pardef}.}
1935 \label{fig:lema_par}
1936 \end{figure}
1937
1938
1939
1940 \paragraph{Element for regular expression \texttt{<re>}}
1941 \label{ss:re}
1942
1943 In natural languages too there are patterns that can be described by
1944 regular expressions: for example, punctuation marks, numbers (Arabic or
1945 Roman), e-mail or web page addresses, or any kind of code identifiable
1946 through these mechanisms.
1947
1948
1949 For these cases we use the string contained in the element
1950 \texttt{<\textbf{re}>}.  The compiler reads the regular expression
1951 definition and transforms it into a transducer that is inserted in the
1952 rest of the dictionary and that translates all the strings that match
1953 the expression into identical strings.
1954
1955 The present implementation of these regular expressions
1956 supports a subset of Unix regular expressions, which includes the
1957 operators \texttt{*}, \texttt{?}, \texttt{|} and \texttt{+}, as well
1958 as grouping with parentheses and character ranges, for
1959 example \texttt{[a-zA-Zñú]}, or their negated versions, like
1960 \verb![^a-z]!.
1961
1962 By analogy, they can be seen as \texttt{<\textbf{i}>} elements, with
1963 the difference that they can match a set of strings that may be infinite
1964 (like numbers).
1965
1966 \begin{figure}
1967 \begin{small}
1968 \begin{alltt}
1969 <\textbf{e}>
1970   <\textbf{re}>[0-9]+([.,][0-9]+)?(\%)?</\textbf{re}>
1971   <\textbf{p}><\textbf{l}/><\textbf{r}><\textbf{s} \textsl{n}="num"/></\textbf{r}></\textbf{p}>
1972 </\textbf{e}>
1973 \end{alltt}
1974 \end{small}
1975 \caption{Use of the element \texttt{<\textbf{re}>} in an entry for the
1976 detection of Arabic numerals.}
1977 \label{fig:e}
1978 \end{figure}
1979
1980 Figure \ref{fig:e} shows the way to tag quantities expressed as Arabic
1981 numerals in the dictionary.
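
As a worked illustration of the entry in Figure \ref{fig:e}, the
following table lists a few strings that match its regular expression,
together with the lexical form that would result if the entry behaves as
described above (the matched string is copied unchanged and the pair
with an empty left side appends the symbol \texttt{num}). The sample
strings are chosen only for this explanation.

\begin{center}
 \begin{tabular}{|l|l|}
\hline
Surface form & Lexical form \\
\hline
\texttt{1999} & \texttt{1999<num>}\\
\texttt{3,14} & \texttt{3,14<num>}\\
\texttt{50\%} & \texttt{50\%<num>}\\
\hline
\end{tabular}
\end{center}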
1982
1983
1984 \paragraph{Element for blank block \texttt{<b>}}
1985 \label{s3:b}
1986
1987 It is used to express the presence of blanks between the words of a
1988 multiword (see page~\pageref{ss:multipalabras} for an explanation of
1989 multiwords). It can be inserted in the \texttt{<\textbf{i}>},
1990 \texttt{<\textbf{l}>} and \texttt{<\textbf{r}>} elements.  In Figure
1991 \ref{fig:b} you can see the entry for the Spanish multiword expression
1992 \emph{hoy en día} ("nowadays"): the blanks between words are expressed
1993 as \texttt{<\textbf{b}/>} elements inside the left and right strings.
1994 \begin{figure}
1995 \begin{small}
1996 \begin{alltt}
1997 <\textbf{e} \textsl{lm}="hoy en día">
1998   <\textbf{p}>
1999     <\textbf{l}>hoy<\textbf{b}/>en<\textbf{b}/>día</\textbf{l}>
2000     <\textbf{r}>hoy<\textbf{b}/>en<\textbf{b}/>día<\textbf{s} \textsl{n}="adv"/></\textbf{r}>
2001   </\textbf{p}>
2002 </\textbf{e}>
2003 \end{alltt}
2004 \end{small}
2005 \caption{Use of the element \texttt{<\textbf{b}>}}
2006 \label{fig:b}
2007 \end{figure}
2008
2009 Blanks can consist of normal space characters or of document format
2010 information blocks encapsulated by the de-formatter
2011 (\textit{superblanks}, see Section \ref{ss:formato}).
2012
2013
2014 \paragraph{Element for post-generator activation \texttt{<a>}}
2015 \label{ss:a} The element \texttt{<\textbf{a}>} for the activation of
2016 the post-generator is used to indicate that a word in the target language
2017 may undergo orthographic transformations due to contact with other
2018 words; for example, being apostrophized, contracted, written without
2019 intermediate spaces, etc.  These transformations need to be carried out
2020 after the generation of the target language surface forms, as until
2021 then words are isolated and it is not possible to know which words
2022 will come into contact. Therefore, these operations must be carried out
2023 by the module that follows the generator, which is called the
2024 post-generator. In order to signal which words are to be processed by
2025 the post-generator, this element is used in the surface form side of
2026 these entries in the morphological dictionary.
2027
2028 The example in Figure \ref{fig:a} shows its use, in a Catalan
2029 morphological dictionary, for the preposition \textit{de}, which, when
2030 appearing before a singular or plural masculine definite article
2031 (\textit{el, els}), forms a contraction (\textit{del, dels}).  The
2032 presence of the tag \texttt{<\textbf{a}/>} causes the activation of
2033 the post-generator, which checks whether the preposition is followed
2034 by one of the words that cause it to contract and, if so, performs
2035 the contraction (see page~\pageref{ss:postgen} for more details). The
2036 restriction \texttt{RL} indicates that this is a generation-only
2037 entry, since it does not make any sense for the analysis.
2038
2039 \begin{figure}
2040 \begin{small}
2041 \begin{alltt}
2042 <\textbf{e} \textsl{r}="RL" \textsl{lm}="de">
2043    <\textbf{p}>
2044       <\textbf{l}><\textbf{a}/>de</\textbf{l}>
2045       <\textbf{r}>de<\textbf{s} \textsl{n}="pr"/></\textbf{r}>
2046    </\textbf{p}>
2047 </\textbf{e}>
2048 \end{alltt}
2049 \end{small}
2050 \caption{Use of the element \texttt{<\textbf{a}>} in a morphological
2051 dictionary}
2052 \label{fig:a}
2053 \end{figure}
2054
2055
2056
2057 \paragraph{Element for group marking \texttt{<g>}}
2058
2059 This element is used, inside the \texttt{<\textbf{l}>} and
2060 \texttt{<\textbf{r}>} elements, to define groups that require a
2061 special treatment beyond the normal word by word processing. It is
2062 used in inflective multiwords to signal the beginning and the end of
2063 the group of invariable lexical forms (one or more) that are adjacent to the
2064 inflected word and that, together with it, build an inseparable
2065 unit. In Section~\ref{ss:multipalabras} you will find a detailed
2066 explanation of the different multiword types, and in Figure
2067 \ref{fig:hacertilin} of that section you can see an example of its
2068 use.
2069
2070
2071
2072 \paragraph{Element for joining of lexical forms \texttt{<j>}}
2073 \label{ss:j}
2074
2075 This element is used only in the right side of an entry
2076 (\texttt{<\textbf{r}>}) to indicate that the words that form a
2077 multiword are treated as individual lexical forms and, therefore, have
2078 a grammatical symbol each. This way, this multiword will be processed
2079 as a unit by the analyser and by the tagger until it reaches the
2080 auxiliary module \texttt{pretransfer} (see Section
2081 \ref{se:pretransfer}), which is responsible for separating the lexical
2082 forms it is made of so that they reach the transfer module as
2083 independent forms. If the linguist wants these forms to reach the
2084 generator as joined forms, building a multiword again, it is necessary
2085 to define a structural transfer rule that groups them in a multiword
2086 (see Section \ref{formatotransfer}). If, on the contrary, these joined
2087 forms are needed only for analysis, the entry must have the
2088 restriction \texttt{LR}.
2089
2090 In Section~\ref{ss:multipalabras} you will find a more detailed
2091 explanation of this element. An example of its use can be found in
2092 Figure \ref{fig:cont} of the mentioned section.
2093
2094 \subsubsection{Modification of bilingual dictionaries for the new
2095 lexical selection module}
2096 \label{dic_lextor}
2097
2098 In 2007, a new module was added to the Apertium system: the
2099 lexical selection module, which is described in Section
2100 \ref{se:seleccio_lex}.
2101
2102 To work in a system with lexical selection, bilingual
2103 dictionaries must be slightly modified so that they allow the
2104 specification of more than one translation in the target language. The
2105 only change is the addition of two new attributes to the element
2106 \texttt{<e>}.  Although these new attributes can be used in all the
2107 dictionaries of a system, they only make sense in a bilingual
2108 dictionary entry.
2109
2110
Appendix~\ref{dixdtd} contains the part of the DTD \texttt{dix.dtd}
\nota{MG: shouldn't the two DTDs be merged into a single one?} that
defines the element \texttt{e} used for dictionary entries. The new
attributes are:
2115 \begin{description}
\item[slr (\emph{sense from left to right})] specifies the
\emph{translation mark} when the lemma on the left side of an entry
has more than one translation from left to right. The attribute
accepts any value; however, the recommended practice is to use the
lemma contained in the right part \texttt{<r>} (the translation of
the lemma).
\item[srl (\emph{sense from right to left})] specifies the
\emph{translation mark} when the lemma on the right side of an entry
has more than one translation from right to left. As before, the
attribute accepts any value, but the recommended practice is to use
the lemma contained in the left part \texttt{<l>} (the translation
of the lemma).
2128 \end{description}
2129
Furthermore, in both cases the value of the attribute can end with a
space followed by the letter ``D'' to indicate that this is the
default translation, that is, the translation chosen when there is not
enough information to make a decision. Whenever a word has more than
one equivalent in the target language, exactly one of the equivalents
must be marked with the letter ``D'' (for \emph{default}).
2137
The following example shows how the new attributes are used. Consider
an English--Catalan bilingual dictionary in which the following words
have more than one translation in the target language:
2141 \begin{itemize}
2142 \item \emph{look}: can be translated into Catalan as \emph{mirar}
2143 (default) or as \emph{semblar} (according to the English senses
2144 \emph{view/seem}),
2145 \item \emph{floor}: can be translated into Catalan as \emph{pis}
2146 (default) or as \emph{terra} (according to the English senses
2147 \emph{level of building/ground}),
2148 \item \emph{pis}: can be translated into English as \emph{flat}
2149 (default) or as \emph{floor}.
2150 \end{itemize}
2151
2152 This information is represented by means of the two attributes
2153 described:\label{entrades_lextor}
2154 \begin{alltt}
2155 \begin{small}
2156 <e srl="flat D">
2157    <p>
2158       <l>flat<s n="n"/></l>
2159       <r>pis<s n="n"/><s n="m"/></r>
   </p>
2161 </e>
2162
2163 <e slr="pis D" srl="floor">
2164    <p>
2165       <l>floor<s n="n"/></l>
2166       <r>pis<s n="n"/><s n="m"/></r>
2167    </p>
2168 </e>
2169
2170 <e slr="terra">
2171    <p>
2172       <l>floor<s n="n"/></l>
2173       <r>terra<s n="n"/><s n="m"/></r>
2174    </p>
2175 </e>
2176
2177 <e slr="mirar D">
2178    <p>
2179       <l>look<s n="vblex"/></l>
2180       <r>mirar<s n="vblex"/></r>
2181    </p>
2182 </e>
2183
2184 <e slr="semblar">
2185    <p>
2186       <l>look<s n="vblex"/></l>
2187       <r>semblar<s n="vblex"/></r>
2188    </p>
2189 </e>
2190 \end{small}
2191 \end{alltt}
2192
2193
2194 %\settocdepth{paragraph}
2195
2196 \subsubsection{Particularities of the different dictionary types}
2197 \label{ss:morfgen}
2198
Dictionary entries have different characteristics depending on the
dictionary type. Although some of these characteristics have been
presented in the previous sections, they are described here in more
detail.
2203
2204
2205 \paragraph{Morphological dictionaries}
2206
In these dictionaries, which are used to generate the system's
morphological analysers and generators, the mark
\texttt{<\textbf{a}/>} must be added to those surface forms which,
once generated, may need orthographic transformations because of
their contact with other words; these transformations are carried out
by the post-generator. Since these marks are needed only in
generation, the entries containing them must be generation-only
entries, which means that they must carry the restriction
\texttt{\textsl{r}=}\verb!"RL"! (from right to left).  Figure
\ref{fig:a} shows an entry containing this element.
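
A pair of entries of this kind might look as follows. This is only a
sketch built around the preposition \emph{de}, which is used in the
post-generation example of Figure~\ref{fg:postgen}; the symbol
\texttt{pr} is taken from Figure~\ref{fig:cont}, and the actual entries
in the Spanish dictionary may differ.

\begin{small}
\begin{alltt}
<!-- analysis only: no activation mark -->
<e lm="de" r="LR">
  <p>
    <l>de</l>
    <r>de<s n="pr"/></r>
  </p>
</e>

<!-- generation only: the mark activates the post-generator -->
<e lm="de" r="RL">
  <p>
    <l><a/>de</l>
    <r>de<s n="pr"/></r>
  </p>
</e>
\end{alltt}
\end{small}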
2216
2217
2218
2219 \paragraph{Bilingual dictionaries}
2220 \label{ss:bil}
2221
As explained before, the bilingual dictionaries of our system do not
use paradigms; they are built with generic entries in which, almost
always, only the part of speech is specified and which carry no
inflection information. For example, in the \texttt{es-ca}
dictionary, the entry for the Spanish words \textit{pan},
\textit{panes} (``bread''), translated into Catalan as \textit{pa},
\textit{pans}, would be as shown in Figure~\ref{fg:pan}.
2229
2230 \begin{figure}
2231 \begin{small}
2232 \begin{alltt}
2233 <\textbf{e}>
2234   <\textbf{p}>
2235     <\textbf{l}>pan<\textbf{s} \textsl{n}="n"/></\textbf{l}>
2236     <\textbf{r}>pa<\textbf{s} \textsl{n}="n"/></\textbf{r}>
2237   </\textbf{p}>
2238 </\textbf{e}>
2239 \end{alltt}
2240 \end{small}
2241 \caption{Bilingual dictionary entry for the translation \emph{pan}
2242 (\texttt{es})--\emph{pa} (\texttt{ca})}
2243 \label{fg:pan}
2244 \end{figure}
2245
2246
As the figure shows, only the first grammatical symbol
\texttt{<\textbf{s} \textsl{n}="\ldots}\texttt{"}\texttt{/>} of each
word is specified, because the symbols that follow the specified ones
are copied from the source lexical form to the target lexical form.
This entry, therefore, works both for \textit{pan} (singular) and for
\textit{panes} (plural): the morphological analyser delivers the lemma
(\emph{pan}) followed by the grammatical symbols that apply to the
analysed surface form (\emph{n m sg} or \emph{n m pl}), and the
symbols that are not specified in the bilingual entry (\emph{m sg} or
\emph{m pl}) are copied to the target language. This holds for both
translation directions.  The idea is to specify only the information
needed to tell the entries apart; the rest is \textit{deduced}
(copied). It is important to bear this in mind because, whenever the
grammatical symbols of a lexical form differ between SL and TL, the
differences must be specified in the bilingual dictionary.  For
example, when the gender or number changes between the source word and
its translation, the grammatical symbols have to be specified in order
(the order in which they appear in the morphological
dictionaries)\footnote{To know which grammatical symbols have been
used in the dictionaries and in which order, see Appendix
\ref{se:simbolosmorf}.} up to and including the symbol that changes
between SL and TL.
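
The copying mechanism can be traced for the entry in
Figure~\ref{fg:pan} as follows; the notation (a lemma followed by its
symbols in angle brackets) is only an informal shorthand for lexical
forms, and only the part matched by the entry is rewritten.

\begin{small}
\begin{alltt}
SL lexical form    part matched by the entry    TL lexical form
pan<n><m><sg>      pan<n> -> pa<n>              pa<n><m><sg>
pan<n><m><pl>      pan<n> -> pa<n>              pa<n><m><pl>
\end{alltt}
\end{small}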
2270
For example, to translate the Spanish word \textit{cama}, a feminine
noun, into the Catalan word \textit{llit}, a masculine noun, the entry
in the bilingual dictionary must be as shown in Figure
\ref{fg:cama}. The gender must be specified (\emph{f}, \emph{m})
because, otherwise, the symbols for gender and number would be copied
from the SL lexical form into the TL lexical form. When translating
from \texttt{es} to \texttt{ca}, we would then obtain the lexical form
\emph{llit} with the symbols \texttt{n f sg} or \texttt{n f pl}. In
both cases, the generator would receive as input a word that it cannot
generate, since the Catalan morphological dictionary does not contain
any entry with lemma \emph{llit} and feminine gender.
2282
2283
2284 \begin{figure}
2285 \begin{small}
2286 \begin{alltt}
2287 <\textbf{e}>
2288   <\textbf{p}>
2289     <\textbf{l}>cama<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="f"/></\textbf{l}>
2290     <\textbf{r}>llit<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="m"/></\textbf{r}>
2291   </\textbf{p}>
2292 </\textbf{e}>
2293 \end{alltt}
2294 \end{small}
2295 \caption{Bilingual dictionary entry for the translation \emph{cama}
2296 (\texttt{es})--\emph{llit} (\texttt{ca})}
2297 \label{fg:cama}
2298 \end{figure}
2299
2300
In this example, the number symbols are not specified; therefore, the
entry works for the correspondence \textit{cama--llit} (singular) as
well as for \textit{camas--llits} (plural).  However, when the number
changes, the gender must be specified as well (even if it does not
change) if the order used for grammatical symbols throughout the
dictionaries is \emph{gender, number}.
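
The effect of specifying the gender can be sketched in an informal
notation (a lemma followed by its symbols in angle brackets). The
second entry is hypothetical and is shown only to illustrate the error
described above.

\begin{small}
\begin{alltt}
entry of Figure \ref{fg:cama} (gender specified):
  cama<n><f><sg>  ->  llit<n><m><sg>
  cama<n><f><pl>  ->  llit<n><m><pl>

hypothetical entry specifying only <n>:
  cama<n><f><sg>  ->  llit<n><f><sg>   (cannot be generated)
  cama<n><f><pl>  ->  llit<n><f><pl>   (cannot be generated)
\end{alltt}
\end{small}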
2307
2308
2309
By means of a direction restriction \texttt{r} we can indicate which
translations are to be made in only one direction (see the description
of the restrictions \texttt{LR} and \texttt{RL} on page
\pageref{restric}).  This is necessary when the correspondence between
two lexical forms is not symmetrical; in such cases, two or more
entries have to be created in the bilingual dictionary and a direction
restriction must be applied, as in the example shown in
Figure~\ref{fg:postre}. In this example, when translating from Spanish
to Catalan (\texttt{LR}), we must generate only plural forms, since
the Catalan word \textit{postres} (``dessert'') has no singular form.
In the opposite direction, the Catalan word is always translated into
the Spanish plural (although the Spanish word has both singular and
plural forms), since it is not possible to determine from the Catalan
word whether the number should be singular or plural.
2325
2326 \begin{figure}[htbp]
2327 \begin{small}
2328 \begin{alltt}
2329 <\textbf{e} \textsl{r}="LR">
2330   <\textbf{p}>
2331     <\textbf{l}>postre<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="m"/><\textbf{s} \textsl{n}="sg"/></\textbf{l}>
2332     <\textbf{r}>postres<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="m"/><\textbf{s} \textsl{n}="pl"/></\textbf{r}>
2333   </\textbf{p}>
2334 </\textbf{e}>
2335
2336 <\textbf{e}>
2337   <\textbf{p}>
2338     <\textbf{l}>postre<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="m"/><\textbf{s} \textsl{n}="pl"/></\textbf{l}>
2339     <\textbf{r}>postres<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="m"/><\textbf{s} \textsl{n}="pl"/></\textbf{r}>
2340   </\textbf{p}>
2341 </\textbf{e}>
2342 \end{alltt}
2343 \end{small}
\caption{Entries in the Spanish--Catalan bilingual dictionary for the
correspondence \emph{postre}--\emph{postres} (``dessert'')}
2346 \label{fg:postre}
2347 \end{figure}
2348
2349
\label{pg:GD} Another problem caused by grammatical divergences
between two languages is solved with the help of two special symbols,
\texttt{GD} (for \textit{gender to be determined}) and \texttt{ND}
(for \textit{number to be determined}), which have to be defined in
the symbol section of the bilingual dictionary. The problem arises
when the grammatical information of an SL lexical form is not enough
to determine the gender (masculine or feminine) or the number
(singular or plural) of the TL lexical form.  For example, the Spanish
adjective \textit{común} (``common'') is both masculine and feminine
(and, therefore, masculine--feminine, \texttt{mf}), but in Catalan the
adjective has different forms for the masculine,
\textit{comú}/\textit{comuns}, and the feminine,
\textit{comuna}/\textit{comunes}.  In the bilingual dictionary, the
entries should be as shown in Figure~\ref{fg:comuna}: in the
\texttt{LR} direction (from Spanish to Catalan), the gender
information is neither \texttt{m}, \texttt{f} nor \texttt{mf} but
\texttt{GD}; this \textit{gender to be determined} is resolved later
by the structural transfer module, by applying the appropriate
transfer rules (usually, rules for the agreement between the lexical
forms in a pattern; see Section~\ref{ss:transfer} for a detailed
description of transfer rules). An analogous mechanism exists for
singular--plural using the symbol \texttt{ND} (for example, in Spanish
\textit{análisis} (``analysis'') is both singular and plural, whereas
in Catalan the singular form is \textit{anàlisi} and the plural form
\textit{anàlisis}).
2375
2376
2377 \begin{figure}[htbp]
2378 \begin{small}
2379 \begin{alltt}
2380 <\textbf{e} \textsl{r}="LR">
2381   <\textbf{p}>
2382     <\textbf{l}>común<\textbf{s} \textsl{n}="adj"/><\textbf{s} \textsl{n}="mf"/></\textbf{l}>
2383     <\textbf{r}>comú<\textbf{s} \textsl{n}="adj"/><\textbf{s} \textsl{n}="GD"/></\textbf{r}>
2384   </\textbf{p}>
2385 </\textbf{e}>
2386
2387 <\textbf{e} \textsl{r}="RL">
2388   <\textbf{p}>
2389     <\textbf{l}>común<\textbf{s} \textsl{n}="adj"/><\textbf{s} \textsl{n}="mf"/></\textbf{l}>
2390     <\textbf{r}>comú<\textbf{s} \textsl{n}="adj"/><\textbf{s} \textsl{n}="m"/></\textbf{r}>
2391   </\textbf{p}>
2392 </\textbf{e}>
2393
2394 <\textbf{e} \textsl{r}="RL">
2395   <\textbf{p}>
2396     <\textbf{l}>común<\textbf{s} \textsl{n}="adj"/><\textbf{s} \textsl{n}="mf"/></\textbf{l}>
2397     <\textbf{r}>comú<\textbf{s} \textsl{n}="adj"/><\textbf{s} \textsl{n}="f"/></\textbf{r}>
2398   </\textbf{p}>
2399 </\textbf{e}>
2400 \end{alltt}
2401 \end{small}
\caption{Entries in the Spanish--Catalan bilingual dictionary for the
  correspondence \emph{común}--\emph{comú} (``common''), the first one
  for the translation from Spanish to Catalan and the other two for
  the translation from Catalan to Spanish}
2406 \label{fg:comuna}
2407
2408
2409 \end{figure}
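
By analogy with Figure~\ref{fg:comuna}, the entries for
\emph{análisis}--\emph{anàlisi} could be sketched as follows. This is
an illustration under stated assumptions rather than a copy of the
actual dictionary: the gender symbols and the symbol \texttt{sp}
(assumed here to mark a form that is both singular and plural, in the
same way as \texttt{mf} does for gender) are not taken from this
section.

\begin{small}
\begin{alltt}
<!-- Spanish to Catalan: number left to be determined -->
<e r="LR">
  <p>
    <l>análisis<s n="n"/><s n="m"/><s n="sp"/></l>
    <r>anàlisi<s n="n"/><s n="f"/><s n="ND"/></r>
  </p>
</e>

<!-- Catalan to Spanish: singular and plural map to the same form -->
<e r="RL">
  <p>
    <l>análisis<s n="n"/><s n="m"/><s n="sp"/></l>
    <r>anàlisi<s n="n"/><s n="f"/><s n="sg"/></r>
  </p>
</e>

<e r="RL">
  <p>
    <l>análisis<s n="n"/><s n="m"/><s n="sp"/></l>
    <r>anàlisi<s n="n"/><s n="f"/><s n="pl"/></r>
  </p>
</e>
\end{alltt}
\end{small}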
2410
2411
2412
2413 \paragraph{Post-generation dictionaries}
2414 \label{ss:postgen}
2415
2416
2417
In the morphological dictionary, the lexical forms which, once
generated, may undergo contraction, apostrophation or other
transformations, depending on which words are in contact with them in
the output text, must have the post-generator activation mark
(\texttt{<\textbf{a}/>}, see page \pageref{ss:a}) in the generation
entry (\texttt{RL} direction).  It is essential that the surface forms
carrying the activation mark are identical in the morphological and
the post-generation dictionaries of the same translator. In the
post-generation dictionary, all entries begin with this activation
mark.
2428
2429
Figure~\ref{fg:postgen} shows an extract of the Spanish
post-generation dictionary; the example shows how \textit{de} and
\textit{el} are contracted to form the word \textit{del}.  The paradigm
\texttt{puntuación}, not defined in the example, contains the
non-alphabetic characters that can appear in a text. The entry for the
preposition \emph{de} has the mark \texttt{<\textbf{a}/>}, and the
paradigm assigned to it, ``\texttt{el}'', is the one defined just
above. According to this entry, when the module receives as input the
left string of the entry (the content of \texttt{<\textbf{l}>})
concatenated with the left string of the paradigm (that is, when the
input is
\texttt{"}\texttt{<a/>\textbf{de}<b/>\textbf{el}<b/>"} or
\texttt{"}\texttt{<a/>\textbf{de}\\<b/>\textbf{el}[puntuación]}\texttt{"}),
it delivers as output the corresponding right strings (the content of
the \texttt{<r>} elements), that is, the string
\texttt{"}\textbf{del}\texttt{"} followed by the blank represented
with \texttt{<b/>} or by the symbol represented with
\texttt{[puntu\-a\-ción]}. Note that, in the module output, all the
marks \texttt{<\textbf{a}/>} have been removed.
2448
2449
2450
2451
2452
2453 \begin{figure}[htbp]
2454 \begin{small}
2455 \begin{alltt}
2456 <\textbf{dictionary}>
2457 <\textbf{pardefs}>
2458   ...
2459   <\textbf{pardef} \textsl{n}="el">
2460     <\textbf{e}>
2461       <\textbf{p}>
2462         <\textbf{l}>el<\textbf{b}/></\textbf{l}>
2463         <\textbf{r}>l<\textbf{b}/></\textbf{r}>
2464       </\textbf{p}>
2465     </\textbf{e}>
2466     <\textbf{e}>
2467       <\textbf{p}>
2468         <\textbf{l}>el</\textbf{l}>
2469         <\textbf{r}>l</\textbf{r}>
2470       </\textbf{p}>
2471       <\textbf{par} \textsl{n}="puntuación"/>
2472     </\textbf{e}>
2473   </\textbf{pardef}>
2474   ...
2475 </\textbf{pardefs}>
2476 <\textbf{section} \textsl{id}="main" \textsl{type}="standard">
2477   ...
2478   <\textbf{e}>
2479     <\textbf{p}>
2480       <\textbf{l}><\textbf{a}/>de<\textbf{b}/></\textbf{l}>
2481       <\textbf{r}>de</\textbf{r}>
2482     </\textbf{p}>
2483     <\textbf{par} \textsl{n}="el"/>
2484   </\textbf{e}>
2485   ...
</\textbf{section}>
</\textbf{dictionary}>
2488 \end{alltt}
2489 \end{small}
\caption{Post-generation dictionary data to perform the contraction
  for Spanish \emph{de} + \emph{el} = \emph{del}.}
2492 \label{fg:postgen}
\end{figure} \nota{in the example, ``el'' should not carry the
activation mark, should it? I have removed it from the example; it
should be removed from the dictionaries as well (Mikel?)}
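
The behaviour of the entry in Figure~\ref{fg:postgen} can be summarised
with a few input--output pairs, written with the same element notation
as the dictionary. The words \emph{libro}, \emph{la} and \emph{casa}
are illustrative, and the comma is assumed to belong to the paradigm
\texttt{puntuación}; the last line assumes that no entry matches
\emph{de} followed by \emph{la}, so only the mark is removed.

\begin{small}
\begin{alltt}
input                         output
<a/>de<b/>el<b/>libro         del<b/>libro
<a/>de<b/>el,                 del,
<a/>de<b/>la<b/>casa          de<b/>la<b/>casa
\end{alltt}
\end{small}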
2496
2497
2498 %\settocdepth{subsubsection}
2499
2500
2501 \subsubsection{Multiword lexical units}
2502 \label{ss:multipalabras}
2503
2504
The dictionary format allows the creation of \textit{multiword
lexical units} (\textit{multiwords} for short) of different kinds,
depending on the problem to be addressed.

In this project we have considered three basic types of multiwords:
2510 \begin{enumerate}
\item The simplest case is that of \textit{multiwords without
  inflection}, which consist of only one lexical form: the lemma is
  made of two or more invariable orthographic words but is tagged as a
  unit.  Figure \ref{fig:msf} shows an example of an invariable
  multiword (the Spanish expression \emph{hoy en día}, ``nowadays''):
  it is made of three words separated by blanks
  (\texttt{<\textbf{b}/>}) and, although it actually consists of an
  adverb, a preposition and a noun, it is tagged as an adverb as a
  whole, since it acts as one.
2519
2520 \begin{figure}
2521 \begin{small}
2522 \begin{alltt}
2523 <\textbf{e} \textsl{lm}="hoy en día">
2524   <\textbf{p}>
2525     <\textbf{l}>hoy<\textbf{b}/>en<\textbf{b}/>día</\textbf{l}>
2526     <\textbf{r}>hoy<\textbf{b}/>en<\textbf{b}/>día<\textbf{s} \textsl{n}="adv"/></\textbf{r}>
2527   </\textbf{p}>
2528 </\textbf{e}>
2529 \end{alltt}
2530 \end{small}
2531 \caption{Example of multiword without inflection in the morphological
2532 dictionary}
2533 \label{fig:msf}
2534 \end{figure}
2535
\item A more complicated case is that of \textit{compound
  multiwords}, made of more than one lexical form, each one with its
  own grammatical symbols. The words they are made of are not
  considered to form a semantic unit as in the previous case; they
  appear together as a unit for contact reasons (phonetic or
  orthographic). This category includes \textit{contractions} and
  \textit{enclitic pronouns} accompanying verbs.  To mark this
  phenomenon we use the element \texttt{<\textbf{j}/>} described on
  page~\pageref{ss:j}. Figure~\ref{fig:cont} shows an example, in
  which the analysis of \emph{del} delivers a lexical multiform made
  of two lexical forms: \emph{de}, preposition, and \emph{el},
  singular masculine definite determiner, linked with the
  \texttt{<\textbf{j}/>} element. The analyser and the part-of-speech
  tagger handle these multiwords as a unit; however, before entering
  the transfer module, they are processed by an auxiliary module
  called \texttt{pretransfer} (see Section~\ref{se:pretransfer}), which
  separates the lexical forms they are made of. This way, they reach
  the transfer module as independent forms; the linguist has to decide
  whether they are joined again (which must be done in the structural
  transfer module) or remain independent forms through the next
  modules.
2558
2559
  In our system, the elements forming a contraction continue as
independent forms, and the post-generator is responsible for making
the contractions in the target language where necessary. Enclitic
pronouns, on the other hand, are joined again to the verb by means of
a structural transfer rule (see Section \ref{ss:transfer}), so the
verb plus its enclitic pronouns enter the generation module as a
single lexical multiform, its components joined with a
\texttt{<\textbf{j}/>}. Therefore, entries containing enclitic
pronouns must not have any direction restriction, as can be seen in
the example in Figure \ref{fig:encl}, which shows a part of the
paradigm for the Spanish verb \emph{dar} (``to give''), specifically
the entry for the infinitive form joined to an enclitic pronoun.
2572
2573
2574 \begin{figure}
2575 \begin{small}
2576 \begin{alltt}
2577 <\textbf{e} \textsl{lm}="del" \textsl{r}="LR">
2578   <\textbf{p}>
2579     <\textbf{l}>del</\textbf{l}>
2580     <\textbf{r}>de<\textbf{s} \textsl{n}="pr"/><\textbf{j}/>
2581      el<\textbf{s} \textsl{n}="det"/><\textbf{s} \textsl{n}="def"/>
2582      <\textbf{s} \textsl{n}="m"/><\textbf{s} \textsl{n}="sg"/></\textbf{r}>
2583   </\textbf{p}>
2584 </\textbf{e}>
2585 \end{alltt}
2586 \end{small}
2587 \caption{Entry in the morphological dictionary for the analysis of a
2588 contraction (the Spanish contraction \emph{del})}
2589 \label{fig:cont}
2590 \end{figure}
2591
2592 \begin{figure}
2593 \begin{small}
2594 \begin{alltt}
2595 <\textbf{e}>
2596   <\textbf{p}>
2597     <\textbf{l}>ar</\textbf{l}>
2598     <\textbf{r}>ar<\textbf{s} \textsl{n}="vblex"/><\textbf{s} \textsl{n}="inf"/><\textbf{j}/></\textbf{r}>
2599   </\textbf{p}>
2600   <\textbf{par} \textsl{n}="S__cantar"/>
2601 </\textbf{e}>
2602 \end{alltt}
2603 \end{small}
\caption{A fragment of the inflection paradigm for the Spanish verb
\emph{dar} (``to give''), which shows the entry for the infinitive form
followed by an enclitic pronoun. Enclitic pronouns are contained in
the paradigm \texttt{S\_\_cantar}. Note that, unlike in Figure
\ref{fig:cont}, this entry is both for analysis and generation.}
2609 \label{fig:encl}
2610 \end{figure}
2611
2612
2613
\item The most complicated case in our system is that of
  \textit{multiwords with inner inflection} inside the lemma (or
  ``split lemma'' forms), like the example shown in Figure
  \ref{fig:echardemenos}. The lemma of this kind of multiword has one
  part with inflection (the \emph{lemma head}) followed by one
  invariable part (the \emph{lemma tail}).  The invariable part has to
  be enclosed in a \texttt{<\textbf{g}>} element, so that it can be
  moved to the position immediately after the lemma head to obtain the
  whole lemma of the multiword. For example, the lemma of the Spanish
  multiwords \emph{echó de menos} (``he/she missed''), \emph{echándole
  de menos} (``missing him/her''), etc.\ has to be \emph{echar de
  menos} (``to miss''), since this is the form that will be looked up
  in the bilingual dictionary to find its translation.  This means that
  the invariable lemma tail (\emph{de menos}) has to be moved
  immediately after the uninflected lemma head (\emph{echar}). This
  movement is performed by the auxiliary module \texttt{pretransfer}
  (see Section~\ref{se:pretransfer}), which runs before the structural
  transfer module.
2632
  To understand the example in Figure \ref{fig:echardemenos}, note
  that the paradigm defining the verb \emph{echar} includes, besides
  the verb inflection, the enclitic pronouns that can appear at the
  end of the inflected forms of the verb; in the output lexical
  multiform, these enclitic pronouns are joined using the empty
  element \texttt{<\textbf{j}/>}.
2639
2640
2641
2642 \begin{figure}
2643 \begin{small}
2644 \begin{alltt}
2645 <\textbf{e} \textsl{lm}="echar de menos">
2646   <\textbf{i}>ech</\textbf{i}>
  <\textbf{par} \textsl{n}="aspir/ar__vblex"/> <!-- it includes enclitic pronouns -->
2648   <\textbf{p}>
2649     <\textbf{l}><\textbf{b}/>de<\textbf{b}/>menos</\textbf{l}>
2650     <\textbf{r}><\textbf{g}><\textbf{b}/>de<\textbf{b}/>menos</\textbf{g}></\textbf{r}>
2651   </\textbf{p}>
2652 </\textbf{e}>
2653 \end{alltt}
2654 \end{small}
2655 \caption{A morphological dictionary entry containing a
2656   \texttt{<\textbf{g}>} group.}
2657 \label{fig:echardemenos}
2658 \label{fig:hacertilin}
2659 \end{figure}
2660
2661
2662
When the translation is also a \emph{split lemma} (for example, the
translation of ``to miss'' in Catalan is \emph{trobar a faltar}, with
forms like \emph{trobem a faltar}, \emph{trobar-lo a faltar}, etc.),
the lemma tail has to be put back in its original place, after the
inflected form plus the enclitic pronouns (if any), and the
correspondence between the invariable parts of the lemma (\emph{de
menos}, \emph{a faltar}) has to be indicated on both sides of the
translation. So, in the example of Figure~\ref{fig:echardemenos}, the
\texttt{<\textbf{g}>} element is used to mark the group
``\texttt{<b/>de<b/>menos}'' in the morphological dictionary, whereas
in the bilingual dictionary (see Figure~\ref{fig:menosfaltar}), the
\texttt{<\textbf{g}>} element is used to establish the correspondence
between the groups ``\texttt{<b/>de<b/>menos}'' and
``\texttt{<b/>a<b/>faltar}''. \nota{And how will the case of
``dirección general'' -- ``direcciones generales'' be handled?}
2678
2679 If the translation is not a \emph{split lemma}, you do not need to
2680 insert any \texttt{<\textbf{g}>} element in the target language
2681 string.
2682
2683 \end{enumerate}
2684
2685 \begin{figure}
2686 \begin{small}
2687 \begin{alltt}
2688 <\textbf{e}>
2689   <\textbf{p}>
2690     <\textbf{l}>echar<\textbf{g}><\textbf{b}/>de<\textbf{b}/>menos</\textbf{g}><\textbf{s} \textsl{n}="vblex"/></\textbf{l}>
2691     <\textbf{r}>trobar<\textbf{g}><\textbf{b}/>a<\textbf{b}/>faltar</\textbf{g}><\textbf{s} \textsl{n}="vblex"/></\textbf{r}>
2692   </\textbf{p}>
2693 </\textbf{e}>
2694 \end{alltt}
2695 \end{small}
2696 \caption{A bilingual dictionary entry containing two corresponding
2697 \texttt{<\textbf{g}>} groups.}
2698 \label{fig:menosfaltar}
2699 \end{figure}
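
The movement of the lemma tail can be sketched as follows for
\emph{echó de menos}. The sketch assumes the usual Apertium stream
notation, in which \verb!#! introduces the invariable group marked
with \texttt{<g>}; the symbols after \texttt{<vblex>} are illustrative
and may not match the ones used in the actual dictionaries.

\begin{small}
\begin{alltt}
analyser output:      echar<vblex><ifi><p3><sg># de menos
pretransfer output:   echar# de menos<vblex><ifi><p3><sg>
bilingual entry:      echar# de menos<vblex> -> trobar# a faltar<vblex>
after lookup:         trobar# a faltar<vblex><ifi><p3><sg>
\end{alltt}
\end{small}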
2700
2701 \subsubsection{Metaparadigms}
2702 \label{ss:metaparadigmas}
2703
2704
\nota{Marco says: specify the DTD?}
2706
2707
When developing the dictionaries for the Occitan translator, we were
faced with a new need: we wanted to be able to specify paradigms for
verbs that share the same inflection pattern but whose root changes in
the different inflected forms.  With the existing paradigm system, a
new paradigm had to be created for each of these verbs, since an
inflection regularity pattern could only be specified for a group of
verbs with an invariable root. With metaparadigms, it is possible to
specify both the inflection regularity and the variations of the verb
root.
2716
At the same time, metaparadigms allow variations in the grammatical
symbols of a lemma to be specified in a single paradigm.  That is,
several lemmas can refer to the same metaparadigm even if they have
different grammatical symbols. Whereas for Occitan metaparadigms have
made it possible to share one paradigm among entries with root
variations, for English they have made it possible to share one
paradigm among entries with variations in their grammatical symbols.
2724
2725
In connection with this, we introduced the concept of a
metadictionary: a dictionary which contains metaparadigms as well as
the normal paradigms used so far. The name of a metadictionary is
\texttt{apertium-PAIR.}$L_1$\texttt{.metadix}
(for example, for the English monolingual dictionary in the
Apertium-en-ca system, \texttt{apertium-en-ca.en.metadix}).  When the
linguistic data are compiled, these dictionaries are pre-processed so
that they have the appropriate format for the dictionary compiler.
2734
2735 \paragraph{Specification of metaparadigms}
2736
Metaparadigms are defined in the \texttt{<\textbf{pardefs}>} section
of the monolingual dictionary, the same section where the rest of the
dictionary paradigms are defined. A metaparadigm, just like a
paradigm, has a name specified in the attribute \texttt{n}.  This name
has the same characteristics as the names of other paradigms, except
that the variable part of the lemma root is written in square brackets
and in capital letters, as in this example:
2744
2745 \begin{alltt}
2746 <\textbf{pardef} n="m/é[T]er\_\_vblex">
2747 \end{alltt}
2748
This is the definition of a verb paradigm in which the root has a
variable part.  The inflection paradigms referenced inside this
metaparadigm must present inflection only in the part to the right of
the brackets, as does, for example, the paradigm referenced by:
2754
2755 \begin{alltt}
2756 <\textbf{par} n="mét/er\_\_vblex"/>
2757 \end{alltt}
2758
2759
Putting this together, a complete example of a metaparadigm definition would be:
2761
2762
2763 \begin{alltt}
2764 <\textbf{pardef} n="m/é[T]er__vblex">
2765   <\textbf{e}>
2766     <\textbf{p}>
2767       <\textbf{l}>e</\textbf{l}>
2768       <\textbf{r}>é</\textbf{r}>
2769     </\textbf{p}>
2770     <\textbf{i}><prm/></\textbf{i}>
2771     <\textbf{par} n="sent/eria__vblex"/>
2772   </\textbf{e}>
2773   <\textbf{e}>
2774     <\textbf{i}>é<prm/></\textbf{i}>
2775     <\textbf{par} n="mét/er__vblex"/>
2776   </\textbf{e}>
2777 </\textbf{pardef}>
2778
2779 \end{alltt}
2780
2781
2782 The tag \texttt{<\textbf{prm}/>} is the marker that is used to place
2783 the variable text part (the root variation) in the paradigm
2784 definition.
2785
2786
Once a metaparadigm is defined, verb entries can use it. To do so, the
verb entry (inside an \texttt{<\textbf{e}>} element) must refer to the
suitable metaparadigm and, through the attribute \texttt{prm}, define
the letters that replace the variable part specified in brackets. For
example:
2792
2793 \begin{alltt}
2794 <\textbf{e} lm="acuélher">
2795   <\textbf{i}>acu</\textbf{i}>
2796   <\textbf{par} n="m/é[T]er__vblex" prm="lh"/>
2797 </\textbf{e}>
2798
2799 \end{alltt}
2800
This entry defines the Occitan verb \emph{acuélher} (``to receive'') and
specifies that its inflection paradigm is the one defined by the
metaparadigm \texttt{m/é[T]er\_\_vblex}, but replacing \texttt{T} with
\texttt{lh}; that is, the letters following \emph{acu} will be
\emph{élher} instead of \emph{éter}.
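
Under the assumption that pre-processing simply substitutes the value
of \texttt{prm} for the variable part, the entry above would behave as
if the following paradigm and entry had been written by hand. The name
of the expanded paradigm is hypothetical; the names actually generated
by the pre-processor may differ.

\begin{alltt}
<pardef n="m/élher\_\_vblex">
  <e>
    <p>
      <l>e</l>
      <r>é</r>
    </p>
    <i>lh</i>
    <par n="sent/eria\_\_vblex"/>
  </e>
  <e>
    <i>élh</i>
    <par n="mét/er\_\_vblex"/>
  </e>
</pardef>

<e lm="acuélher">
  <i>acu</i>
  <par n="m/élher\_\_vblex"/>
</e>
\end{alltt}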
2806
2807
2808
As mentioned before, metaparadigms can also be used for entries which
have some variation in their grammatical symbols. The way to specify
them is basically the same: the variable part is specified in the
entry with the attribute \texttt{sa}, whereas in the paradigm the tag
\texttt{<\textbf{sa}/>} is placed where the optional grammatical
symbol should appear.
2815
2816 For example, we have the following metaparadigm:
2817
2818 \begin{alltt}
2819 <\textbf{pardef} n="house__n">
2820   <\textbf{e}>
2821     <\textbf{p}>
2822       <\textbf{l}/>
2823       <\textbf{r}><\textbf{s} n="n"/><sa/><\textbf{s} n="sg"/></\textbf{r}>
2824     </\textbf{p}>
2825   </\textbf{e}>
2826   <\textbf{e}>
2827     <\textbf{p}>
2828       <\textbf{l}>s</\textbf{l}>
      <\textbf{r}><\textbf{s} n="n"/><sa/><\textbf{s} n="pl"/></\textbf{r}>
2830     </\textbf{p}>
2831   </\textbf{e}>
2832 </\textbf{pardef}>
2833
2834 \end{alltt}
2835
2836
2837 and the following entry:
2838
2839 \begin{alltt}
2840 <\textbf{e} lm="time">
2841   <\textbf{i}>time</\textbf{i}>
2842   <\textbf{par} n="house__n" sa="unc"/>
2843 </\textbf{e}>
2844 \end{alltt}
2845
2846 where \emph{unc} means that the noun is uncountable.
2847
In the metaparadigm, the tag \texttt{<\textbf{sa}/>} marks the place
where the grammatical symbol is inserted when an entry carries the
attribute \texttt{sa} with a value, as happens in the entry for
\emph{time}.
2852
2853
A dictionary which contains entries like the ones described here is
called a metadictionary and must be pre-processed in order to generate a
dictionary that follows the DTD for Apertium 2, since the engine does
not allow the direct use of metaparadigms. The next section describes
how this pre-processing is carried out.
2859
2860
2861
2862
2863 \paragraph{Pre-processing of the metadictionary}
2864
2865
A metadictionary is an XML file to which two XSLT style sheets are
applied, in order to pre-process the metaparadigms and obtain a
dictionary with all the paradigms derived from the metaparadigms.  The
first style sheet, \texttt{buscaPar.xsl}, produces the list of verbs
that use metaparadigms and removes any duplicates among the
metaparadigms to be expanded. This style sheet generates, in
combination with the sheet \texttt{principal.xsl}, a second style
sheet called \texttt{gen.xsl}, which processes the metadictionary with
the list of metaparadigms to be expanded and generates a dictionary in
Apertium 2 format. In essence, this generated style sheet does the following:
2876
2877 \begin{enumerate}
2878
2879
2880 \item In verb entries, if a verb uses a metaparadigm, this
2881 metaparadigm is replaced by the corresponding expanded and
2882 deparametrized paradigm. Thus, the previous example entry:
2883
2884 \begin{alltt}
2885 <\textbf{e} lm="acuélher">
2886   <\textbf{i}>acu</\textbf{i}>
2887   <\textbf{par} n="m/é[T]er__vblex" prm="lh"/>
2888 </\textbf{e}>
2889 \end{alltt}
2890
2891         would be deparametrized and expanded into:
2892
2893 \begin{alltt}
2894 <\textbf{e} lm="acuélher">
2895   <\textbf{i}>acu</\textbf{i}>
2896   <\textbf{par} n="m/élher__vblex"/>
2897 </\textbf{e}>
2898 \end{alltt}
2899
2900
2901 \item On the other hand, since from the first pass the system knows
2902 which paradigms have to be created from metaparadigms, these are
2903 created.  In the previous example, from the metaparadigm:
2904
2905 \begin{alltt}
2906 <\textbf{pardef} n="m/é[T]er__vblex">
2907   <\textbf{e}>
2908     <\textbf{p}>
2909       <\textbf{l}>e</\textbf{l}>
2910       <\textbf{r}>é</\textbf{r}>
2911     </\textbf{p}>
2912     <\textbf{i}><prm/></\textbf{i}>
2913     <\textbf{par} n="sent/eria__vblex"/>
2914   </\textbf{e}>
2915   <\textbf{e}>
2916     <\textbf{i}>é<prm/></\textbf{i}>
2917     <\textbf{par} n="mét/er__vblex"/>
2918   </\textbf{e}>
2919 </\textbf{pardef}>
2920 \end{alltt}
2921
2922         the system would generate the paradigm
2923         \texttt{"m/élher\_\_vblex"} :
2924
2925 \begin{alltt}
2926 <\textbf{pardef} n="m/élher__vblex">
2927   <\textbf{e}>
2928     <\textbf{p}>
2929       <\textbf{l}>e</\textbf{l}>
2930       <\textbf{r}>é</\textbf{r}>
2931     </\textbf{p}>
    <\textbf{i}>lh</\textbf{i}>
2933     <\textbf{par} n="sent/eria__vblex"/>
2934   </\textbf{e}>
2935   <\textbf{e}>
2936     <\textbf{i}>élh</\textbf{i}>
2937     <\textbf{par} n="mét/er__vblex"/>
2938   </\textbf{e}>
2939 </\textbf{pardef}>
2940 \end{alltt}
2941
2942 \end{enumerate}
2943
2944 After the metadictionary has been processed according to these steps,
2945 a .dix dictionary is generated which follows the DTD for Apertium 2
2946 and which can already be compiled.
2947
2948
2949 In the case of our second example, where the variable part was the
2950 sequence of grammatical symbols in the paradigm, the style sheets
2951 would be applied and, from the value \emph{unc} specified in the
2952 attribute \texttt{sa}, the following paradigm would be generated:
2953
2954 \begin{alltt}
2955 <\textbf{pardef} n="house__n__unc">
2956   <\textbf{e}>
2957     <\textbf{p}>
2958       <\textbf{l}/>
2959       <\textbf{r}><\textbf{s} n="n"/><\textbf{s} n="unc"/><\textbf{s} n="sg"/></\textbf{r}>
2960     </\textbf{p}>
2961  </\textbf{e}>
2962  <\textbf{e}>
2963    <\textbf{p}>
2964      <\textbf{l}>s</\textbf{l}>
     <\textbf{r}><\textbf{s} n="n"/><\textbf{s} n="unc"/><\textbf{s} n="pl"/></\textbf{r}>
2966    </\textbf{p}>
2967  </\textbf{e}>
2968 </\textbf{pardef}>
2969
2970 \end{alltt}
2971
for nouns whose morphological analysis should be (in data
stream format):
2974
2975
2976 \begin{alltt}
2977 time<n><unc><sg>
2978 \end{alltt}
2979
In this case, metaparadigms allow the use of the same paradigm for
entries with the same inflection but with a slightly different
morphological analysis.
2983
It is important to note that, when a dictionary uses metaparadigms
and, accordingly, its file name has the extension \texttt{.metadix},
all dictionary changes (adding, changing or deleting entries or
paradigms) must be made in that file: the \texttt{.dix} file is
regenerated from it every time the linguistic data are compiled, so
any changes made directly in the \texttt{.dix} file will be
overwritten during compilation.
2991
2992
2993
2994 \subsection[Automatic generation of the modules]{Automatic generation
2995 of the lexical processing modules}
2996 \label{se:compiladoresdic}
2997
2998
2999 The four lexical processing modules (morphological analyser, lexical
3000 transfer, morphological generator and post-generator) are compiled
3001 from dictionaries by means of a single compiler based
3002 on letter transducers \cite{roche97}. This compiler is much faster
3003 than the ones used in the systems \textsf{interNOSTRUM}
3004 \cite{canals01b,garridoalenda01p,garrido99j} and \textsf{Traductor
3005 Universia} \cite{garrido03p, gilabert03j}, thanks to the use of new
3006 compiler building strategies and the minimization of partial
3007 transducers during the building process \cite{ortiz05j}.
3008
The division of dictionary entries into lemma and paradigm enables the
efficient construction of minimal letter transducers. The compiler
exploits the factorization provided by paradigms in order to
speed up the construction. In particular, it takes advantage of the
fact that, in most European languages, word variations occur at the
end or at the beginning of words.
3016
Paradigms are also minimized before being inserted into the big
transducer, in order to reduce its size before it is minimized in
turn.  Since, before minimization, the paradigms of the
dictionaries for the languages we have dealt with usually have only a
few hundred states, minimizing them is very fast.
3023
Since an entry can contain a reference to a paradigm at any point,
the simplest approach would be to copy, at that point, the transducer
computed for the paradigm definition. The method used in
\emph{Apertium} is based on the observation that copying is not always
necessary, because in certain cases a paradigm that has already been
copied can be reused.  In particular, two or more entries that
share a paradigm as a suffix can reuse the same copy of this paradigm;
the same holds when the paradigm is shared as a prefix. In general,
however, paradigms located in intermediate positions of different
entries cannot be reused: new suffixes (or prefixes) can
be added to existing entries, so the information inserted in
the transducer would no longer be consistent with the dictionary, and
the generated transducer would be incorrect (it would accept string pairs
that are not present in the formal language defined by the dictionaries).
3038
Minimal letter transducers are built as follows. From a string
transduction $(s:t)$ it is possible to build a \textit{sequence of letter
transductions} $S(s:t)$ with length $N = \max(|s|,|t|)$ which is
defined as follows for each position $1 \leq i \leq N$:
3043
3044
3045 \begin{equation}
3046 \label{eq:transletras} S_i(s:t)=\left\{
3047 \begin{array}{ll}
3048 (s_i:\theta) & \textrm{if } i \leq |s| \wedge i > |t| \\
3049 (\theta:t_i) & \textrm{if } i \leq |t| \wedge i > |s| \\
3050 (s_i:t_i) & \textrm{in other cases}
3051 \end{array}\right.
3052 \label{e:montaje}
3053 \end{equation}
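
As a worked illustration of this definition, take two hypothetical
string transductions (the strings are chosen here only for the
example and do not come from any dictionary).  For $s=\textrm{ab}$
and $t=\textrm{xyz}$ we have $N=\max(2,3)=3$ and
\begin{displaymath}
S_1(s:t)=(\textrm{a}:\textrm{x}), \qquad
S_2(s:t)=(\textrm{b}:\textrm{y}), \qquad
S_3(s:t)=(\theta:\textrm{z}),
\end{displaymath}
whereas for $s=\textrm{abc}$ and $t=\textrm{x}$ we again have $N=3$ and
\begin{displaymath}
S_1(s:t)=(\textrm{a}:\textrm{x}), \qquad
S_2(s:t)=(\textrm{b}:\theta), \qquad
S_3(s:t)=(\textrm{c}:\theta).
\end{displaymath}
In both cases the shorter string is padded on the right with $\theta$
until both sides have length $N$.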
3054
It should be emphasized that the construction design forbids the
existence of a $(s:t)$ that is equal to $(\epsilon:\epsilon)$, which
is crucial for the consistency of the building method.
3058
The building method uses two procedures: the \textit{assembly}
procedure derived from Equation~\ref{e:montaje}, and the minimization
procedure, which is carried out by a conventional minimization algorithm
\cite{vandesnepscheut93b} for deterministic finite-state automata.
This algorithm consists of reversing the automaton, determinizing it,
reversing it again and determinizing it again, taking as the alphabet
of the automaton to be minimized the Cartesian product of $L$ with
itself and as the empty transition
$\left(\theta:\theta\right)$.
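
The four steps of this minimization procedure can be written
compactly.  Writing $\mathrm{rev}(A)$ for the automaton obtained from
$A$ by reversing every transition and exchanging initial and final
states, and $\mathrm{det}(A)$ for the result of the subset
construction restricted to reachable states (both names are notation
introduced here for the example, not identifiers taken from the
compiler), the minimal automaton equivalent to $A$ is
\begin{displaymath}
\mathrm{min}(A) = \mathrm{det}(\mathrm{rev}(\mathrm{det}(\mathrm{rev}(A)))).
\end{displaymath}
This double-reversal construction is commonly attributed to
Brzozowski; it needs no separate state-merging step, because the
second determinization already yields the minimal deterministic
automaton.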
3067
3068
3069 \begin{figure}
3070 \begin{center}
3071 \includegraphics[width=10cm]{fig1}
3072 \end{center}
3073 \caption{Building of the dictionary as prefix acceptor and link to
3074 paradigms through transitions $\left(\theta:\theta\right)$.}
3075 \label{fig:construccion}
3076 \end{figure}
3077
3078 \begin{figure}
3079 \begin{center}
3080 \includegraphics[width=8cm]{fig2}
3081 \end{center}
3082 \caption{Minimized paradigm "-es \textbf{n m}" used in Figure
3083 \ref{fig:construccion}.}
3084 \label{fig:paradigmapan}
3085 \end{figure}
3086
3087 \begin{figure}
3088 \begin{center}
3089 \includegraphics[width=8cm]{fig3}
3090 \end{center}
3091 \caption{Minimized paradigm "z/-ces \textbf{n m}" used in Figure
3092 \ref{fig:construccion}.}
3093 \label{fig:paradigmavez}
3094 \end{figure}
3095
3096
3097
Figure \ref{fig:construccion} shows a simplified example of the
assembly process.  Transductions, built as in
Equation~\ref{e:montaje}, are inserted one by one into a transducer
that has the form of a \textit{prefix acceptor} or \textit{trie}, that
is, there is only one node for each common prefix of the set of
transductions that form the dictionary.  New states are created for
the suffixes of the transductions, which are not shared.  At the
point where there is a reference to a paradigm, a copy of this
paradigm is created and linked, by means of a null transduction
$\left(\theta:\theta\right)$, to the dictionary entry that is being
inserted into the transducer.
3109
Each paradigm can be seen as a small dictionary; it is therefore
built by this same procedure and minimized, to reduce
the size of the content inserted when building the big dictionary.
Figures \ref{fig:paradigmapan} and \ref{fig:paradigmavez} show the
paradigms used in Figure \ref{fig:construccion} after their
minimization.
3116
3117
3118
3119 \section{Part-of-speech tagger}
3120 \label{ss:tagger}
3121
3122 \subsection{Module description }
3123 \label{functagger}
3124
3125
3126 The part-of-speech tagger is based on first-order hidden Markov
3127 models~\cite{rabiner89}, that is, on statistical data. The states of
3128 the Markov model represent parts of speech, and the observable
3129 parameters are ambiguity classes~\cite{cutting92a}, formed by groups
3130 of parts of speech.
3131
Although the tagger works with statistical information, its training
and behaviour improve when restrictions are applied
that forbid certain sequences of parts of speech (in first-order
models, these sequences can only include two parts of speech). For
example, in Spanish or Catalan a preposition can never be followed by
a verb in personal form; this restriction is of great help when the
word after a preposition is ambiguous and one of its possible analyses
is a verb in personal form (e.g., \emph{de trabajo}, \emph{en
libertad}, etc.).  Restrictions are explicitly declared in the tagger
definition file, either as \emph{prohibitions} or as
\emph{obligations}.
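
The effect of such a restriction can be sketched with the usual
textbook formulation of a first-order hidden Markov model (the
symbols below are notation introduced here; the way the Apertium
tagger actually stores and applies restrictions may differ).  Given
the observed ambiguity classes $\sigma_1 \ldots \sigma_n$, the tagger
looks for the sequence of parts of speech
\begin{displaymath}
\hat{\gamma}_1 \ldots \hat{\gamma}_n =
\arg\max_{\gamma_1 \ldots \gamma_n}
\prod_{i=1}^{n} P(\gamma_i \mid \gamma_{i-1})\, P(\sigma_i \mid \gamma_i),
\end{displaymath}
where $\gamma_0$ stands for a fixed initial state.  Under this
reading, forbidding the sequence \emph{preposition} followed by
\emph{verb in personal form} amounts to fixing the corresponding
transition probability $P(\gamma_i \mid \gamma_{i-1})$ to zero, so that
any candidate sequence containing that pair receives probability zero
and is never selected.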
3143
3144 The morphological tags which the tagger works with are not the same as
3145 the ones used in the morphological analyser. Usually, the information
3146 delivered by the analyser is too detailed for the purposes of the
3147 part-of-speech disambiguation (for example, for most purposes, it
3148 suffices to group in the same category all common nouns, regardless of
3149 their gender and number). The use of finer-grained tags does not improve the
3150 results, whereas it increases the number of parameters to be estimated
3151 and intensifies the problem of lack of linguistic resources such as
3152 manually disambiguated texts. For this reason, in the tagger file one
3153 has to specify how to group the \emph{fine-grained} tags delivered by the
3154 morphological analyser into more general \emph{coarse} tags ---which
3155 we will call \emph{categories}--- that will be used in the
3156 part-of-speech disambiguation. Apart from coarse categories, one can
also define lexicalized tags. Two types of lexicalization are
described in the literature: one type adds new
observables and the other one, in addition, adds new states to the
Markov model~\cite{pla04}; the tagger in Apertium uses the latter
type.
3162
It is important to note that, in spite of working with \emph{coarse}
categories, the tagger outputs fine-grained tags like the ones delivered by the
morphological analyser. Sometimes the morphological
analyser delivers, for a certain word, two or more fine-grained tags that
are grouped under the same tagger category: e.g., in Spanish
\emph{cante} can be the first or the third person singular of the
present subjunctive of the verb \emph{cantar} ("to sing"); both fine-grained
tags, \texttt{\emph{<vblex><prs><p1><sg>}} and
\texttt{\emph{<vblex><prs><p3><sg>}}, are grouped under the tagger
category \texttt{VLEXSUBJ} (\emph{subjunctive verb}). In this case,
one of the two fine-grained tags is discarded; in the tagger definition file it
is possible to define which fine-grained tag, among the ones that make up a
coarse tag, will be delivered after disambiguation.
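
For the \emph{cante} example, such a preference could be declared
with a fragment like the following.  This is only a sketch based on
the \texttt{preferences} section described in
Section~\ref{formatotagger}; the choice of the third person is
arbitrary and made here purely for illustration.

\begin{alltt}
<\textbf{preferences}>
   <\textbf{prefer} \textsl{tags}="vblex.prs.p3.sg"/>
</\textbf{preferences}>
\end{alltt}

With this declaration, whenever the tagger selects the category
\texttt{VLEXSUBJ} for \emph{cante}, the fine-grained tag delivered
would be \texttt{<vblex><prs><p3><sg>}.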
3176
3177
3178
3179
3180 \subsection{Data for the part-of-speech tagger}
3181 \label{datostagger}
3182 \subsubsection{Introduction}
\label{ss:introtagger} This section describes the format of the files that
specify how to group the \emph{fine-grained} tags delivered by the
morphological analyser into more general \emph{coarse} tags.  In these
files, moreover, one can specify \emph{restrictions} that help in the
estimation of the statistical model underlying the process of lexical
disambiguation, as well as preference rules to be applied when two
fine-grained tags belong to the same category.
3190
3191
3192 The tagger assumes that, in the input stream, lexical forms will be
3193 appropriately delimited, as described in the format specification for
3194 the data stream between modules (Section \ref{se:flujodatos}). In
3195 brief, the format of the data delivered by the morphological analyser
3196 is the following:
3197 \begin{equation}
3198 \label{eq:formaanalizada}
3199   \begin{array}{rcl}
3200   \mbox{analysedform}&\to& \mbox{lexicalmultiform}\;
3201   [\; \mbox{lexicalmultiform} \; ]^*
3202 \\
3203   \mbox{lexicalmultiform}&\to& \mbox{lexicalform}\; [\;\mbox{lexicalform}\; ]^*\;\mbox{lemma-queue?} \\
3204   \mbox{lexicalform}&\to&\mbox{lemma}\;\mbox{finetag}\\
3205   \mbox{lemma-queue}&\to&\mbox{lemma}\\
3206   \mbox{finetag}&\to&\mbox{morphsymbol}\;[\;\mbox{morphsymbol}\;]^* \\
3207   \end{array}
3208 \end{equation}
3209 \label{formaanalizada}
3210
3211 where:
3212
3213
3214 \begin{itemize}
3215 \item \emph{analysedform} is all the information delivered for each
3216 surface form in the output of the morphological analyser
3217 \item \emph{lexicalmultiform} is a sequence of one or more lexical
3218 forms followed, optionally, by an invariable queue as happens in some
3219 multiwords (like the Spanish expression \emph{cántale las cuarenta}).
\item \emph{lexicalform}\footnote{Lexical forms are separated from
each other by a delimiter which corresponds to the \texttt{<j/>}
element (see page \pageref{ss:j}).} is a unit made of one lemma and
one or more grammatical symbols (which compose the fine-grained tag)
carrying the information output by the analyser
3225 \item \emph{lemma-queue} is made of one or more lemmas
3226   \footnote{Separated from each other by the \texttt{<b/>} element
3227   (see page~\pageref{s3:b}).} that are the invariable part of a
3228   multiword. The queue of a multiword is made of the lemma or lemmas
3229   with no inflection that follow the lemmas with inflection. For
3230   example, the Spanish multiword \emph{cantar las cuarenta} ("to
3231   lecture", "to reproach") can take the forms \emph{cántale las
3232   cuarenta}, \emph{(le) cantaré las cuarenta}, \emph{cantándole las
3233   cuarenta}, etc. In this case, the queue would be \emph{las cuarenta}
3234   (see page~\pageref{ss:multipalabras} for more information).
3235
\item \emph{finetag} is made of one or more grammatical symbols
(\emph{morphsymbol}).
3238 \end{itemize}
3239
3240 For example, the entry for the Spanish ambiguous surface form
3241 \emph{correos} would have two lexical multiforms; the first lexical
3242 multiform would have one single lexical form, with lemma \emph{correo}
3243 ("post office") and a fine tag made of the grammatical symbols
3244 \emph{common noun}, \emph{masculine}, \emph{plural}; the second
3245 lexical multiform would be a sequence of two lexical forms, one with
lemma \emph{correr} ("to run") and a fine tag made of the grammatical
3247 symbols \emph{lexical verb}, \emph{imperative}, \emph{second person},
3248 \emph{plural}, and the other one with lemma \emph{vosotros} ("you")
3249 and fine tag made of the grammatical symbols \emph{pronoun},
3250 \emph{enclitic}, \emph{second person}, \emph{masculine-feminine},
3251 \emph{plural}.
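
In the data stream format (Section~\ref{se:flujodatos}), this
analysed form could look roughly as follows.  This is only a sketch:
the symbol names \texttt{imp} and \texttt{mf} and the exact symbol
sequences are assumed here for illustration rather than copied from an
actual dictionary.

\begin{small}
\begin{alltt}
^correos/correo<n><m><pl>/correr<vblex><imp><p2><pl>+vosotros<prn><enc><p2><mf><pl>\$
\end{alltt}
\end{small}

The two lexical multiforms are the two alternatives after the surface
form, and the \texttt{+} sign joins the two lexical forms of the
second one.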
3252
3253 \notavisible{An  explanation of how a word containing more than one
3254   lexical form is treated when no multilexical form is defined for it
3255   should be added}
3256
3257 \subsubsection{Format specification}
3258 \label{formatotagger} The format of the file (encoded in XML) is
3259 specified by the DTD that can be found in
3260 Appendix~\ref{ss:DTD_desambiguador}.
3261
3262
3263 The meaning of the different tags is the following:
3264 \begin{description}
3265 \item[\texttt{tagger}]: is the root element; its mandatory attribute
3266 \texttt{name} is used to specify the name of the tagger generated from
3267 the file.
3268 \item[\texttt{tagset}]: defines the \emph{coarse} tagset or categories
3269 with which the tagger works. Categories are defined by the fine-grained tags
3270 output by the morphological analyser.
3271 \item[\texttt{def-label}]: defines a category or coarse tag (whose
3272   name is specified in the mandatory attribute \texttt{name}) by means
3273   of a list of fine tags defined with one or more \texttt{tags-item}
3274   elements; an optional attribute \texttt{closed} indicates whether
3275   this is a closed category; if this is the case, it is assumed that
3276   an unknown word can never belong to this category.\footnote{Closed
3277   categories are those that do not grow when new words are created:
3278   prepositions, determiners, conjunctions, etc.}
3279
3280   The more specific categories \emph{must} be defined before the more
3281   general ones.  When the definition of a general category implicitly
3282   includes that of a specific category defined before, it is
3283   understood that it refers to all cases \emph{except} the ones
3284   defined by the more specific category.
3285
3286 \item[\texttt{tags-item}]: is used to define a fine-grained tag by means of a
3287 sequence of grammatical symbols. The sequence of grammatical symbols
3288 that make up the fine tag is specified in the mandatory attribute
3289 \texttt{tags}. In this sequence, symbols are separated by a dot, and
3290 the asterisk ``\texttt{*}'' is used to express that any sequence of
3291 symbols may appear in its place. It is also possible to define
3292 lexicalized categories, specifying the lemma of the word in the
3293 attribute \texttt{lemma}.
3294
3295 \item[\texttt{def-mult}]: defines special categories
3296 (\emph{multicategories}) made of more than one category, in order to
3297 deal with entries with more than one lexical form, like in the example
3298 given in the previous section. Each category is defined as a set of
3299 valid sequences (\texttt{sequence}) of previously defined categories
3300 or of fine-grained tags.  It is designed for contractions, verbs with enclitic
3301 pronouns, etc.
3302
3303 \item[\texttt{sequence}]: defines a sequence of elements, which can be
3304 categories (\texttt{label-item}) or fine-grained tags
3305 (\texttt{tags-item}). Using fine-grained tags directly is useful if one wishes
3306 to use a sequence of grammatical symbols that is not part of any
3307 previously defined fine tag \nota{MG: en comptes de 'fine tag' no es
3308 refereix aquí a 'category'?} or that represents a greater
3309 specialization of a defined fine tag \nota{ídem: category}.
3310
3311 \item[\texttt{label-item}]: is used to refer to a category or coarse
3312 tag previously defined, to be specified in the mandatory attribute
3313 \texttt{label}.
3314
\item[\texttt{forbid}]: this (optional) section is used to define
restrictions as sequences of categories (\texttt{label-sequence}) that
cannot occur in the language involved. In the current version, because
the tagger is based on first-order hidden Markov models,
sequences can only be made of \emph{two} \texttt{label-item} elements.
3320
3321 \item[\texttt{label-sequence}]: defines a sequence of categories
3322 (\texttt{label-item}).
3323
3324 \item[\texttt{enforce-rules}]: this (optional) section allows defining
3325 restrictions in the form of obligations.
3326
\item[\texttt{enforce-after}]: defines a restriction stating that
a certain category can only be followed by the categories belonging to
the set defined in \texttt{label-set}. Note that this
kind of restriction is equivalent to defining several forbidden
(\texttt{forbid}) sequences (\texttt{label-sequence}), each made of the
category given in the mandatory attribute \texttt{label} followed by
one of the categories that do not belong to the set defined in
\texttt{label-set}. For this reason, this kind of restriction must be
used very cautiously.
3336
3337 \item[\texttt{label-set}]: defines a set of categories
3338 (\texttt{label-items}).
3339
3340 \item[\texttt{preferences}]: used to define priorities in terms of
3341 which fine-grained tag must be delivered in the tagger output when two or more
3342 fine tags are assigned to the same category.
3343
\item[\texttt{prefer}]: specifies that, in case of conflict between
different fine-grained tags assigned to the same category, the tagger must
output the tag specified in the mandatory attribute \texttt{tags}. If
a category contains more than one of the fine-grained tags listed in these
\texttt{prefer} elements, the one that is listed first is
selected.
3350 \end{description}
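
The ordering rule for \texttt{def-label} and the use of the
\texttt{lemma} attribute can be illustrated with the following
fragment.  It is a sketch only: the category names \texttt{VLEX} and
\texttt{PREPDE} and the grammatical symbol \texttt{pr} are
hypothetical and chosen here for the example.

\begin{alltt}
<\textbf{tagset}>
   <!-- specific category first -->
   <\textbf{def-label} \textsl{name}="VLEXSUBJ">
      <\textbf{tags-item} \textsl{tags}="vblex.prs.*"/>
   </\textbf{def-label}>
   <!-- general category afterwards -->
   <\textbf{def-label} \textsl{name}="VLEX">
      <\textbf{tags-item} \textsl{tags}="vblex.*"/>
   </\textbf{def-label}>
   <!-- lexicalized category for a single lemma -->
   <\textbf{def-label} \textsl{name}="PREPDE" \textsl{closed}="true">
      <\textbf{tags-item} \textsl{lemma}="de" \textsl{tags}="pr"/>
   </\textbf{def-label}>
</\textbf{tagset}>
\end{alltt}

Because \texttt{VLEXSUBJ} is defined first, \texttt{VLEX} is
understood to cover every fine-grained tag beginning with
\texttt{vblex} \emph{except} those already matched by
\texttt{VLEXSUBJ}.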
3351
3352 Figures~\ref{fg:exemple_desambiguador1}
3353 and~\ref{fg:exemple_desambiguador2} contain an example with the most
3354 significant parts of a tagger specification file defined by the DTD
3355 just described.
3356
3357 % DTD moguda a Apèndix
3358
3359
3360 \begin{figure}[htbp]
3361   \begin{small}
3362     \begin{alltt}
3363 <?\textsl{xml} \textsl{version}="1.0" \textsl{encoding}="iso-8859-1"?>
3364 <!\textsl{DOCTYPE} \textbf{tagger} SYSTEM "tagger.dtd">
<\textbf{tagger} \textsl{name}="es-ca">
3366 <\textbf{tagset}>
3367    <\textbf{def-label} \textsl{name}="adv">
3368       <\textbf{tags-item} \textsl{tags}="adv"/>
3369    </\textbf{def-label}>
3370    <\textbf{def-label} \textsl{name}="detnt" \textsl{closed}="true">
3371       <\textbf{tags-item} \textsl{tags}="detnt"/>
3372    </\textbf{def-label}>
3373    <\textbf{def-label} \textsl{name}="detm" \textsl{closed}="true">
3374       <\textbf{tags-item} \textsl{tags}="det.*.m"/>
3375    </\textbf{def-label}>
3376    <\textbf{def-label} \textsl{name}="vlexpfci">
3377       <\textbf{tags-item} \textsl{tags}="vblex.pri"/>
3378       <\textbf{tags-item} \textsl{tags}="vblex.fti"/>
3379       <\textbf{tags-item} \textsl{tags}="vblex.cni"/>
3380    </\textbf{def-label}>
3381    <\textbf{def-mult} \textsl{name}="infserprnenc" \textsl{closed}="true">
3382       <\textbf{sequence}>
3383          <\textbf{label-item} \textsl{label}="vserinf"/>
3384          <\textbf{label-item} \textsl{label}="prnenc"/>
3385       </\textbf{sequence}>
3386       <\textbf{sequence}>
3387          <\textbf{label-item} \textsl{label}="vserinf"/>
3388          <\textbf{label-item} \textsl{label}="prnenc"/>
3389          <\textbf{label-item} \textsl{label}="prnenc"/>
3390       </\textbf{sequence}>
3391    </\textbf{def-mult}>
3392    <\textbf{def-mult} \textsl{name}="prepdet" \textsl{closed}="true">
3393       <\textbf{sequence}>
3394          <\textbf{label-item} \textsl{label}="prep"/>
3395          <\textbf{tags-item} \textsl{tags}="det.def.m.sg"/>
3396       </\textbf{sequence}>
3397    </\textbf{def-mult}>
3398 </\textbf{tagset}>
3399 <!-- ... -->
3400     \end{alltt}
3401   \end{small}
3402   \caption{Example of a tagger definition file (continues in
3403   Figure~\ref{fg:exemple_desambiguador2}).}
3404   \label{fg:exemple_desambiguador1}
3405 \end{figure}
3406
3407
3408 \begin{figure}[htbp]
3409   \begin{small}
3410     \begin{alltt}
3411 <!-- ... -->
3412 <\textbf{forbid}>
3413    <\textbf{label-sequence}>
      <\textbf{label-item} \textsl{label}="prep"/>
      <\textbf{label-item} \textsl{label}="vlexpfci"/>
3416    </\textbf{label-sequence}>
3417    <!-- ... -->
3418 </\textbf{forbid}>
3419 <\textbf{enforce-rules}>
   <\textbf{enforce-after} \textsl{label}="prnpro">
      <\textbf{label-set}>
         <\textbf{label-item} \textsl{label}="prnpro"/>
         <\textbf{label-item} \textsl{label}="vlexpfci"/>
3424          <!-- ... -->
3425       </\textbf{label-set}>
3426    </\textbf{enforce-after}>
3427    <!-- ... -->
3428 </\textbf{enforce-rules}>
3429 <\textbf{preferences}>
3430    <\textbf{prefer} \textsl{tags}="vblex.pii.p3.sg"/>
3431    <\textbf{prefer} \textsl{tags}="vbser.pii.p3.sg"/>
3432    <!-- ... -->
3433 </\textbf{preferences}>
3434 </\textbf{tagger}>
3435     \end{alltt}
3436   \end{small}
3437   \caption{Example of a tagger definition file (comes from
3438   Figure~\ref{fg:exemple_desambiguador1}).}
3439   \label{fg:exemple_desambiguador2}
3440 \end{figure}
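
The equivalence between \texttt{enforce-after} and \texttt{forbid}
can be made concrete with a toy tagset.  Assume, purely for
illustration, that the tagger had only three categories, \texttt{A},
\texttt{B} and \texttt{C} (hypothetical names).  Then the obligation

\begin{alltt}
<\textbf{enforce-rules}>
   <\textbf{enforce-after} \textsl{label}="A">
      <\textbf{label-set}>
         <\textbf{label-item} \textsl{label}="B"/>
      </\textbf{label-set}>
   </\textbf{enforce-after}>
</\textbf{enforce-rules}>
\end{alltt}

would have the same effect as the two prohibitions

\begin{alltt}
<\textbf{forbid}>
   <\textbf{label-sequence}>
      <\textbf{label-item} \textsl{label}="A"/>
      <\textbf{label-item} \textsl{label}="A"/>
   </\textbf{label-sequence}>
   <\textbf{label-sequence}>
      <\textbf{label-item} \textsl{label}="A"/>
      <\textbf{label-item} \textsl{label}="C"/>
   </\textbf{label-sequence}>
</\textbf{forbid}>
\end{alltt}

With a realistic tagset, a single \texttt{enforce-after} rule
therefore stands for a large number of forbidden sequences, which is
why it has to be used with care.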
3441
\subsection{Some questions about the training of the part-of-speech
tagger}

The training of the part-of-speech tagger can be done either in
a supervised manner, using manually disambiguated texts, or in an
unsupervised manner, using ambiguous texts.
3446
When the training is done with ambiguous texts (unsupervised), text
in the required format can be obtained automatically from a plain
text corpus in the chosen language using the system's morphological
analyser; in this case, the format of the text forms is the
one defined in Equation~\ref{eq:formaanalizada2} (described
on page~\pageref{formaanalizada}). As the grammar shows,
each analysed surface form can have more than one analysis (an
\emph{analysedform} can consist of more than one
\emph{lexicalmultiform}).
3456
3457
3458 \begin{equation}
3459 \label{eq:formaanalizada2}
3460   \begin{array}{rcl} \mbox{analysedform}&\to&
3461 \mbox{lexicalmultiform}\; [\; \mbox{lexicalmultiform} \; ]^* \\
3462 \mbox{lexicalmultiform}&\to& \mbox{lexicalform}\;
3463 [\;\mbox{lexicalform}\; ]^*\;\mbox{lemma-queue?} \\
3464 \mbox{lexicalform}&\to&\mbox{lemma}\;\mbox{finetag}\\
3465 \mbox{lemma-queue}&\to&\mbox{lemma}\\
3466 \mbox{finetag}&\to&\mbox{morphsymbol}\;[\;\mbox{morphsymbol}\;]^* \\
3467   \end{array}
3468 \end{equation}
3469 \label{formaanalizada2}
3470
For supervised training, manually disambiguated text is needed. The
format of the text forms in this case is like the format
delivered by the morphological analyser (see
Section~\ref{se:flujodatos}) except that, since the text is already
disambiguated, a surface form can never have more than one lexical
multiform, as shown in Equation~\ref{eq:formadesambiguada} (a
\emph{disambiguatedform} always consists of a single
\emph{lexicalmultiform}).
3479 \begin{equation}
3480 \label{eq:formadesambiguada}
3481   \begin{array}{rcl}
3482   \mbox{disambiguatedform}&\to&\mbox{lexicalmultiform}\\
3483   \mbox{lexicalmultiform}&\to&\mbox{lexicalform}\;[\;\mbox{lexicalform}\;]^*\;\mbox{lemma-queue?}\\
3484   \mbox{lexicalform}&\to&\mbox{lemma}\;\mbox{finetag}\\
3485   \mbox{lemma-queue}&\to&\mbox{lemma}\\
3486   \mbox{finetag}&\to&\mbox{morphsymbol}\;[\;\mbox{morphsymbol}\;]^* \\
3487   \end{array}
3488 \end{equation}
3489
3490
Finally, the dictionary of the language involved is also needed to train
the tagger. This dictionary is used to determine, in combination with
the tagset specification, the different ambiguity classes with which
the tagger will work.
3495
3496 Figure \ref{fig:dependencias} shows the dependency diagram for the
3497 training and the use of the tagger.
3498
3499 \nota{Aquest esquema canviarà amb el nou tagger - Sergio}
3500
3501 \begin{figure}
3502 \begin{center}
3503 \includegraphics[width=15cm]{diagram}
3504 \end{center}
3505 \caption{Dependency diagram for the part-of-speech tagger.}
3506 \label{fig:dependencias}
3507 \end{figure}
3508
3509
3510 \newpage
3511
3512 \section[Transfer pre-processing]{Auxiliary module: transfer
3513 pre-processing module}
3514 \label{se:pretransfer}
\subsection{Justification}

The transfer pre-processing module
\texttt{pretransfer} is in charge of separating compound multiwords
(see page~\pageref{ss:multipalabras}) and moving certain parts of
multiwords with inner inflection or \emph{split lemma} forms.  This
module processes the tagger output and generates an input suitable for
the transfer module.  This processing is necessary for two reasons:
3522
3523 \begin{itemize}
3524 \item So that the transfer module can process these units separately
3525 in order to deal with, for example, the movement of clitic pronouns
3526 when changing from enclitic to proclitic and vice versa.
\item So that the bilingual dictionary only has to store information
about the lemmas to be translated.  If the parts that make up a
multiword were included jointly in the bilingual dictionary, the
dictionary would have to store an entry for each of the different
combinations.  By separating compound multiwords and processing multiwords with
inner inflection, we avoid having
entries that include inflection variations in the bilingual dictionary.
3534 \end{itemize}
3535
3536 \subsection{Behaviour and example}
3537
The program replaces each \texttt{+} in the data stream (which
corresponds to a \texttt{<j/>} element in the dictionary) by an
end-of-word symbol (\texttt{\$}), a blank and a beginning-of-word
symbol (\texttt{\textasciicircum}).  Moreover, if the form is a
multiword with split lemma, the invariable lemma queue is moved to the
position between the lemma of the first word of the multiword and its
first grammatical symbol.
3543
The task of restoring the original order, which is the one accepted by
the generator, is left to the rules of the transfer module, which are
also responsible for creating any compound multiwords required in the
target language.  In general, the generator works with the same
multiwords as the morphological analyser, and with their elements in
the same order; that is why this task has to be done in the transfer
module.
3551
We show below the result of applying this process to the compound
multiword \textit{darlo} (``give it'' in Spanish):
3554
3555 \begin{small}
3556 \begin{alltt}
3557 \$ pretransfer
3558 ^dar<vblex><inf>+lo<prn><enc><p3><m><sg>\$     \(\longleftarrow\) \textrm{input}
3559 ^dar<vblex><inf>\$ ^lo<prn><enc><p3><m><sg>\$     \(\longleftarrow\) \textrm{output}
3560 \end{alltt}
3561 \end{small}
3562
As can be seen, the process simply splits a compound multiword into
its individual lexical forms.

When the input is a multiword with split lemma, the process is as
shown in the following example for the Spanish multiword
\textit{echarte de menos} (``to miss you''):
3569
3570 \begin{small}
3571 \begin{alltt}
3572 \$ pretransfer
3573 ^echar<vblex><inf>+te<prn><enc><p2><m><sg># de menos\$
3574 ^echar# de menos<vblex><inf>\$ ^te<prn><enc><p2><m><sg>\$
3575 \end{alltt}
3576 \end{small}
3577
Here, besides splitting the multiword into lexical forms, the module
moves the invariable lemma queue to the position described above.
Semantic units are preserved after the queue is moved, since
\textit{echar de menos} can be considered a verbal unit with its own
meaning.
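
The behaviour described in this section can be summarised in a few
lines of Python.  This is only an illustrative sketch and not the
actual \texttt{pretransfer} implementation: the function name is
hypothetical, it handles a single lexical unit, and it assumes that
the characters \texttt{+} and \texttt{\#} never occur inside lemmas or
tags.

\begin{small}
\begin{verbatim}
```python
import re


def pretransfer(lexical_unit):
    """Split one lexical unit of the form ^...$ into separate forms."""
    unit = lexical_unit.strip()
    if not (unit.startswith("^") and unit.endswith("$")):
        raise ValueError("expected a lexical unit delimited by ^ and $")
    body = unit[1:-1]

    # The invariable lemma queue, if present, starts at '#' and follows
    # the last grammatical symbol of the unit.
    queue = ""
    match = re.search(r"#[^<>]*$", body)
    if match:
        queue = match.group(0)
        body = body[:match.start()]

    # Each '+' separates two lexical forms of a compound multiword.
    forms = body.split("+")

    # Move the queue between the lemma of the first form and its first tag.
    first = forms[0]
    tag_start = first.find("<")
    if tag_start == -1:
        tag_start = len(first)
    forms[0] = first[:tag_start] + queue + first[tag_start:]

    # End of word, blank, beginning of word between consecutive forms.
    return " ".join("^" + form + "$" for form in forms)
```
\end{verbatim}
\end{small}

Applied to the two inputs used in the examples of this section, the
sketch returns the same outputs as the ones shown there.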
3583
3584
3585
3586
3587 \section{Lexical selection module}
3588 \label{se:seleccio_lex}
3589
3590
3591 \subsection{Introduction}
3592
3593
When the Apertium system is used to translate between languages that
are less closely related than the ones dealt with in the first stages
of the engine, lexical selection becomes significant, because there
are more cases, and more critical ones, in which a source language
word can have more than one translation in the target language. For
this reason we created a new module, the lexical selection module,
which deals with this problem.

Before describing its characteristics, we will see that the problems
of \emph{multiple equivalence} (the existence of more than one
possible target language translation for a source language lexical
form) are tackled in Apertium in two ways.
3606
On the one hand, there is the situation where there is no big
difference in meaning between the multiple equivalents in the target
language, so that choosing one or the other cannot lead to a
translation error. We could say that there is a relation of synonymy
or quasi-synonymy between these equivalents. In such a case, the
linguist chooses one of the lemmas as the translation (generally the
most frequent or usual one), and adds a direction restriction to the
other lemmas (with the values \texttt{LR} or \texttt{RL}) so that they
are translated in the opposite direction but not in the direction
where there are multiple equivalents.
3617
3618
On the other hand, there is the case where there is a clear difference
in meaning between the multiple equivalents, which can lead to
translation errors if an inappropriate lemma is chosen. These are the
cases dealt with by the new lexical selection module. The linguist has
to encode entries with the attributes \texttt{slr} or \texttt{srl}
described in the next section, thus identifying the different
translation options; then the lexical selection module, by means of
statistical methods, chooses the translation that is most suitable in
a given context.
3628
3629
3630
Sometimes it is not easy to decide whether a multiple equivalence
situation should be solved in one way or the other. For example, if
there is a difference in meaning between two or more lemmas in the
target language, but we think that the lexical selection module will
not be able to choose the right translation from the context, we will
follow the first method: choose a fixed translation (the most general
one, suitable in the largest number of situations) and add a direction
restriction to the remaining translations.  In the other cases, we
will encode the entries so that the decision is left to the lexical
selection module.
3641
3642
When we use an Apertium system without a lexical selection module, the
only way to add entries with different possible translations is the
first one, that is, choosing a single translation and marking the
other equivalents with a direction restriction.  If bilingual
dictionaries with multiple translations, encoded with the attributes
\texttt{slr} or \texttt{srl}, are used in a system that does not have
a lexical selection module, a style sheet converts these entries
designed for a lexical selection module into entries with direction
restrictions \texttt{LR} or \texttt{RL}, so that one of the multiple
equivalents (the one chosen as the default entry by the linguist)
becomes the fixed translation of the source language lemma.
3655
3656
3657
As examples of bilingual equivalences that should have a direction
restriction, we can give the translation pairs \texttt{ca-es}
\emph{encara -- aún/todavía} (``still'') and \emph{sobtat --
súbito/repentino} (``sudden''), the first one of which could be encoded
like this:
3663 \begin{alltt}
3664 \begin{small}
3665
3666 <e r="LR">
3667    <p>
3668       <l>aún<s n="adv"/></l>
3669       <r>encara<s n="adv"/></r>
3670    </p>
3671 </e>
3672 <e>
3673     <p>
3674       <l>todavía<s n="adv"/></l>
3675       <r>encara<s n="adv"/></r>
3676     </p>
3677 </e>
3678 \end{small}
3679 \end{alltt}
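
The effect of these restrictions can be illustrated with a short
Python sketch.  The tuple representation of the entries and the
function name are hypothetical simplifications introduced here (real
dictionaries are XML files that are compiled before use); the only
behaviour taken from the text is that an entry marked \texttt{LR} is
used only from left to right and an entry marked \texttt{RL} only from
right to left.

\begin{small}
\begin{verbatim}
```python
# Each entry is (restriction, left lemma, right lemma); the restriction
# is "" (both directions), "LR" or "RL".  Data taken from the entries
# for "encara" shown above.
ENTRIES = [
    ("LR", "aún", "encara"),
    ("", "todavía", "encara"),
]


def build_tables(entries):
    """Return the left-to-right and right-to-left lookup tables."""
    left_to_right, right_to_left = {}, {}
    for restriction, left, right in entries:
        if restriction in ("", "LR"):
            left_to_right[left] = right
        if restriction in ("", "RL"):
            right_to_left[right] = left
    return left_to_right, right_to_left


lr_table, rl_table = build_tables(ENTRIES)
```
\end{verbatim}
\end{small}

With these two entries, both \emph{aún} and \emph{todavía} are
translated as \emph{encara} from left to right, whereas \emph{encara}
is always translated as \emph{todavía} from right to left.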
3680
As examples of the second case (multiple equivalents with a big
difference in meaning) we have the pairs \texttt{es-ca} \emph{hoja --
full/fulla} (``sheet/leaf'') and \emph{muñeca -- nina/canell}
(``doll/wrist''), as well as the \texttt{en-ca} examples shown on
page~\pageref{entrades_lextor}, where it is described how to specify
these multiple equivalents in the bilingual dictionary.
3687
3688
3689
3690
3691 \begin{figure} {\footnotesize \setlength{\tabcolsep}{0.5mm}
3692 \begin{center}
3693 \begin{tabular}{ccccccccc} \\
3694 \parbox{0.95cm}{source language text} \\ $\downarrow$ \\
3695 \framebox{\parbox{1.0cm}{de-for\-matter}} $\rightarrow$ &
3696 \framebox{\parbox{0.6cm}{morph. anal.}}  $\rightarrow$ &
3697 \framebox{\parbox{1.0cm}{POS tagger}} $\rightarrow$ &
3698 \framebox{\parbox{0.6cm}{lex. select.}} $\rightarrow$ &
3699 \framebox{\parbox{0.85cm}{struct. transf.}} $\rightarrow$ &
3700 \framebox{\parbox{0.6cm}{morph. gen.}}  $\rightarrow$ &
3701 \framebox{\parbox{1.2cm}{post\-generator}} $\rightarrow$ &
3702 \framebox{\parbox{1.0cm}{re-for\-matter}} \\ & & & & $\updownarrow$ &
3703 & & $\downarrow$ \\ & & & & \framebox{\parbox{0.8cm}{lex. transf.}} &
3704 & &
3705 \parbox{0.95cm}{target language text} \\
3706 \end{tabular}
3707 \end{center} }
\caption{The nine modules that make up the assembly line in version~2
  of the Apertium machine translation system.}
3710 \label{fig:moduls}
3711 \end{figure}
3712
Figure~\ref{fig:moduls} shows the new assembly line of version~2 of
Apertium.\footnote{This figure replaces Figure~\ref{fg:modules} on
page~\pageref{pg:modules}, which represents version~1 of Apertium.}
\nota{MG: should the figure on page 6 be replaced by this one?} The
module in charge of lexical selection (the lexical selector) runs
after the part-of-speech tagger and before the structural transfer
module; therefore, this new module works only with source language
information.
3721
3722
Section~\ref{se:preprocessament} describes the pre-processing that
must be performed on a bilingual dictionary containing more than one
translation per entry (whether the system uses a lexical selector or
not), and Section~\ref{se:lextor} describes how the lexical selector
works and how it has to be trained.
3728
3729
3730
3731 \subsection{Pre-processing of the bilingual dictionaries
3732 }\label{se:preprocessament}
3733
Bilingual dictionaries have been modified to allow the specification
of more than one translation per entry (see Section~\ref{dic_lextor}
to learn how to write such dictionary entries); this makes it
necessary to pre-process these dictionaries, since the Apertium engine
works with compiled dictionaries in which there is only one possible
translation for each word.

The pre-processing of dictionaries is done automatically during
compilation, so the end user does not need to perform any specific
action.
3744
3745
3746 \subsubsection{Pre-processing without lexical selection module}
3747
When bilingual dictionaries with multiple equivalents are used in a
system where there is no lexical selection module, the pre-processing
is done by applying the style sheet
\texttt{translate-to\--de\-fault\--e\-qui\-va\-lent.xsl}.  This style
sheet turns dictionaries with multiple translations per entry into
dictionaries with only one translation per entry; to do this, it
chooses as translation the entry marked as default, and adds a
direction restriction (\texttt{LR} or \texttt{RL} as applicable) to
the other entries, so that they are only translated in the direction
in which there are no multiple equivalents.  The style sheet is called
from the \texttt{Makefile}.
3759
3760
For example, the result of applying the style sheet to the first
three entries shown on page~\pageref{entrades_lextor} is the
following:
3764
3765 \begin{alltt}
3766 \begin{small}
3767 <e>
3768    <p>
3769       <l>flat<s n="n"/></l>
3770       <r>pis<s n="n"/><s n="m"/></r>
3771    </p>
3772 </e>
3773
3774 <e r="LR">
3775    <p>
3776       <l>floor<s n="n"/></l>
3777       <r>pis<s n="n"/><s n="m"/></r>
3778    </p>
3779 </e>
3780
3781 <e r="RL">
3782    <p>
3783       <l>floor<s n="n"/></l>
3784       <r>terra<s n="n"/><s n="m"/></r>
3785    </p>
3786 </e>
3787 \end{small}
3788 \end{alltt}
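
The decision taken for each entry can be sketched in Python as
follows.  This is not the style sheet itself, which is written in
XSLT: the tuple representation, the function name and the two boolean
default flags are hypothetical, and which entries are marked as
default has been inferred from the output shown above rather than
copied from the original entries.

\begin{small}
\begin{verbatim}
```python
from collections import Counter


def to_default_equivalents(entries):
    """entries: list of (left, right, default_lr, default_rl) tuples.

    An entry is kept in one direction if its source lemma has a single
    translation in that direction or if it is marked as the default.
    """
    left_count = Counter(left for left, _, _, _ in entries)
    right_count = Counter(right for _, right, _, _ in entries)
    result = []
    for left, right, default_lr, default_rl in entries:
        keep_lr = left_count[left] == 1 or default_lr
        keep_rl = right_count[right] == 1 or default_rl
        if keep_lr and keep_rl:
            restriction = ""
        elif keep_lr:
            restriction = "LR"
        elif keep_rl:
            restriction = "RL"
        else:
            continue  # not usable in either direction (sketch only)
        result.append((restriction, left, right))
    return result


ENTRIES = [
    ("flat", "pis", False, True),    # assumed default for "pis"
    ("floor", "pis", True, False),   # assumed default for "floor"
    ("floor", "terra", False, False),
]
converted = to_default_equivalents(ENTRIES)
```
\end{verbatim}
\end{small}

Under these assumptions, \texttt{converted} contains an unrestricted
entry for \emph{flat}/\emph{pis}, an \texttt{LR} entry for
\emph{floor}/\emph{pis} and an \texttt{RL} entry for
\emph{floor}/\emph{terra}, which corresponds to the three entries
above.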
3789
\subsubsection{Pre-processing with lexical selection module}
3791
3792 If the Apertium system works with a lexical selection module, the
3793 bilingual dictionary must be pre-processed in order to obtain:
3794 \begin{itemize}
3795 \item a monolingual dictionary that, for each source language word
3796 (for example \emph{look}) delivers all the possible translation marks
3797 or equivalents (\texttt{look\_\_mirar D} and
3798 \texttt{look\_\_semblar}); this dictionary will be used by the lexical
3799 selection module; and
3800
3801 \item a new bilingual dictionary that, given a word with the lexical
3802 selection already done (for example \texttt{look\_\_semblar}) delivers
3803 the translation (\emph{semblar}); this will be the bilingual
3804 dictionary to be used in the lexical transfer.
3805
3806 \end{itemize}
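
The relation between the two resulting dictionaries can be sketched in
Python.  The representation of the entries as tuples and the function
name are hypothetical; the format of the translation marks
(\texttt{lemma\_\_translation}, followed by \texttt{D} for the default
one) follows the examples given in the list above.

\begin{small}
\begin{verbatim}
```python
def build_lextor_dictionaries(entries):
    """entries: list of (source lemma, target lemma, is_default)."""
    monolingual, bilingual = {}, {}
    for source, target, is_default in entries:
        mark = source + "__" + target
        # Dictionary used by the lexical selector: word -> marks.
        label = mark + " D" if is_default else mark
        monolingual.setdefault(source, []).append(label)
        # Dictionary used by lexical transfer: mark -> translation.
        bilingual[mark] = target
    return monolingual, bilingual


mono, bil = build_lextor_dictionaries([
    ("look", "mirar", True),
    ("look", "semblar", False),
])
```
\end{verbatim}
\end{small}

Here \texttt{mono["look"]} lists the two translation marks of
\emph{look}, and \texttt{bil["look\_\_semblar"]} gives the translation
\emph{semblar}.  How words with a single translation are represented
in the real dictionaries is not covered by this sketch.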
3807
3808
This pre-processing is done automatically during dictionary
compilation by the following programs:
3811 \begin{itemize}
\item \texttt{apertium-gen-lextormono}, which takes three
parameters:
3814   \begin{itemize}
3815   \item the translation direction for which you want to generate the
3816   monolingual dictionary used in the lexical selection; \texttt{lr}
3817   for the translation left to right, and \texttt{rl} for the
3818   translation right to left;
3819   \item the monolingual dictionary to be pre-processed; and
3820   \item the file where the output monolingual dictionary has to be
3821   written.
3822   \end{itemize}
3823
\item \texttt{apertium-gen-lextorbil}, which takes three parameters:
3825   \begin{itemize}
3826   \item the translation direction (\texttt{lr} or \texttt{rl}) for
3827     which you want to generate the bilingual dictionary to be used by
3828     the lexical transfer module;
3829   \item the bilingual dictionary to be pre-processed; and
3830   \item the file where the output bilingual dictionary has to be
3831   written.
3832   \end{itemize}
3833 \end{itemize}
3834
3835 \subsection{Execution of the lexical selection
3836 module}\label{se:lextor}
3837
The module responsible for lexical selection runs after the
part-of-speech tagger and before the structural transfer (see
Figure~\ref{fig:moduls} on page~\pageref{fig:moduls}); therefore, it
uses only information from the source language. However, during the
training of the module, target language information is also used.
3843
3844
3845 \subsubsection{Training}\label{se:entrenament}
3846
To train the lexical selection module, a corpus in the source language
and another one in the target language are required; they do not need
to be related. Both corpora must be pre-processed before training.
This pre-processing, which consists of analysing the corpora and
performing POS disambiguation, can be done with
\texttt{apertium-prepro\-cess\--cor\-pus\--lex\-tor}.

The training of the module that performs lexical selection
consists of the following tasks:\footnote{The training of the models
used for lexical selection has been automated in all the packages
that use it. Furthermore, all the programs mentioned have a UNIX
manual page.}
3859
3860
3861
3862 \begin{enumerate}
\item Obtain the list of words that will be ignored when performing
lexical selection (\emph{stopwords}). This list can be compiled
manually or generated with \texttt{apertium-gen-stopwords-lextor};
3866 \item Obtain the list of (source language) words that have more than
3867 one translation in the target language, using
3868 \texttt{apertium-gen-wlist-lextor};
3869 \item Translate to the target language all the words obtained in the
3870 previous step, using \texttt{apertium-gen-wlist-lextor-translation};
3871 \item Running \texttt{apertium-lextor --trainwrd} and using the target
3872 language pre-processed corpus, train a word co-occurrence model for
3873 the words obtained in the previous step;
3874 \item Running \texttt{apertium-lextor --trainlch} and using the source
3875 language pre-processed corpus, the dictionaries generated by the
3876 programs mentioned in Section~\ref{se:preprocessament} and the word
3877 co-occurrence models calculated in the previous step, train a
3878 co-occurrence model for each of the translation marks of those words
3879 that can have more than one translation in the target language.
3880 \end{enumerate}
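
To give an intuition of what a word co-occurrence model is, the
following Python sketch counts, for each word of interest, the
non-stopwords that appear near it in a corpus.  It is a toy
illustration only: the function name, the window size and the use of
raw counts are assumptions made here, and the models actually trained
by \texttt{apertium-lextor} may be built differently.

\begin{small}
\begin{verbatim}
```python
from collections import Counter


def train_cooccurrence(corpus, targets, stopwords, window=3):
    """Count the words seen within `window` positions of each target.

    corpus: list of sentences, each one a list of lemmas.
    """
    model = {target: Counter() for target in targets}
    for sentence in corpus:
        for i, word in enumerate(sentence):
            if word not in model:
                continue
            start = max(0, i - window)
            end = min(len(sentence), i + window + 1)
            for j in range(start, end):
                context = sentence[j]
                if j != i and context not in stopwords:
                    model[word][context] += 1
    return model


corpus = [["arribar", "a", "la", "ciutat"], ["rebre", "una", "carta"]]
model = train_cooccurrence(
    corpus, targets={"arribar", "rebre"}, stopwords={"a", "la", "una"}
)
```
\end{verbatim}
\end{small}

In this toy corpus, \emph{ciutat} is counted once as a neighbour of
\emph{arribar} and \emph{carta} once as a neighbour of \emph{rebre};
the stopwords are ignored.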
3881
3882 \subsubsection{Use}\label{se:us}
3883
3884 The word co-occurrence models
3885 calculated for each translation mark as described in the previous
3886 section provide the information required to perform lexical selection
3887 with information from the context.
3888
3889 Lexical selection is done by \texttt{apertium-lextor --lextor}; the
3890 formats used to communicate with the rest of the modules of the
3891 translation engine are:
3892
3893 \begin{description}
3894 \item [Input:] text in the same format as the input for the structural
3895 transfer module, that is, text analysed and disambiguated, with
3896 invariable queues of multiwords moved before morphological tags.
3897 \item [Output:] text in the same format, but with the translation mark
3898 to be used when executing lexical transfer.
3899 \end{description}
3900
3901
3902 The following example illustrates the input/output formats used by the
3903 lexical selector (we have assumed in the example that only the English
3904 verb \emph{get} has more than one translation equivalent in the
3905 dictionaries):
3906 \begin{itemize}
3907 \item Source language text (English): \emph{To get to the city centre}
3908 \item Lexical selector input: \verb!^To<pr>$!
3909 \verb!^get<vblex><inf>$! \verb!^to<pr>$! \verb!^the<det><def><sp>$!
3910 \verb!^city<n><sg>$! \verb!^centre<n><sg>$!
3911 \item Translation marks in the en-ca bilingual dictionary for the verb
3912 \emph{get}: \texttt{rebre}, \texttt{agafar}, \texttt{arribar},
3913 \texttt{aconseguir D}
3914 \item Lexical selector output: \verb!^To<pr>$!
3915 \verb!^get__arribar<vblex><inf>$! \verb!^to<pr>$!
3916 \verb!^the<det><def><sp>$!  \verb!^city<n><sg>$!
3917 \verb!^centre<n><sg>$!
3918 \end{itemize}
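
The change of format between input and output can be reproduced with
the following Python sketch.  It does not perform any selection: the
chosen translation is passed in as a plain dictionary, which stands in
for the decision taken by the statistical models, and the function
name and regular expression are hypothetical simplifications.

\begin{small}
\begin{verbatim}
```python
import re

# One lexical unit: ^lemma<tag><tag>...$
LEXICAL_UNIT = re.compile(r"\^([^<$]+)((?:<[^>]+>)*)\$")


def apply_selection(stream, chosen):
    """Attach the chosen translation mark to each ambiguous lemma.

    chosen: maps a source lemma to the selected translation.
    """
    def replace(match):
        lemma, tags = match.group(1), match.group(2)
        if lemma in chosen:
            lemma = lemma + "__" + chosen[lemma]
        return "^" + lemma + tags + "$"

    return LEXICAL_UNIT.sub(replace, stream)


selector_input = (
    "^To<pr>$ ^get<vblex><inf>$ ^to<pr>$ "
    "^the<det><def><sp>$ ^city<n><sg>$ ^centre<n><sg>$"
)
selector_output = apply_selection(selector_input, {"get": "arribar"})
```
\end{verbatim}
\end{small}

With the selection \emph{get} $\rightarrow$ \emph{arribar}, the sketch
produces the lexical selector output shown in the example; all other
lexical units are copied unchanged.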
3919
3920
3921 \newpage
3922 \section{Structural transfer module}
3923 \label{ss:transfer}
3924
3925
\nota{Work to be done (mlf):
  \begin{itemize}
  \item There is considerable inconsistency in the terminology used to
  refer to concepts and in the names used for the programs.
\item I have tried to replace the expression \emph{per defecte} (``by
default'') in each case with a more suitable one; but it will be
necessary to determine which case applies each time.
  \end{itemize}}
3934
3935 \subsection{Introduction}
3936
In 2007, Apertium incorporated a structural transfer system more
advanced than the one used until then; it became necessary when we
started developing machine translators for language pairs less
closely related than the ones dealt with before, such as the
\emph{English}--\emph{Catalan} translator.
3942
This enhanced transfer system consists of three modules, the first one
of which can be used in isolation in order to run a
\textbf{shallow-transfer} system (which is the transfer system used so
far for related language pairs such as \emph{Spanish}--\emph{Catalan} or
\emph{Spanish}--\emph{Galician}).  When the system is used for less
related language pairs and, therefore, \textbf{advanced transfer}
becomes necessary, all three transfer modules are executed.
3950
The two transfer systems differ in the number of passes over the input
text.  The shallow-transfer system makes structural transformations
with a single pass of the rules, which detect sequences or
\emph{patterns} of lexical forms and perform on them the required
checks and changes. The advanced transfer system, on the other hand,
works with a new architecture that makes it possible to detect
\emph{patterns of patterns} of lexical forms in three passes,
performed by its three modules.
3959
3960 We describe next the characteristics of the structural transfer system.  Section
3961 \ref{functransfer} describes the shallow-transfer system and Section
3962 \ref{apertium2}, the advanced transfer system.  The description of the
3963 shallow-transfer system is also applicable to the first module of the
3964 advanced transfer system, with the differences mentioned in that
3965 section.  Section \ref{formatotransfer} describes the format used to
3966 create rules in both systems. In Section \ref{noutransfer} there is a
3967 detailed description of how the three modules of the advanced transfer
3968 system work, and finally, Section \ref{ss:preproceso_transfer}
3969 describes the pre-processing required by the modules.
3970
3971
3972 \subsection{Shallow-transfer}
3973 \label{functransfer}
3974
3975
3976 In this system, only the first of the three modules that compose the
3977 advanced transfer system is used. This module is called
3978 \emph{chunker}.
3979
3980 The design of the language and the compiler used to generate the
3981 structural transfer module is largely based upon the MorphTrans
3982 language described in \cite{garridoalenda01p} and used by the MT
3983 systems \textsf{interNOSTRUM}
3984 \cite{canals01b,garridoalenda01p,garrido99j} (Spanish--Catalan) and
3985 \textsf{Traductor Universia} \cite{garrido03p, gilabert03j}
3986 (Spanish--Portuguese), developed by the Transducens group at the
3987 Universitat d'Alacant.
3988
3989
The transfer process is organized around patterns representing
fixed-length sequences of source language lexical forms (SLLFs) (see
page~\pageref{pg:FSFL} for a description of lexical form (LF)); a
sequence follows a certain pattern if it contains the sequence of
lexical forms of the pattern.  Patterns do not need to be constituents
or phrases in the syntactic sense: they are mere concatenations of
lexical forms that may need joint processing beyond simple
word-for-word translation, because of the grammatical divergences
between SL and TL (gender and number changes, reorderings,
prepositional changes, etc.). The catalogue of patterns defined for a
certain language is selected with a view to covering the most common
structural transformations.  When the source language and the target
language are syntactically similar, as is the case with Spanish,
Catalan and Galician, simple rules based on sequences of lexical
categories achieve a reasonable translation quality.
4005
4006 The transfer module detects, in the SL, sequences of lexical forms
4007 that match one of the patterns previously defined in the pattern
4008 catalogue, and processes them applying the corresponding structural
4009 transfer rule, doing at the same time the lexical transfer by reading
4010 the bilingual dictionary.
4011
The \emph{pattern detection} phase occurs as follows: if the transfer
module starts to process the $i$-th SLLF of the text, $l_i$, it tries
to match the sequence of SLLFs $l_i, l_{i+1}, \ldots$ with all of the
patterns in its pattern catalogue: the longest matching pattern is
chosen, the matching sequence is processed (see below), and processing
continues at SLLF $l_{i+k}$, where $k$ is the length of the pattern
just processed. If no pattern matches the sequence starting at SLLF
$l_i$, it is translated as an isolated word and processing restarts at
SLLF $l_{i+1}$ (when no patterns are applicable, the system resorts to
word-for-word translation). Note that each SLLF is processed only
once: patterns do not overlap; hence, processing occurs left to right
and in distinct ``chunks''.
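
The detection procedure just described can be sketched in Python as a
left-to-right, longest-match segmentation.  In this sketch, which is
not the actual implementation, each lexical form is reduced to its
lexical category and patterns are plain tuples of categories; both
simplifications, as well as the function name, are assumptions made
for illustration.

\begin{small}
\begin{verbatim}
```python
def segment(categories, patterns):
    """Split a sequence of lexical categories by longest match.

    Returns a list of (pattern, matched forms) pairs; the pattern is
    None for forms that are translated as isolated words.
    """
    chunks = []
    i = 0
    while i < len(categories):
        matches = [
            pattern for pattern in patterns
            if tuple(categories[i:i + len(pattern)]) == pattern
        ]
        if matches:
            pattern = max(matches, key=len)   # longest matching pattern
            k = len(pattern)
            chunks.append((pattern, categories[i:i + k]))
            i += k                            # continue at position i + k
        else:
            chunks.append((None, categories[i:i + 1]))
            i += 1                            # isolated word
    return chunks


PATTERNS = [("det", "n"), ("det", "n", "adj")]
chunks = segment(["vblex", "det", "n", "adj", "det", "n"], PATTERNS)
```
\end{verbatim}
\end{small}

In this example the verb is left as an isolated word, the next three
forms match the determiner--noun--adjective pattern (preferred over
the shorter determiner--noun pattern), and the last two forms match
determiner--noun.  Each form belongs to exactly one chunk.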
4024
4025
In the \emph{pattern processing} phase, the system takes the detected
sequence of SLLFs and builds (using a program to consult the bilingual
dictionary) a sequence of TL lexical forms (TLLFs) obtained by
applying the operations described in the rule associated with the
detected pattern (reordering, addition, replacement or deletion of
words, inflection changes, etc.). The information that does not change
is automatically copied from SL to TL. The resulting data, that is,
the lemmas with their associated morphological tags, are sent to the
generator, which creates the inflected forms.
4035
4036
4037
For instance, the Spanish sequence \emph{una señal inequívoca} (``an
unmistakable signal''), which would go from the tagger to the transfer
module in the following format\footnote{The example has been
presented without superblanks containing format information, so that
the linguistic side of the transformation is clearer. See
Chapter~\ref{se:flujodatos}.}:
4044
4045 \begin{alltt}
4046 \begin{small}
4047 \textasciicircum\textbf{uno}<det><ind><f><sg>\$
4048 \textasciicircum\textbf{señal}<n><f><sg>\$
4049 \textasciicircum\textbf{inequívoco}<adj><f><sg>\$
4050 \end{small}
4051 \end{alltt}
4052
4053
\noindent{would be detected as a pattern by a rule for
determiner--noun--adjective.} The transfer module would consult the
bilingual dictionary to get the Catalan equivalents and, since it
would detect a gender change in the word \emph{señal} (its Catalan
translation \emph{senyal} is masculine), it would propagate this
change to the determiner and the adjective to deliver the output
sequence:
4061
4062 \begin{alltt}
4063 \begin{small}
4064 \textasciicircum\textbf{un}<det><ind><m><sg>\$
4065 \textasciicircum\textbf{senyal}<n><m><sg>\$
4066 \textasciicircum\textbf{inequívoc}<adj><m><sg>\$
4067 \end{small}
4068 \end{alltt}
4069
4070 \noindent{which the generation module would turn into the Catalan
4071 inflected sequence: \emph{un senyal inequívoc}.}
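
The propagation of the gender change in this example can be sketched
in Python.  Real transfer rules are written in the XML format
described in Section~\ref{formatotransfer}; the toy bilingual
dictionary, the data layout and the function name below are
hypothetical and serve only to make the agreement step explicit.

\begin{small}
\begin{verbatim}
```python
GENDERS = {"m", "f"}

# Toy es-ca bilingual dictionary: SL lemma -> (TL lemma, TL tags).
BIDIX = {
    "uno": ("un", ["det", "ind"]),
    "señal": ("senyal", ["n", "m"]),
    "inequívoco": ("inequívoc", ["adj"]),
}


def transfer_det_noun_adj(forms, bidix):
    """forms: (lemma, tags) pairs for determiner, noun and adjective."""
    noun_lemma = forms[1][0]
    # Gender of the TL noun as given by the bilingual dictionary.
    tl_gender = next(t for t in bidix[noun_lemma][1] if t in GENDERS)
    output = []
    for lemma, tags in forms:
        tl_lemma = bidix[lemma][0]
        # Propagate the TL noun gender to every form of the pattern.
        tl_tags = [tl_gender if t in GENDERS else t for t in tags]
        output.append(
            "^" + tl_lemma + "".join("<" + t + ">" for t in tl_tags) + "$"
        )
    return " ".join(output)


result = transfer_det_noun_adj(
    [
        ("uno", ["det", "ind", "f", "sg"]),
        ("señal", ["n", "f", "sg"]),
        ("inequívoco", ["adj", "f", "sg"]),
    ],
    BIDIX,
)
```
\end{verbatim}
\end{small}

For the input of the example, \texttt{result} is the output sequence
shown above, with the three lexical forms in the masculine.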
4072
The task of most rules is to ensure gender and number agreement in
simple noun phrases (determi\-ner--noun, determiner--noun--adjective,
determiner--adjective--noun, determiner--adjective, etc.), provided
that there is agreement between the SLLFs of the detected
pattern. These rules are required either because the noun changes its
gender or number between SL and TL (as in the previous example) or
because gender or number in the TL has to be determined, since it was
ambiguous in the SL for some of the words (for example,
the Catalan determiner \emph{cap} can be translated into Spanish as
\emph{ningún} (masc.) or \emph{ninguna} (fem.) depending on the
accompanying noun: \emph{cap cotxe} (\texttt{ca}) $\rightarrow$
\emph{ningún coche} (\texttt{es}) and \emph{cap casa} (\texttt{ca})
$\rightarrow$ \emph{ninguna casa} (\texttt{es})). Furthermore, there
are other rules defined to solve frequent transfer problems between
Spanish, Catalan and Galician, such as, among others:
4088
4089 \begin{itemize}
4090
4091
\item rules to change prepositions in certain constructions: \emph{en
Barcelona} (\texttt{es}) $\rightarrow$ \emph{a Barcelona}
(\texttt{ca}); \emph{consiste en hacer} (\texttt{es}) $\rightarrow$
\emph{consisteix a fer} (\texttt{ca});
4096
4097 \item rules to add/remove the preposition \emph{a} in certain Galician
4098 modal constructions with the verbs \emph{ir} and \emph{vir}: \emph{vai
4099 comprar} (\texttt{gl}) $\rightarrow$ \emph{va a comprar}
4100 (\texttt{es});
4101
4102 \item rules for articles before proper nouns: \emph{ve la Marta}
4103   (\texttt{ca}) $\rightarrow$ \emph{viene Marta} (\texttt{es});
4104
4105 \item lexical rules, for instance, to decide the correct translation
4106 of the adverb \emph{molt} (\texttt{ca}) into Spanish (\emph{muy,
4107 mucho}) or of the adjective \emph{primeiro} (\texttt{gl}) or
4108 \emph{primer} (\texttt{ca}) into Spanish (\emph{primer, primero});
4109
\item rules to move atonic or clitic pronouns, whose position in
Galician is different from that in Spanish (proclitic in Galician and
enclitic in Spanish or vice versa): \emph{envioume} (\texttt{gl})
$\rightarrow$ \emph{me envió} (\texttt{es}); \emph{para nos dicir}
(\texttt{gl}) $\rightarrow$ \emph{para decirnos} (\texttt{es}).
4115
4116 \end{itemize}
4117
4118
4119
\emph{Multiwords} (whose different types are described on
page~\pageref{ss:multipalabras}) are processed in a special way in
this module:
4123
4124 \begin{itemize}
4125 \item \emph{Multiwords without inflection}, made of only one lexical
4126 form, do not need any special processing, since they are treated like
4127 other LFs.
\item In the case of \emph{compound multiwords}, that is, multiwords
formed by more than one \emph{lexical form}, each one with its own
grammatical symbols and joined to each other with the element
\texttt{<j/>} in the dictionary entry (which corresponds to the symbol
\texttt{+} in the data stream), the auxiliary module
\texttt{pretransfer} (see Section~\ref{se:pretransfer}), located
before this module, separates the different lexical forms so that they
reach the transfer module as independent LFs. If we want to join them
again so that they reach the generator as multiwords (as is the case
of enclitic pronouns in our system), it has to be done by means of a
transfer rule, using the \texttt{<\textbf{mlu}>} element (described
later, in Section~\ref{ss:mlu}). On page~\pageref{regla_verbo2} you
can find an example of a rule for joining enclitic pronouns to the
verb.
\item As for \emph{multiwords with inner inflection}, the
\texttt{pre\-trans\-fer} module moves the lemma queue (the invariable
part) to place it after the lemma head (the inflected form), thus
making it possible to find the multiword in the bilingual
dictionary. This kind of multiword must be processed by a structural
transfer rule which puts the lemma queue back in its proper
position. This is done by using, in the output of the rule, the
attributes \texttt{lemh} (``lemma head'') and \texttt{lemq} (``lemma
queue'') of the \texttt{<\textbf{clip}>} element. See
page~\pageref{ss:lu} for a more detailed description of the use of
this element, and page~\pageref{regla_verbo1} to see two rules where
these attributes are used.
4153 \end{itemize}
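
The roles of the lemma head and the lemma queue can be illustrated
with the following Python sketch, which undoes the movement performed
by \texttt{pretransfer} on the example \emph{echarte de menos}.  It is
not a transfer rule: the function names are hypothetical, translation
is left out, and the sketch assumes that the queue starts at the
\texttt{\#} character and includes it.

\begin{small}
\begin{verbatim}
```python
def split_lemma(lemma):
    """Return (lemma head, lemma queue); the queue may be empty."""
    head, separator, rest = lemma.partition("#")
    return head, separator + rest


def join_with_enclitic(verb, clitic):
    """Rebuild one multiword from a verb and an enclitic pronoun.

    verb and clitic are (lemma, tags) pairs; the queue of the verb is
    placed back at the end, after the tags of the last form.
    """
    (verb_lemma, verb_tags), (clitic_lemma, clitic_tags) = verb, clitic
    head, queue = split_lemma(verb_lemma)
    return (
        "^" + head + verb_tags
        + "+" + clitic_lemma + clitic_tags
        + queue + "$"
    )


rebuilt = join_with_enclitic(
    ("echar# de menos", "<vblex><inf>"),
    ("te", "<prn><enc><p2><m><sg>"),
)
```
\end{verbatim}
\end{small}

The value of \texttt{rebuilt} is the multiword in the form in which it
entered \texttt{pretransfer}, with the queue \texttt{\# de menos} at
the end.  A lemma without a queue, such as \emph{dar}, yields an empty
queue and is therefore left unchanged.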
4154
4155
4156 \subsection{Advanced transfer}
4157 \label{apertium2}
4158
The shallow-transfer architecture described in the previous section is
based, as we have seen, on the automatic handling of word
co-occurrence patterns by means of rules defined by the user. This
model considers two levels from the point of view of the nature of
the data: a basic level we call the \textit{lexical level}, which
handles words and the tasks of consulting and changing their
characteristics (lemma and tags), besides translating individual
lemmas by consulting the bilingual dictionary; and another level we
call the \textit{word pattern level}, which is in charge of
reordering, when applicable, the words that make up these patterns, as
well as changing the properties of words that depend on the specific
pattern that has been detected. This whole process of detecting and
manipulating words and patterns is carried out in a single pass.
4172
In contrast, the new advanced transfer architecture is defined as a
transfer system with three levels and three passes. The first two
levels, the lexical and pattern levels, are the same as those of the
shallow-transfer system. The newly added level is a level of
\emph{patterns of patterns} of words. The aim of this new processing
level is to allow the handling and interaction of patterns of words in
a similar way to how words are handled in the patterns of the shallow
system. With this new structure we intend to achieve a more
appropriate handling of all transformations that may be required when
translating from one language to another. We want to emphasize that
the definition of word patterns in the shallow-transfer system does
not need to be the same as the definition of word patterns in the
advanced system: we intend that, in the latter, patterns have the
\textit{spirit} of phrases, which does not exist in the former
system. Therefore we will use the term \textit{chunk} to refer to word
sequences in the advanced transfer system.
4189
4190 The advanced transfer system is organized in three passes. According
4191 to the Apertium processing mode, these three passes are carried out by
4192 three different modules (programs):
4193
4194 \begin{itemize}
\item \texttt{chunker}: identifies chunks, translates word for word,
and carries out the required reorderings and the propagation of
morphosyntactic data inside the chunk (for example, to maintain
agreement). It also creates the chunks that will be processed by
the next module. The \texttt{chunker} can also run as the only
transfer module of a shallow-transfer system; this is controlled by
the \texttt{default} attribute of the \texttt{<transfer>} element.

\item \texttt{interchunk}: receives the chunks generated
by the \texttt{chunker} and is able to reorder them, modify the
``syntactic information'' associated with each chunk and, finally,
output the chunks in the new order and with the new properties,
creating new chunks if needed.

\item \texttt{postchunk}: receives the chunks modified by the
\texttt{interchunk} and carries out the final tasks: modifying the
words contained in each chunk and printing the text contained in
the chunks in the format accepted by the generator.
4213 \end{itemize}
4214
4215
4216 In the following lines we specify the format of the chunks that
4217 circulate between the modules of the transfer system (Section
4218 \ref{sec:format}) and the letter case handling in chunks (Section
4219 \ref{ss:majuscules}), which is different from case handling of
4220 individual lexical forms in a shallow-transfer system.
4221
4222
Section~\ref{formatotransfer} then describes the format of
transfer rules, which is the same for the three modules and the two
transfer modes, with minor differences. Finally,
Section~\ref{noutransfer} gives a more detailed
explanation of the three modules that make up an advanced transfer
system.
4229
4230
4231
4232
4233 \subsubsection{Chunk format}
4234 \label{sec:format}
4235
4236
4237 Communication between \texttt{chunker} and \texttt{interchunk}, as
4238 well as between \texttt{interchunk} and \texttt{postchunk}, is
performed through sequences of chunks. A
\emph{sequence of chunks} $C$ has the form:
4241 $$
4242 C=b_{0}c_{1}b_{1}c_{2}b_{2} \ldots b_{k-1}c_{k}b_{k}
4243 $$
4244
where each $b_i$ is a \textit{superblank} and each $c_i$ is a
\emph{chunk}. A chunk $c$ is defined as a string
\verb!^!$F$\verb!{!$W$\verb!}$! that contains the following
information:
4249
4250 \begin{itemize}
4251 \item $F$ is the \emph{lexical pseudoform}\nota{help: pseudoforma
4252 lèxica = lexical pseudoform or pseudolexical form}; it is a string
4253 that has the form $fE$, where $f$ is the \textit{pseudolemma} of the
4254 chunk, and $E=e_{1}e_{2} \ldots$ is a sequence of grammatical symbols
called \emph{chunk symbols}. Changing one of these symbols changes
the morphological information of every word in the chunk whose
information is linked to it.
4258 \item $W=b_{0}w_{1}b_{1}w_{2}b_{2} \ldots w_{k}b_{k}$ is the sequence
4259 of words $w_i$ sent by the chunker with the intermediate
4260 \textit{superblanks} $b_i$. These words have the same format in both
4261 transfer systems, that is, an individual word
4262 $w_i=$\verb!^!$l_{i}E_{i}$\verb!$!  contains lemma $l_i$ and
4263   grammatical symbols $E_i$, some of which can be \emph{references or links
4264   to the symbols} of the chunk and are identified with natural numbers
  \texttt{<1>}, \texttt{<2>}, \texttt{<3>}, etc. These references
  correspond, in order, to the symbols of $E$: reference number $n$
  stands for the $n$-th chunk symbol $e_n$.
4267 \end{itemize}
4268
The following example illustrates this format with the text
\emph{el gat} (``the cat''):
4271
4272 \begin{small}
4273 \begin{alltt}
\verb!^!det_nom<SN><m><sg>\verb!{^!el<det><def><2><3>$[
4275 <a href="http://www.ua.es">]^gat<n><2><3>$\verb!}$![</a>]
4276 \end{alltt}
4277 \end{small}
4278
4279 The characters \verb!{! and \verb!}!, if present in the original text,
4280 must be escaped with a backslash \verb!\!.
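
The correspondence between references and chunk symbols can be
followed in a small worked example. It is only a sketch: the chunk
below, for the Spanish text \emph{las casas} (``the houses''), has been
made up for this illustration, and its superblanks are plain spaces.

\begin{small}
\begin{alltt}
\verb!^!det_nom<SN><f><pl>\verb!{^!el<det><def><2><3>$ ^casa<n><2><3>$\verb!}$!
\end{alltt}
\end{small}

Here the pseudolemma $f$ is \texttt{det\_nom} and the chunk symbols $E$
are \texttt{<SN><f><pl>}, so that \texttt{<2>} stands for \texttt{<f>}
and \texttt{<3>} for \texttt{<pl>}. Replacing each reference with the
symbol it points to gives the words:

\begin{small}
\begin{alltt}
\verb!^!el<det><def><f><pl>$ ^casa<n><f><pl>$
\end{alltt}
\end{small}

If a later module changed the third chunk symbol to \texttt{<sg>}, both
words would become singular without being touched individually.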
4281
4282 \subsubsection{Letter case handling}
4283 \label{ss:majuscules}
4284
4285 For each chunk, the case of words is determined by the case of the
4286 pseudolemma of the chunk, taking into account the following rules:
4287
4288 \begin{itemize}
4289
4290 \item When all the letters of the pseudolemma are in lower case: the
4291 case state of words is not modified.
\item When the first letter of the pseudolemma is in upper case and
the rest are in lower case: when the \texttt{postchunk} module prints
the words, it puts in upper case the first letter of the chunk as it
stands after all the possible word reorderings\nota{MG: and
the rest will be put in lower case except proper nouns? is this
correct?}.
4298  \item When all the letters of the pseudolemma are in upper case: all
4299  the words will remain upper case.
4300 \end{itemize}
4301
4302
The words in a chunk must not be capitalized unless
they are proper nouns; otherwise the \texttt{postchunk} module would
have to look for the word that has to lose its capitalization, when
this is the case\nota{MG: I am not sure I understand this}. Removing
the capitalization is the task of the \texttt{chunker} module, which
does it with a macro or a similar mechanism.
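
As an illustration of the second rule, and only as a sketch with a
made-up chunk, suppose that the Catalan text \emph{El gat} begins a
sentence. Following the conventions above, the chunker would write the
words in lower case and record the capitalization in the pseudolemma:

\begin{small}
\begin{alltt}
\verb!^!Det_nom<SN><m><sg>\verb!{^!el<det><def><2><3>$ ^gat<n><2><3>$\verb!}$!
\end{alltt}
\end{small}

When \texttt{postchunk} prints the words, the first letter of the chunk
is put in upper case again:

\begin{small}
\begin{alltt}
\verb!^!El<det><def><m><sg>$ ^gat<n><m><sg>$
\end{alltt}
\end{small}

If a rule had moved the noun to the first position of the chunk, the
capital letter would have gone to \emph{gat} instead, which is why the
words themselves must not carry it.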
4309
4310
4311 %\settocdepth{subsection}
4312 \subsection{Format specification for structural transfer rules}
4313 \label{formatotransfer}
4314
4315
This section describes the format in which structural transfer rules
are written. The formal definitions (DTDs) can be found in the
Appendix, in Sections~\ref{ss:dtdtransfer},
\ref{ss:dtdinterchunk} and \ref{ss:dtdpostchunk}.

Structural transfer rule files have two clearly distinct parts:
one for the declaration of the elements to be used in rules, and
another one for the rules themselves.
4324
4325
4326 In the \textbf{declaration} part we find:
4327
4328 \begin{itemize}
4329
\item A series of declarations of \emph{lexical categories}, which
specify the lexical forms that will be treated as a particular
category and will be detected by patterns. The linguist may use any
data about the lexical form to define a category; categories can be
very generic (e.g. all the nouns) or very specific (e.g. only those
determiners that are demonstrative, feminine and plural).
\item A series of declarations of the \emph{attributes} we want to
detect in lexical forms (like \emph{gender}, \emph{number},
\emph{person} or \emph{tense}), so that rules can perform the required
transformations on them and send the resulting data to their
output. The declaration of an attribute contains the name of the
attribute and the possible values it can take in a lexical form (in
general they correspond to the morphological attributes that
characterize the form): for example, the attribute \emph{number} can
take the values \emph{singular}, \emph{plural}, \emph{singular-plural}
(for invariable lexical forms, like \emph{crisis} in Spanish) and
\emph{number to be determined} (for TL lexical forms with different
forms for \emph{singular} and \emph{plural}, but whose number cannot be
determined in translation because the SL lexical form
is invariable in number; see the explanation on page \pageref{pg:GD}). If
one wishes to refer, inside the rule but outside the pattern, to any of
the lexical categories defined in the previous point (to perform tests
or actions on them), it is also necessary to define attributes
for them.
4354
4355 \item A series of declarations of \emph{global variables}, which are
4356 used to transfer values of active attributes inside a rule, or from
4357 one rule to the ones applied subsequently.
4358
\item A section for the \textit{definition of string lists}, generally
lists of lemmas, in which a given value can be looked up in order to
decide whether a specific transformation has to be performed.
4362
4363 \item A series of declarations of \emph{macro-instructions};
4364 macro-instructions contain sequences of frequently used instructions,
4365 and can be included in different rules (for example, a
4366 macro-instruction to ensure gender and number agreement between two
4367 lexical forms of a pattern).
4368
4369 \end{itemize}
4370
4371 In the \textbf{structural transfer rules} we find:
4372
4373 \begin{itemize}
\item The definition of the pattern that will be detected, specified
as a sequence of lexical categories as they have been defined in the
declaration part. Note that, if a sequence of lexical
forms matches two different rules, the longest match is chosen; if
both matches have the same length, the rule defined first is
chosen.

\item The process part of the rule, where the actions to be performed
on the source-language lexical forms (SLLF) are specified, and the
target-language (TL) pattern is built.
4383
\end{itemize} \nota{Let us make sure that all acronyms are
defined}
4386
4387 In the following pages we describe in detail the characteristics of
4388 all the elements used in rules.
4389
4390
4391 \subsubsection{Element \texttt{<transfer>}}
4392
4393 (\textit{Only in the chunker module})
4394
4395 This is the root element of the \texttt{chunker} module and contains
4396 all the rest of the elements of the structural transfer rules file of
4397 this module.
4398
4399 Its attribute \texttt{default} can take two values:
4400 \begin{itemize}
4401
\item \texttt{lu}: the module runs in shallow mode, that is,
as the only transfer module of a shallow-transfer system and, therefore,
no special action is taken on words not matched by any pattern.
4405
4406 \item \texttt{chunk}: it means that it will run in advanced mode and,
4407 therefore, when a word is not recognized by any rule, a chunk will be
4408 created to encapsulate it, so that it can be processed by the next
4409 transfer modules of an advanced transfer system.
4410
4411 \end{itemize}
4412
4413 The default value is \texttt{lu}.
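
To make the overall layout concrete, the following sketch shows the
skeleton of a rules file for a \texttt{chunker} that runs in advanced
mode. The contents of the sections are left out, and the order of the
sections shown here is the one we assume; the DTD in
Section~\ref{ss:dtdtransfer} remains the authoritative reference.

\begin{small}
\begin{alltt}
<\textbf{transfer} default="chunk">
  <\textbf{section-def-cats}>    ...  </\textbf{section-def-cats}>
  <\textbf{section-def-attrs}>   ...  </\textbf{section-def-attrs}>
  <\textbf{section-def-vars}>    ...  </\textbf{section-def-vars}>
  <\textbf{section-def-lists}>   ...  </\textbf{section-def-lists}>
  <\textbf{section-def-macros}>  ...  </\textbf{section-def-macros}>
  <\textbf{section-rules}>       ...  </\textbf{section-rules}>
</\textbf{transfer}>
\end{alltt}
\end{small}

Leaving out the \texttt{default} attribute, or setting it to
\texttt{lu}, would give the shallow-mode behaviour described above.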
4414
4415 \subsubsection{Element \texttt{<interchunk>}}
4416
4417 (\textit{Only in interchunk})
4418
4419 This is the root element of the \texttt{interchunk} module and
4420 contains all the rest of the elements of the structural transfer rules
4421 file of this module.
4422
4423
4424 \subsubsection{Element \texttt{<postchunk>}}
4425
4426
4427 (\textit{Only in postchunk})
4428
4429 This is the root element of the \texttt{postchunk} module and contains
4430 all the rest of the elements of the structural transfer rules file of
4431 this module.
4432
4433
4434
4435 \subsubsection{Element for category definition section
\\\texttt{<section-def-cats>}} \nota{Beware of the polysemous use of
the word \emph{category} in the document}
4438
4439 This section contains the definition of the lexical categories that
4440 will be used to create the patterns used in rules. Each definition is
4441 made with a \texttt{<\textbf{def-cat}>}.
4442
4443
4444
4445 \subsubsection{Element for category definition \texttt{<def-cat>}}
4446
Each category definition has a mandatory name \texttt{n}
(e.g. \texttt{det}, \texttt{adv}, \texttt{prep}, etc.) and a list of
category items (\texttt{<\textbf{cat-item}>}) that define it. The name
of the category cannot contain accented characters.
4451
4452
4453 \subsubsection{Element for category \texttt{<cat-item>}}
4454
4455
4456 This element has two well-differentiated uses depending on the module
4457 it is used in.
4458
4459 \paragraph{Use in chunker (shallow transfer and advanced transfer)}
4460
4461
This element defines the lexical categories that will be used in
patterns, that is, the ones the linguist wishes to detect in the source
text. These categories are defined by a subsequence of the fine tags
(see the definition on page~\pageref{ss:introtagger}) delivered by both
the morphological analyser and the tagger\footnote{Please note that
4467 throughout the different linguistic modules, different lexical
4468 categorizations are used: in morphological dictionaries, lemmas are
4469 accompanied by a fine tag (for instance, \texttt{\emph{<n><m><pl>}}
4470 for plural masculine nouns); the POS tagger groups these fine tags in
4471 more general tags (for instance, the category \texttt{NOUN} for all
4472 the nouns), although its output is again the whole fine tag of each
4473 LF; finally, in the transfer module, the fine tags of LFs are grouped
4474 again in more general categories (although it is also possible to
4475 define particularized categories) depending on the type of lexical
4476 forms that one wants to detect in patterns.}.
4477
Each \texttt{<\textbf{cat-item}>} element has a mandatory attribute
\texttt{tags} whose value is a sequence of grammatical symbols
separated by dots; this sequence is a subsequence of the fine tag,
that is, of the sequence of grammatical symbols that defines each
possible lexical form delivered by the tagger. A category therefore
represents a certain set of lexical forms. We must define as
many different categories as kinds of lexical forms we want to detect
in patterns. Thus, if we want to detect all the nouns to perform
certain actions on them, we will create a category defined with the
grammatical symbol \texttt{n}. On the other hand, if we want to detect
all the plural feminine nouns, we will have to define a category using
the symbols \texttt{n}, \texttt{f} and \texttt{pl}.
4490
4491
4492
When a grammatical symbol used to define the category may be followed
by other grammatical symbols in the lexical forms we want to include,
the character \texttt{"*"} is used. For example,
\texttt{tags}=\texttt{"n.*"} covers all the lexical forms whose first
symbol is \texttt{n}, such as the Spanish nouns \texttt{casa<n><f><pl>} or
\texttt{coche<n><m><sg>}. On the other hand, when no other symbol can
follow the one used, the asterisk is not
included: for example, \texttt{tags}=\texttt{"}\texttt{adv"}
covers all adverbs, since in our system they are characterized by
only one grammatical symbol. The asterisk can also be used to signal
the existence of preceding symbols: \texttt{tags}=\texttt{"*.f.*"}
includes all feminine lexical forms, whatever their
category. Furthermore, an optional attribute, \texttt{lemma}, can be used
to define lexical forms on the basis of their lemma (see Figure
\ref{fig:cat-item}).
4508
4509
4510
4511 \begin{figure}
4512 \begin{small}
4513 \begin{alltt}
<\textbf{def-cat} \textsl{n}="nom">
4515   <\textbf{cat-item} \textsl{tags}="n.*"/>
4516 </\textbf{def-cat}>
4517
<\textbf{def-cat} \textsl{n}="que">
4519   <\textbf{cat-item} \textsl{lemma}="que" \textsl{tags}="cnjsub"/>
4520   <\textbf{cat-item} \textsl{lemma}="que" \textsl{tags}="rel.an.mf.sp"/>
4521 </\textbf{def-cat}>
4522 \end{alltt}
4523 \end{small}
4524 \caption{Use of the \texttt{<\textbf{cat-item}>} element to define two
4525   categories, one for nouns without lemma specification (\emph{nom}),
4526   which includes all lexical forms whose first grammatical symbol is
4527   \emph{n}, and another one with associated lemma (\emph{que}), which
4528   has two subsequences of fine tags, to include the \emph{que}
4529   conjunction and the \emph{que} relative pronoun.}
4530 \label{fig:cat-item}
4531 \end{figure}
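
The other cases discussed in the text above can be written in the same
way. The following definitions are only a sketch; the category names
\texttt{nom\_f\_pl}, \texttt{adv} and \texttt{femeni} are chosen here
for illustration and do not come from any particular rules file.

\begin{small}
\begin{alltt}
<\textbf{def-cat} n="nom_f_pl">
  <\textbf{cat-item} tags="n.f.pl"/>
</\textbf{def-cat}>

<\textbf{def-cat} n="adv">
  <\textbf{cat-item} tags="adv"/>
</\textbf{def-cat}>

<\textbf{def-cat} n="femeni">
  <\textbf{cat-item} tags="*.f.*"/>
</\textbf{def-cat}>
\end{alltt}
\end{small}

The first category matches a form such as \texttt{casa<n><f><pl>} but
not \texttt{coche<n><m><sg>}; the second matches adverbs, which carry a
single symbol; the third matches any feminine lexical form, whatever
symbols precede or follow \texttt{f}.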
4532
4533
4534 \paragraph{Use in interchunk}
4535
4536
It is used as in the \texttt{chunker} module, but here, instead of
being defined with the grammatical symbols of lexical forms, it is
defined with the symbols of the chunks delivered by the
\texttt{chunker}. For example, if we want to define a
category to detect all the determined noun phrases, we will define it
with the symbols \texttt{NP} and \texttt{DET}, provided this is how we
tagged these chunks by means of the \texttt{<tag>} instructions
contained in the \texttt{<chunk>} element (see
Section~\ref{ss:chunker}). The optional attribute \texttt{lemma} can
also be used to refer to the \emph{pseudolemma} of the chunk. So, the
formal characteristics of this element are
the same in the modules \texttt{chunker} and \texttt{interchunk}, with
the difference that in the former it is used to detect lexical
forms, and in the latter, to detect chunks.
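
A minimal sketch of such a definition follows. It assumes that the
\texttt{chunker} tags determined noun phrases with \texttt{NP} and
\texttt{DET}, followed by further symbols such as gender and number,
and that one of these chunks has the pseudolemma \texttt{det\_nom}; the
category names are made up for the example.

\begin{small}
\begin{alltt}
<\textbf{def-cat} n="SN_DET">
  <\textbf{cat-item} tags="NP.DET.*"/>
</\textbf{def-cat}>

<\textbf{def-cat} n="SN_det_nom">
  <\textbf{cat-item} lemma="det_nom" tags="NP.DET.*"/>
</\textbf{def-cat}>
\end{alltt}
\end{small}

The first category detects any chunk whose symbols begin with
\texttt{NP} and \texttt{DET}; the second restricts the match to chunks
with the given pseudolemma.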
4550
4551
4552 \paragraph{Use in postchunk}
4553
In this module, this element only has the mandatory attribute
\texttt{name}, which refers to the name of the chunk,%
\nota{MG: it used to say `the name of the rule'; comment by mlf: of the
rule or of the pattern?}
without tags, since in the \texttt{postchunk} module
only the pseudolemma (name of the chunk) is used for detection. Case
is ignored in detection, because the case of the pseudolemma is used
to convey information about the case of the chunk (see Figure
\ref{fig:cat-item-postchunk}).
4563
4564 \begin{figure}
4565 \begin{small}
4566 \begin{alltt}
<\textbf{def-cat} \textsl{n}="det-nom">
4568   <\textbf{cat-item} \textsl{name}="det-nom"/>
4569 </\textbf{def-cat}>
4570 \end{alltt}
4571 \end{small}
4572 \caption{Use of the \texttt{<\textbf{cat-item}>} element in the
4573 postchunk to detect chunks of determiner-noun.}
4574 \label{fig:cat-item-postchunk}
4575 \end{figure}
4576
4577
4578
4579 \subsubsection{Element for category attribute definition section
4580 \\\texttt{<section-def-attrs>}}
4581
4582
This section describes the attributes that will be extracted
from the categories detected by the pattern and that will be used in
the action part of the rules. Each attribute is defined by a
\texttt{<\textbf{def-attr}>} element.

\nota{Sometimes tags appear in the text in boldface and
sometimes without it. Let us decide on one typography and use it
throughout the document.}
4591
4592
4593 \subsubsection{Element for category attribute definition
4594 \\\texttt{<def-attr>}}
4595
Each \texttt{<\textbf{def-attr}>} defines an attribute regarding
morphological information (either inflection information, such as
gender, number or person, or categorial information, such as verb or
adjective) by specifying a list of category attribute
(\texttt{<\textbf{attr-item}>}) elements, and has a mandatory unique
name \texttt{n}. An attribute is therefore defined on the basis of
the grammatical symbols that can be found in a given lexical
form. Each attribute extracts, from each lexical form of the pattern,
the symbols that the form contains from among the set of possible
values defined.
4606
4607 \subsubsection{Element for category attribute \texttt{<attr-item>}}
4608
4609 Each category attribute element represents one of the possible values
4610 the attribute can take. For example, the attribute for number
4611 \texttt{nbr} can take the values singular \texttt{sg}, plural
4612 \texttt{pl}, singular--plural \texttt{sp} and number to be determined
4613 \texttt{ND}. These values are a subsequence of the morphological tags
4614 that characterize each lexical form, and are specified in the
4615 \texttt{tags} attribute of the element, separated by a dot if there is
4616 more than one. In Figure \ref{fig:attr-item} you can find an example
for the attributes for \emph{number} and \emph{noun}.  \nota{Perhaps
the reason for choosing the name \emph{a\_nom} in the figure should be
explained}
4620
4621 Compare the definition of the attribute for number in this figure
4622 (with all possible values and without asterisks) with the definition
4623 of the category for noun in Figure \ref{fig:cat-item}.
4624
4625
4626
4627 \begin{figure}
4628 \begin{small}
4629 \begin{alltt}
<\textbf{def-attr} \textsl{n}="nbr">
4631   <\textbf{attr-item} \textsl{tags}="sg"/>
4632   <\textbf{attr-item} \textsl{tags}="pl"/>
4633   <\textbf{attr-item} \textsl{tags}="sp"/>
4634   <\textbf{attr-item} \textsl{tags}="ND"/>
4635 </\textbf{def-attr}>
4636
<\textbf{def-attr} \textsl{n}="a_nom">
4638   <\textbf{attr-item} \textsl{tags}="n"/>
4639   <\textbf{attr-item} \textsl{tags}="n.acr"/>
4640 </\textbf{def-attr}>
4641
4642 \end{alltt}
4643 \end{small}
4644 \caption{Definition of the category attribute \texttt{nbr} for
4645   \emph{number}, which can take the values \emph{singular},
4646   \emph{plural}, \emph{singular-plural} or
4647  \emph{number to be determined}, and the category attribute
4648 \texttt{a\_nom} for \emph{noun}, which can take the values of the
4649 symbols \emph{n} or \emph{n acr}.}
4650 \label{fig:attr-item}
4651 \end{figure}
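
As a worked example of what these attributes extract, and assuming the
definitions of Figure~\ref{fig:attr-item}, the following table lists
the value that each attribute would take for three Spanish lexical
forms (the forms are chosen here for illustration):

\begin{small}
\begin{alltt}
lexical form         a_nom    nbr
casa<n><f><pl>       <n>      <pl>
coche<n><m><sg>      <n>      <sg>
crisis<n><f><sp>     <n>      <sp>
\end{alltt}
\end{small}

The gender symbols are not extracted by either attribute, because
neither \texttt{f} nor \texttt{m} is among their declared values; a
separate attribute would have to be defined for gender.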
4652
4653
4654 \subsubsection{Element for variable definition section
4655 \\\texttt{<section-def-vars>}}
4656
In this section, \texttt{<\textbf{def-var}>} elements are used to define
global string variables, which will be used to transfer information
inside a rule and from one rule to another (for example, to
transmit information on gender or number between two patterns).
4661
4662
\nota{It should be made clear that this transfer from one rule to
another only takes place from one application of a rule to the
application of another rule at a later moment, or from left to right}
4666
4667 \subsubsection{Element for variable definition \texttt{<def-var>}}
4668 \label{ss:defvar} The definition of a global string variable has a
4669 mandatory unique name \texttt{n} that will be used to refer to it
4670 inside a rule.  Variables contain strings that describe state
4671 information, such as the existence of agreement between two elements,
4672 the detection of a question mark in SL that should be deleted in TL,
4673 etc.
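
A minimal sketch of a variable definition section is given below. The
variable names are hypothetical; they stand for the kinds of state
information just mentioned.

\begin{small}
\begin{alltt}
<\textbf{section-def-vars}>
  <\textbf{def-var} n="nombre"/>
  <\textbf{def-var} n="genere"/>
  <\textbf{def-var} n="interrogativa"/>
</\textbf{section-def-vars}>
\end{alltt}
\end{small}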
4674
4675
4676 \subsubsection{Element for string lists definition section
4677 \\\texttt{<section-def-lists>}} In this section, lists are defined
4678 (with \texttt{<\textbf{def-list}>} tags) that will be used to do
string searches.  These lists can be used to group word lemmas that
have a common feature (e.g. verbs expressing movement, adjectives
expressing emotions, etc.).  This section is optional.
4682
4683 \subsubsection{Element for string lists definition
4684 \texttt{<def-list>}} This element is used to name the string list,
4685 with the attribute \texttt{n}, and to encapsulate the list defined by
4686 one or more \texttt{<\textbf{list-item}>} elements. An example of its
4687 use can be found in Figure \ref{fig:deflist}.
4688
4689 \subsubsection{Element for string list item \texttt{<list-item>}} It
4690  defines, with the value of the attribute \texttt{v}, the specific
4691  string that is included in the definition of the list.  An example of
4692  its use can be found in Figure \ref{fig:deflist}.
4693
4694
4695
4696
4697 \begin{figure}
4698 \begin{small}
4699 \begin{alltt}
4700 <\textbf{def-list} n="verbos_est">
4701   <\textbf{list-item} v="actuar"/>
4702   <\textbf{list-item} v="buscar"/>
4703   <\textbf{list-item} v="estudiar"/>
4704   <\textbf{list-item} v="existir"/>
4705   <\textbf{list-item} v="ingressar"/>
4706   <\textbf{list-item} v="introduir"/>
4707   <\textbf{list-item} v="penetrar"/>
4708   <\textbf{list-item} v="publicar"/>
4709   <\textbf{list-item} v="treballar"/>
4710   <\textbf{list-item} v="viure"/>
4711 <\textbf{/def-list}>
4712 \end{alltt}
4713 \end{small}
4714 \caption{Definition of a list of Catalan lemmas. These lemmas are used
4715 in the rule in Figure \ref{fig:in}.}
4716 \label{fig:deflist}
4717 \end{figure}
4718
4719
4720 \subsubsection{Element for macro-instruction definition section
4721 \\\texttt{<section-def-macros>}}
4722
4723 This section is for the definition of macro-instructions that contain
4724 pieces of code used frequently in the action part of the rules.
4725
4726 \subsubsection{Element for macro-instruction definition
4727 \texttt{<def-macro>}}
4728
4729 Each macro-instruction definition has a mandatory name (the value of
4730 the attribute \texttt{n}), the number of arguments it receives
4731 (attribute \texttt{npar}) and a body with instructions.
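
The following sketch shows what a very simple macro-instruction could
look like. It is not taken from any existing rules file: the name
\texttt{concord\_simple} is made up, the attributes \texttt{gen} and
\texttt{nbr} are assumed to be declared in
\texttt{<section-def-attrs>}, and the instructions
\texttt{<let>} and \texttt{<clip>} are the ones described later in this
section. We also assume that, inside the macro, \texttt{pos} refers to
the position of the argument in the call and not to a position in the
pattern.

\begin{small}
\begin{alltt}
<\textbf{section-def-macros}>
  <\textbf{def-macro} n="concord_simple" npar="2">
    <\textbf{let}>
      <\textbf{clip} pos="2" side="tl" part="gen"/>
      <\textbf{clip} pos="1" side="tl" part="gen"/>
    </\textbf{let}>
    <\textbf{let}>
      <\textbf{clip} pos="2" side="tl" part="nbr"/>
      <\textbf{clip} pos="1" side="tl" part="nbr"/>
    </\textbf{let}>
  </\textbf{def-macro}>
</\textbf{section-def-macros}>
\end{alltt}
\end{small}

This macro copies the TL gender and number of its first argument onto
its second argument; a realistic agreement macro would usually test
several conditions before doing so.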
4732
4733
4734 \subsubsection{Element for rules section \texttt{<section-rules>}}
4735
4736 This section contains the structural transfer rules, each one in a
4737 \texttt{<\textbf{rule}>} element.
4738
4739 \subsubsection{Element for rule \texttt{<rule>}}
4740
4741 Each rule has a pattern (\texttt{<\textbf{pattern}>}) and the
4742 associated action (\texttt{<\textbf{action}>}) performed when the
4743 pattern is matched.
4744
A rule can have an optional attribute \texttt{comment}, which usually
holds a short description of what the rule does.
4747
4748 \subsubsection{Element for pattern \texttt{<pattern>}}
4749
4750 A pattern is specified using pattern items
4751 (\texttt{<\textbf{pattern-\\item}>}), each one of which corresponds to
4752 a lexical form in the matched pattern, in order of appearance.
4753
4754 \subsubsection{Element for pattern constituent
4755 \texttt{<pattern-item>}}
4756
Each pattern item specifies, in its mandatory attribute
\texttt{n}, which kind of lexical form is to be matched.  To do that,
one has to use the categories defined in
\texttt{<\textbf{section-def-cats}>} (see in Figure \ref{fig:regla}
the definition of a pattern for determiner--noun).
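
The elements introduced so far fit together as in the following
skeleton of a rule. It is only a sketch: it assumes that the categories
\texttt{det} and \texttt{nom} have been defined in
\texttt{<section-def-cats>}, and the instructions of the action are
left out.

\begin{small}
\begin{alltt}
<\textbf{section-rules}>
  <\textbf{rule} comment="determiner + noun">
    <\textbf{pattern}>
      <\textbf{pattern-item} n="det"/>
      <\textbf{pattern-item} n="nom"/>
    </\textbf{pattern}>
    <\textbf{action}>
      ...
    </\textbf{action}>
  </\textbf{rule}>
</\textbf{section-rules}>
\end{alltt}
\end{small}

Inside the action, the determiner is referred to as position 1 and the
noun as position 2, following the order of the pattern items.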
4762
4763
4764 \subsubsection{Element for action \texttt{<action>}}
4765
This element contains the ``instructions'' that have to be executed to
process each matched pattern as desired.
4768
4769 The processing part for matched patterns is a block of zero or more
4770 instructions of the kind: \texttt{<\textbf{choose}>} (conditional
4771 processing), \texttt{<\textbf{let}>} (value assignment),
4772 \texttt{<\textbf{out}>} (print TL lexical forms),
4773 \texttt{<\textbf{modify-case}>} (modify case state of a lexical form),
4774 \texttt{<\textbf{call-macro}>} (call a macro-instruction) and
4775 \texttt{<\textbf{append}>} (concatenate strings).
4776
4777
During the processing step, depending on whether a series of
conditions are met or not, different operations are carried
out, such as creating agreement between pattern components, which is
necessary when these undergo gender or number changes in the lexical
transfer process. Although these operations work on target-language
lexical forms (TLLF), the SL information is also taken into account
since, for example, if pattern
components do not agree in SL, they may not have to agree in TL
either. As a result of applying the different
operations to a pattern, values are assigned to pattern attributes
and, if applicable, to global or state variables, and the information
on the resulting TL pattern is sent to the next module (the
morphological generator in a shallow-transfer system, or the next
transfer module in an advanced transfer system).
4791
4792
4793 \subsubsection{Element for macro-instruction call
4794 \texttt{<call-macro>}}
4795
4796 In a rule it is possible to call any of the macro-instructions defined
4797 in \texttt{<\textbf{section-def-macros}>}. To do this, one has to
4798 specify the name of the macro-instruction in the \texttt{n} attribute,
4799 and one or more arguments in the parameter element
4800 \texttt{<\textbf{with-param}>} (see next).
4801
4802 \subsubsection{Element for parameters \texttt{<with-param>}}
4803
4804 This element is used inside a macro-instruction call
4805 \texttt{<\textbf{call-macro}>}. The \texttt{pos} attribute of an
4806 argument is used to refer to a lexical form of the rule from where the
4807 macro-instruction is called. For example, if a macro-instruction with
4808 2 parameters has been defined, to make agreement operations between
4809 noun--adjective, it can be used with arguments 1 and 2 in a rule for
4810 noun--adjective, with arguments 2 and 3 in a rule for
4811 determiner--noun--adjective, with arguments 1 and 3 in a rule for
4812 noun--adverb--adjective and with arguments 2 and 1 in a rule for
4813 adjective--noun. You can see an example of macro-instruction call in
4814 Figure \ref{fig:macro}.
4815
4816 \begin{figure}
4817 \begin{small}
4818 \begin{alltt}
4819 <\textbf{call-macro} n="f_concord2">
4820   <\textbf{with-param} pos="3"/>
4821   <\textbf{with-param} pos="1"/>
4822 <\textbf{/call-macro}>
4823 \end{alltt}
4824 \end{small}
\caption{Call of the macro-instruction \texttt{f\_concord2} designed to
4826 create agreement between two elements in a pattern such as
4827 determiner--adverb--noun. Propagation of gender and number is done
4828 from one of the components, in this case, from the noun which is the
4829 third element of the pattern (3). Therefore, the position of the noun
4830 is the first parameter given, and the other parameters come
4831 next. Since the adverb (in position 2) does not need agreement
4832 information, only the position of the determiner is specified (1).}
4833 \label{fig:macro}
4834 \end{figure}
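
The noun--adjective case mentioned in the text can be sketched in the
same way. Assuming a two-parameter macro-instruction for
noun--adjective agreement, called \texttt{f\_concord\_nom\_adj} here
only for the sake of the example, a rule for
determiner--noun--adjective would call it with the positions of the
noun and the adjective:

\begin{small}
\begin{alltt}
<\textbf{call-macro} n="f_concord_nom_adj">
  <\textbf{with-param} pos="2"/>
  <\textbf{with-param} pos="3"/>
</\textbf{call-macro}>
\end{alltt}
\end{small}

In a rule for adjective--noun the same macro-instruction would be
called with \texttt{pos="2"} first and \texttt{pos="1"} second, so that
the noun is always the first argument.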
4835
4836
4837
4838 \subsubsection{Element for selection \texttt{<choose>}}
4839 \label{choose}
4840
4841 The selection instruction consists of one or more conditional options
4842 (\texttt{<\textbf{when}>}) and an alternative option
4843 \texttt{<\textbf{otherwise}>}, which is optional.
4844
4845
4846 \subsubsection{Element for condition \texttt{<when>}}
4847
4848 This element describes a conditional option (see Section
4849 \ref{choose}).  It contains the condition to be tested
4850 \texttt{<\textbf{test}>} and one block of zero or more instructions of
4851 the kind \texttt{<\textbf{choose}>}, \texttt{<\textbf{let}>},
4852 \texttt{<\textbf{out}>}, \texttt{<\textbf{modify-case}>},
4853 \texttt{<\textbf{call-macro}>} or \texttt{<\textbf{append}>}, \nota{OK
4854 append?} which will be executed if the above condition is met.
4855
4856 \subsubsection{Element for alternative option \texttt{<otherwise>}}
4857
4858 The element \texttt{<\textbf{otherwise}>} contains one block of one or
4859 more instructions (of the kind \texttt{<\textbf{choose}>},
4860 \texttt{<\textbf{let}>}, \texttt{<\textbf{out}>},
4861 \texttt{<\textbf{modify-case}>}, \texttt{<\textbf{call-macro}>} and
4862 \texttt{<\textbf{append}>}) that must be executed if none of the
4863 conditions described in the \texttt{<\textbf{when}>} elements of a
4864 \texttt{<\textbf{choose}>} is met.
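
The following sketch shows a complete selection with one conditional
option and an alternative option. It assumes an attribute \texttt{nbr}
like the one in Figure~\ref{fig:attr-item} and a global variable named
\texttt{nombre}, which is hypothetical; the elements \texttt{<clip>},
\texttt{<lit>}, \texttt{<lit-tag>} and \texttt{<var>} are described
later in this section.

\begin{small}
\begin{alltt}
<\textbf{choose}>
  <\textbf{when}>
    <\textbf{test}>
      <\textbf{equal}>
        <\textbf{clip} pos="1" side="tl" part="nbr"/>
        <\textbf{lit-tag} v="pl"/>
      </\textbf{equal}>
    </\textbf{test}>
    <\textbf{let}>
      <\textbf{var} n="nombre"/>
      <\textbf{lit} v="plural"/>
    </\textbf{let}>
  </\textbf{when}>
  <\textbf{otherwise}>
    <\textbf{let}>
      <\textbf{var} n="nombre"/>
      <\textbf{lit} v="other"/>
    </\textbf{let}>
  </\textbf{otherwise}>
</\textbf{choose}>
\end{alltt}
\end{small}

If the first lexical form of the pattern is plural in TL, the variable
receives the string \texttt{plural}; in any other case the instructions
in \texttt{<otherwise>} are executed.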
4865
4866 \subsubsection{Element for evaluation \texttt{<test>}}
4867
4868 The test element \texttt{<\textbf{test}>} in a condition element
4869 \texttt{<\textbf{when}>} can contain a conjunction
4870 (\texttt{<\textbf{and}>}), a disjunction (\texttt{<\textbf{or}>}) or a
4871 negation (\texttt{<\textbf{not}>}) of conditions to be tested, as well
4872 as a simple condition of string equality (\texttt{<\textbf{equal}>}),
4873 string beginning (\texttt{<\textbf{begins-with}>}), string end
4874 (\texttt{<\textbf{ends-with}>}), substring
4875 (\texttt{<\textbf{contains-substring}>}) or inclusion in a set
4876 (\texttt{<\textbf{in}>}).
4877
\nota{The wording of the last paragraph can surely be improved; it was
changed by mlf so that it includes all the simple boolean
conditions.}
4881
4882 \subsubsection{Elements for conditional or boolean operators:
4883 \texttt{<equal>}, \texttt{<and>}, \texttt{<or>}, \texttt{<not>},
4884 \texttt{<in>}}
4885
4886 \nota{To be completed: add \texttt{contains-substring},
4887 \texttt{ends-with}, \texttt{begins-with}, etc.}
4888
4889 \begin{itemize}
4890
4891 \item The conjunction element \texttt{<\textbf{and}>} represents a
4892 condition, consisting of two or more conditions, that is met when all
4893 included conditions are true. An example of its use can be found in
4894 Figure \ref{fig:regla}.
4895
4896 \item The disjunction element \texttt{<\textbf{or}>} represents a
4897 condition, consisting of two or more conditions, that is met when at
4898 least one of the included conditions is true. Figure \ref{fig:ornot}
4899 displays an example of this condition type used when testing gender
4900 agreement in a SL pattern.
4901
4902 \item The negation element \texttt{<\textbf{not}>} represents a
4903 condition that is met when the included condition is not met, and vice
4904 versa. An example of negation of an equality can be found in Figure
4905 \ref{fig:ornot}.
4906
\item The conditional equality operator \texttt{<\textbf{equal}>} is
an instruction that evaluates whether two arguments (two strings) are
identical. See examples of its use in Figures \ref{fig:clip}
4910 and \ref{fig:lit-tag}.  In addition, this operator can have the
4911 attribute \texttt{caseless}, which, when its value is \texttt{yes},
4912 causes the comparison of strings to be made ignoring case.  \nota{All
4913 string conditional tests have the attribute \texttt{caseless}; also
4914 \texttt{in} described below}
4915
\item The ``search in lists'' operator \texttt{<\textbf{in}>} is used to
search for a value (specified as the first parameter of the condition)
in a list referred to by the \texttt{n} attribute of the
\texttt{<\textbf{list}>} element; this list must be defined in the
appropriate section (\texttt{<\textbf{section-def-lists}>}).  The
4921 search result is true if the value is found in the list.  This
4922 comparison can also use the attribute \texttt{caseless}: if its value
4923 is \texttt{yes}, the search is done ignoring case. Figure \ref{fig:in}
4924 shows an example of its use.
4925
4926 \end{itemize}
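
The other string tests named in the description of
\texttt{<\textbf{test}>} (\texttt{<\textbf{begins-with}>},
\texttt{<\textbf{ends-with}>} and
\texttt{<\textbf{contains-substring}>}) are not described item by item
here. The following sketch assumes that they take two arguments in the
same way as \texttt{<\textbf{equal}>} (first the string to examine,
then the string to look for) and that they accept the attribute
\texttt{caseless}; the prefix \emph{des} is only an illustrative
value.

\begin{small}
\begin{alltt}
<\textbf{test}>
  <\textbf{begins-with} \textsl{caseless}="yes">
    <\textbf{clip} \textsl{pos}="1" \textsl{side}="sl" \textsl{part}="lem"/>
    <\textbf{lit} \textsl{v}="des"/>
  <\textbf{/begins-with}>
<\textbf{/test}>
\end{alltt}
\end{small}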
4927
\nota{The whole preceding discussion should be unified, factoring out
what is common.}

\nota{The remaining conditional elements, which are missing here, need
to be described.}
4931
4932
4933 \subsubsection{Element \texttt{<clip>}}
4934 \label{ss:clip}
4935
4936
4937 The \texttt{<\textbf{clip}>} element represents a substring of a SL or
4938 TL lexical form, defined by the value of its different attributes (see an
4939 example in Figure \ref{fig:clip}):
4940
4941 \begin{itemize}
\item \texttt{pos} is an index (1, 2, 3, etc.) used to select a
lexical form inside a rule: it refers to the place the lexical form
occupies in the pattern. In the \textit{postchunk} module there is
also the index ``0'', which refers to the pseudolemma of the chunk
\nota{MG: is it not "lexical pseudoform"?}, which is treated as a word
in its own right so that its information can be consulted and used to
make decisions.
4949
4950 \item \texttt{side} \textit{(only in the \texttt{chunker} module)}
4951 specifies if the selected \emph{clip} is from the source language
4952 (\texttt{sl}) or from the target language (\texttt{tl}).
4953
4954 \item \texttt{part} indicates which part of the lexical form is
4955 processed; generally its value is one of the attributes defined in
4956 \texttt{<\textbf{section-def-\\attrs}>} (\texttt{gen}, \texttt{nbr},
4957 etc.), although it can also take four predefined values: \texttt{lem}
4958 (refers to the lemma of the lexical form), \texttt{lemh} (the first
4959 part of a split lemma), \texttt{lemq} (the queue of a split lemma),
4960 and \texttt{whole} (the whole lexical form, including lemma and all
4961 grammatical symbols, which may have been modified in the preceding
4962 part of the rule).
4963
\item \texttt{link-to} \textit{(only in the \texttt{chunker} module in
  advanced mode)} replaces the value that would result from consulting
  the rest of the attributes of the clip with the value specified in
  this attribute, which must be a natural number ($>0$). \nota{MG:
  explain the new characteristics - Sergio?} This number indicates
  which \texttt{<\textbf{tag}>} of the \texttt{<\textbf{chunk}>} the
  clip content is linked to, the number being the position this tag
  occupies inside the element \texttt{<\textbf{tags}>}. The other
  attributes of the clip remain only for informational purposes, since
  they are overridden by the value of the linked tag. An example of
  its use can be found in Figure \ref{fig:chunkintrachunk}.
4975
4976 \end{itemize}
4977
4978
4979 \begin{figure}
4980 \begin{small}
4981 \begin{alltt}
4982     <\textbf{test}>
4983       <\textbf{not}>
4984         <\textbf{equal}>
4985           <\textbf{clip} \textsl{pos}="2" \textsl{side}="tl" \textsl{part}="gen"/>
4986           <\textbf{clip} \textsl{pos}="2" \textsl{side}="sl" \textsl{part}="gen"/>
4987         <\textbf{/equal}>
4988       <\textbf{/not}>
4989     <\textbf{/test}>
4990 \end{alltt}
4991 \end{small}
\caption{Extract from a rule where it is tested whether the TL
(\texttt{tl}) gender (\texttt{gen}) of the second lexical unit
identified in a pattern is different from the gender of the same
lexical unit in the SL (\texttt{sl}).}
4996 \label{fig:clip}
4997 \end{figure}
4998
4999
5000
5001 \subsubsection{Element for literal string \texttt{<lit>}} This element
5002 is used to specify the value of a literal string by means of the
5003 attribute \texttt{v}. For example, \texttt{<\textbf{lit}
5004 v=\texttt{"}andar\texttt{"}/>} represents the string \emph{andar}.
5005
5006
5007 \subsubsection{Element for tag value \texttt{<lit-tag>}} It is similar
5008 to the \texttt{<\textbf{lit}>} element, with the difference that it
5009 does not specify the value of a literal string but the value of a
5010 grammatical symbol or tag, by means of the attribute \texttt{v}. An
5011 example of its use can be found in Figure \ref{fig:lit-tag}.
5012
5013
5014 \begin{figure}
5015 \begin{small}
5016 \begin{alltt}
5017 <\textbf{equal}>
5018   <\textbf{clip} \textsl{pos}="2" \textsl{side}="tl" \textsl{part}="nbr"/>
5019   <\textbf{lit-tag} \textsl{v}="ND"/>
5020 <\textbf{/equal}>
5021 \end{alltt}
5022 \end{small}
5023 \caption{Use of the element \texttt{<\textbf{lit-tag}>}: it is tested
5024   whether the number (\texttt{nbr}) symbol of the second
5025   lexical unit in the TL (\texttt{tl}) is \texttt{ND} (number to be
5026   determined)}
5027 \label{fig:lit-tag}
5028 \end{figure}
5029
5030 \begin{figure}
5031 \begin{small}
5032 \begin{alltt}
5033    <\textbf{test}>
5034     <\textbf{or}>
5035       <\textbf{not}>
5036         <\textbf{equal}>
5037           <\textbf{clip} \textsl{pos}="1" \textsl{side}="sl" \textsl{part}="gen"/>
5038           <\textbf{clip} \textsl{pos}="3" \textsl{side}="sl" \textsl{part}="gen"/>
5039         <\textbf{/equal}>
5040       <\textbf{/not}>
5041       <\textbf{not}>
5042         <\textbf{equal}>
5043           <\textbf{clip} \textsl{pos}="2" \textsl{side}="sl" \textsl{part}="gen"/>
5044           <\textbf{clip} \textsl{pos}="3" \textsl{side}="sl" \textsl{part}="gen"/>
5045         <\textbf{/equal}>
5046       <\textbf{/not}>
5047     <\textbf{/or}>
5048   <\textbf{/test}>
5049 \end{alltt}
5050 \end{small}
5051 \caption{Extract from a rule where it is tested whether the SL gender
5052   of the first or the second lexical unit matched in a pattern (it
5053   could be, for example, determiner--adjective--noun) is different
5054   from the gender of the third lexical unit also in the SL.}
5055 \label{fig:ornot}
5056 \end{figure}
5057
5058
5059
5060 \subsubsection{Element for variable \texttt{<var>}}
5061
5062
5063 Each \texttt{<\textbf{var}>} is a variable identifier: the mandatory
5064 attribute \texttt{n} specifies its name as has been defined in
5065 \texttt{<\textbf{section-def-vars}>}. When it appears in an
5066 \texttt{<\textbf{out}>}, a \texttt{<\textbf{test}>}, or the right part
5067 of a \texttt{<\textbf{let}>}, it represents the value of the variable;
5068 when it appears on the left side of a \texttt{<\textbf{let}>}, in an
5069 \texttt{<\textbf{append}>} or in a \texttt{<\textbf{modify-case}>}, it
5070 represents the reference of the variable and its value can be changed.
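
A minimal sketch of both roles, assuming a variable hypothetically
named \texttt{genero} that has been declared in
\texttt{<\textbf{section-def-vars}>} and a gender tag \texttt{f}: in
the \texttt{<\textbf{let}>} the variable is used as a reference and
receives a new value, whereas in the \texttt{<\textbf{test}>} only its
value is read.

\begin{small}
\begin{alltt}
<\textbf{let}>
  <\textbf{var} \textsl{n}="genero"/>   <!-- reference: the value is changed -->
  <\textbf{clip} \textsl{pos}="2" \textsl{side}="tl" \textsl{part}="gen"/>
<\textbf{/let}>
<\textbf{choose}>
  <\textbf{when}>
    <\textbf{test}>
      <\textbf{equal}>
        <\textbf{var} \textsl{n}="genero"/>   <!-- value: only read -->
        <\textbf{lit-tag} \textsl{v}="f"/>
      <\textbf{/equal}>
    <\textbf{/test}>
    <!-- ... -->
  <\textbf{/when}>
<\textbf{/choose}>
\end{alltt}
\end{small}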
5071
5072 \subsubsection{Element for reference to string list \texttt{<list>}}
5073
This element is only used as the second parameter of an
\texttt{<\textbf{in}>} search.  The \texttt{n} attribute refers to the
5076 specific list defined in the string lists definition section
5077 \texttt{<\textbf{section-def-lists}>}. An example of its use can be found in
5078 Figure \ref{fig:in}.
5079
5080
5081 \begin{figure}
5082 \begin{small}
5083 \begin{alltt}
5084     <\textbf{rule}>
5085       <\textbf{pattern}>
5086         <\textbf{pattern-item} \textsl{n}="verb"/>
5087         <\textbf{pattern-item} \textsl{n}="a"/>
5088       <\textbf{/pattern}>
5089       <\textbf{action}>
5090       <\textbf{choose}>
5091         <\textbf{when}>
5092           <\textbf{test}>
            <\textbf{in} \textsl{caseless}="yes">
5094               <\textbf{clip} \textsl{pos}="1" \textsl{side}="sl" \textsl{part}="lem"/>
5095               <\textbf{list} \textsl{n}="verbos_est"/>
5096             <\textbf{/in}>
5097           <\textbf{/test}>
5098           <\textbf{let}>
5099             <\textbf{clip} \textsl{pos}="2" \textsl{side}="tl" \textsl{part}="lem"/>
5100             <\textbf{lit} \textsl{v}="en"/>
          <\textbf{/let}>
        <\textbf{/when}>
        <!-- ... -->
5104 \end{alltt}
5105 \end{small}
\caption{Extract of a rule that detects a pattern made of a verb and
  the preposition \emph{a}, and then tests whether the verb (the
  lemma indicated in \texttt{lem}) of the source language
  (\texttt{sl}) is one of the lemmas included in the list of state
  verbs (defined in Figure \ref{fig:deflist}). If that is the case,
  the lemma of the second word in the target language (\texttt{tl}) is
  changed to \emph{en}.}
5113 \label{fig:in}
5114 \end{figure}
5115
5116
5117 \subsubsection{Element for case application \texttt{<get-case-from>}}
5118
5119 The \texttt{<\textbf{get-case-from}>} element represents the string
5120 obtained after applying the letter case state of the lemma of a SL
5121 lexical unit to a string (\emph{clip}, \emph{lit} or \emph{var}).  To
5122 refer to the lexical unit from where the information is taken, the
5123 attribute \texttt{pos} is used, which indicates the position of that
5124 unit in the SL. This element is useful when the lexical units in a
5125 pattern are reordered, or when a lexical unit is added or deleted. You
5126 can see an example of its use in Figure \ref{fig:case}, which displays
5127 a rule to transform the simple perfect preterite tense in Spanish
5128 (\emph{dije}, "I said") into the compound form in Catalan (\emph{vaig
5129 dir}). In this rule, a LF with lemma \emph{anar} and grammatical
5130 symbol \emph{vaux} ("auxiliary verb") is added; it has to take the
5131 case information from the Spanish verb (which has position "1" in the
5132 pattern), so that the system translates \emph{Dije} as \emph{Vaig
5133 dir}, \emph{dije} as \emph{vaig dir} and \emph{DIJE} as \emph{VAIG
5134 DIR}.
5135
5136
5137 \subsubsection{Element for case pattern query \texttt{<case-of>}}
5138
It is used to get the case pattern of a string, that is, one of the
values "\texttt{aa}", "\texttt{Aa}" or "\texttt{AA}". It works like the
\texttt{<\textbf{clip}>} element, since it has the same attributes:
\texttt{pos}, the position of the word in the matched pattern;
\texttt{part}, the part of the lexical form that we refer to (normally
the lemma), which accepts the values described in Section
\ref{ss:clip}; and finally, only in the \texttt{chunker} module, the
attribute \texttt{side}, referring to the translation side,
\texttt{sl} or \texttt{tl}. In Figure \ref{fig:case} you can see this
element in use, and you can find a more detailed description of this
example in the following section (description of
\texttt{<\textbf{modify-case}>}).
5151
5152
5153 \subsubsection{Element for case modification \texttt{<modify-case>}}
5154
This instruction is used to modify the case of the first parameter
(usually a lemma) by means of the second parameter (a literal or a
variable). The first parameter can be a \texttt{<\textbf{var}>}, a
\texttt{<\textbf{clip}>} or a \texttt{<\textbf{case-of}>}, whereas the
second one can be anything that delivers a value, but in principle it
will be a \texttt{<\textbf{var}>} or a \texttt{<\textbf{lit}>}.  The
values that the second parameter can take are usually ``\texttt{Aa}'',
to express that the ``left part'' of this case modification must have
the first letter in upper case and the rest in lower case,
``\texttt{aa}'' to put everything in lower case, and ``\texttt{AA}''
to put everything in upper case.
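
For instance, the following minimal sketch would force the TL lemma of
the first lexical form of a pattern to lower case, with a
\texttt{<\textbf{clip}>} as first parameter and a literal as second
parameter (the position and the part chosen here are only
illustrative):

\begin{small}
\begin{alltt}
<\textbf{modify-case}>
  <\textbf{clip} \textsl{pos}="1" \textsl{side}="tl" \textsl{part}="lem"/>
  <\textbf{lit} \textsl{v}="aa"/>
<\textbf{/modify-case}>
\end{alltt}
\end{small}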
5166
Figure \ref{fig:case} shows a rule where this element is used. In this
rule it modifies the case of the TL lemma in position "1", which
corresponds to \emph{dir}, because, although in the rule output this
verb is the second lexical form (\emph{vaig dir}), it is actually the
translation of the LF which has position 1 in the SL, and, therefore,
it retains the same assigned position in the TL. This lemma is
assigned the value ``\texttt{aa}'' in the case that the SL lemma has
the state ``\texttt{Aa}''. There is nothing to specify for the rest of
the cases, since the case state of the LF in position 1 will be the
same in the SL and in the TL and, therefore, will be automatically
transferred (see page~\pageref{mayusc} for more information
on letter case handling in dictionaries).
5179
5180
5181 \begin{figure}
5182 \begin{small}
5183 \begin{alltt}
5184 <\textbf{rule}>
5185   <\textbf{pattern}>
5186     <\textbf{pattern-item} n="pretind"/>
5187   <\textbf{/pattern}>
5188   <\textbf{action}>
5189     <\textbf{out}>
5190       <\textbf{lu}>
         <\textbf{get-case-from} pos="1">
5192            <\textbf{lit} v="anar"/>
5193          <\textbf{/get-case-from}>
5194          <\textbf{lit-tag} v="vaux"/>
5195          <\textbf{clip} pos="1" side="sl" part="persona"/>
5196          <\textbf{clip} pos="1" side="sl" part="nbr"/>
5197        <\textbf{/lu}>
5198        <\textbf{b/}>
5199      <\textbf{/out}>
5200      <\textbf{choose}>
5201        <\textbf{when}>
5202          <\textbf{test}>
5203            <\textbf{equal}>
5204               <\textbf{case-of} pos="1" side="sl" part="lemh"/>
5205               <\textbf{lit} v="Aa"/>
5206            <\textbf{/equal}>
5207          <\textbf{/test}>
5208          <\textbf{modify-case}>
5209              <\textbf{case-of} pos="1" side="tl" part="lemh"/>
5210              <\textbf{lit} v="aa"/>
5211          <\textbf{/modify-case}>
5212        <\textbf{/when}>
5213      <\textbf{/choose}>
5214      <\textbf{out}>
5215        <\textbf{lu}>
5216           <\textbf{clip} pos="1" side="tl" part="lemh"/>
5217           <\textbf{clip} pos="1" side="tl" part="a_verb"/>
5218           <\textbf{lit-tag} v="inf"/>
5219           <\textbf{clip} pos="1" side="tl" part="lemq"/>
5220        <\textbf{/lu}>
5221     <\textbf{/out}>
5222   <\textbf{/action}>
5223 <\textbf{/rule}>
5224 \end{alltt}
5225 \end{small}
5226 \caption{Rule for the translation from Spanish into Catalan, which
5227   turns the verbs in simple perfect preterite tense (\emph{dije}) into
5228   the
5229   compound perfect preterite tense usual in Catalan (\emph{vaig dir}),
5230     and at the same time assigns the appropriate case state
5231   to the two resulting words.}
5232 \label{fig:case}
5233 \end{figure}
5234
5235
5236
5237 \subsubsection{Element for assignment \texttt{<let>}}
5238
5239 The assignment instruction \texttt{<\textbf{let}>} assigns the value
5240 of the right part of the assignment (a literal string, a
5241 \texttt{clip}, a variable, etc.) to the left part (a \texttt{clip}, a
5242 variable, etc.). An example of its use can be found in Figure
5243 \ref{fig:regla}.
5244
5245
5246
5247 \begin{figure}
5248 \begin{small}
5249 \begin{alltt}
5250 <\textbf{rule}>
5251   <\textbf{pattern}>
5252     <\textbf{pattern-item} n="det"/>
5253     <\textbf{pattern-item} n="nom"/>
5254   <\textbf{/pattern}>
5255   <\textbf{action}>
5256       <\textbf{choose}>
5257         <\textbf{when}>
5258           <\textbf{test}>
5259             <\textbf{and}>
5260               <\textbf{not}>
5261                 <\textbf{equal}>
5262                   <\textbf{clip} pos="2" side="tl" part="gen"/>
5263                   <\textbf{clip} pos="2" side="sl" part="gen"/>
5264                 <\textbf{/equal}>
5265               <\textbf{/not}>
5266               <\textbf{not}>
5267                 <\textbf{equal}>
5268                   <\textbf{clip} pos="2" side="tl" part="gen"/>
5269                   <\textbf{lit-tag} v="mf"/>
                <\textbf{/equal}>
5271               <\textbf{/not}>
5272               <\textbf{not}>
5273                 <\textbf{equal}>
5274                   <\textbf{clip} pos="2" side="tl" part="gen"/>
5275                   <\textbf{lit-tag} v="GD"/>
5276                 <\textbf{/equal}>
5277               <\textbf{/not}>
5278             <\textbf{/and}>
5279           <\textbf{/test}>
5280           <\textbf{let}>
5281             <\textbf{clip} pos="1" side="tl" part="gen"/>
5282             <\textbf{clip} pos="2" side="tl" part="gen"/>
5283           <\textbf{/let}>
5284         <\textbf{/when}>
5285       <\textbf{/choose}>
5286       <!-- Other gender and number agreement actions -->
5287 \end{alltt}
5288 \end{small}
5289 \caption{Extract from a rule for the pattern \texttt{determiner--noun}
5290   (continues in Fig. \ref{fig:regla2}): in this part of the rule, the
5291   gender of the noun is assigned to the determiner in the case that
5292   the gender of the noun changes from the SL (\texttt{sl}) to the TL
5293   (\texttt{tl}) during the lexical transfer process between both
5294   languages.}
5295 \label{fig:regla}
5296 \end{figure}
5297
5298 \subsubsection{Element for string concatenation \texttt{<concat>}}
5299
5300 This element is used to concatenate strings in order to assign them to
5301 a variable. It is used in combination with \texttt{<\textbf{let}>},
5302 and the previous value of the variable is lost with the assignment of
5303 \texttt{<\textbf{concat}>}.
5304
5305 It does not have any attribute. It can contain any instruction that
5306 delivers a string, such as \texttt{<\textbf{lit}>},
5307 \texttt{<\textbf{lit-tag}>} or \texttt{<\textbf{clip}>}.
5308
5309 Figure \ref{fig:concat} shows an example of its use.
5310
5311
5312 \begin{figure}
5313 \begin{small}
5314 \begin{alltt}
5315 <\textbf{let}>
5316   <\textbf{var} n="palabra"/>
5317     <\textbf{concat}>
5318        <\textbf{clip} pos="3" side="tl" part="lem"/>
5319        <\textbf{lit-tag} v="adj"/>
5320     <\textbf{/concat}>
5321 <\textbf{/let}>
5322 \end{alltt}
5323 \end{small}
5324 \caption{In this example, the variable \texttt{palabra} is assigned
5325 the value of the concatenation of a \texttt{clip} (the lemma in
5326 position 3) and the \emph{adj} tag.}
5327 \label{fig:concat}
5328 \end{figure}
5329
5330
5331
5332
5333 \subsubsection{Element for string concatenation \texttt{<append>}}
5334
5335 The \texttt{<\textbf{append}>} instruction can be used to save the
5336 output of an action before printing it in the corresponding
5337 \texttt{<\textbf{out}>}, if required by the designer of the transfer
5338 rules.
5339
5340 The mandatory attribute \texttt{n} specifies the name of the variable
5341 used. After applying the instruction, the previous content of the
5342 referred variable will be the prefix of the new content, that is, the
5343 new content inserted in the \texttt{<\textbf{append}>} will be
5344 concatenated to the pre-existing content of the variable specified in
5345 \texttt{n}.
5346
5347 The content of this instruction can be one or more of the following
5348 tags: \texttt{<\textbf{b}>}, \texttt{<\textbf{clip}>},
5349 \texttt{<\textbf{lit}>}, \texttt{<\textbf{lit-tag}>},
5350 \texttt{<\textbf{var}>}, \texttt{<\textbf{get-case-from}>},
5351 \texttt{<\textbf{case-of}>} or \texttt{<\textbf{concat}>}. There is an
5352 example of its use in Figure \ref{fig:append}.
5353
5354 \begin{figure}
5355 \begin{small}
5356 \begin{alltt}
5357 <\textbf{append} n="temporal">
  <\textbf{clip} pos="3" part="gen" side="tl"/>
5359 <\textbf{/append}>
5360 \end{alltt}
5361 \end{small}
\caption{In this example, the gender, in the TL, of the third word
matched by the rule is appended to the content of the variable
\texttt{temporal}.}
5365 \label{fig:append}
5366 \end{figure}
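
Since each \texttt{<\textbf{append}>} adds to what the variable
already holds, several instructions can build up a string step by
step. In the following sketch, the variable \texttt{temporal} is first
emptied with a \texttt{<\textbf{let}>} (an assumption about how one
would reset it), then receives the TL lemma of the third word followed
by an illustrative \texttt{adj} tag, and is finally sent out inside a
\texttt{<\textbf{lu}>}.

\begin{small}
\begin{alltt}
<\textbf{let}>
  <\textbf{var} n="temporal"/>
  <\textbf{lit} v=""/>
<\textbf{/let}>
<\textbf{append} n="temporal">
  <\textbf{clip} pos="3" side="tl" part="lem"/>
<\textbf{/append}>
<\textbf{append} n="temporal">
  <\textbf{lit-tag} v="adj"/>
<\textbf{/append}>
<\textbf{out}>
  <\textbf{lu}>
    <\textbf{var} n="temporal"/>
  <\textbf{/lu}>
<\textbf{/out}>
\end{alltt}
\end{small}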
5367
5368
5369
5370
5371 \subsubsection{Element for output \texttt{<out>}}
5372
\label{ss:out} The output instruction is used to specify the lexical
forms that are sent to the output of the module once the required
structural transfer operations have been applied. Its use differs
from module to module. On the one hand, its use in the
\texttt{chunker} module when it runs as the only module
(shallow-transfer) and its use in the \texttt{postchunk} module are
similar, since in both cases the output must be the input for the
generator.  On the other hand, the \texttt{chunker} in Apertium 2 and
the \texttt{interchunk} have different modes of use: the former
creates the chunks, and the latter modifies the chunks without
modifying their internal part.
5383
5384 \begin{enumerate}
5385
5386 \item \textbf{Use in \texttt{chunker} in shallow-transfer mode, and in
5387 \texttt{postchunk}}
5388
  The instruction sends each lexical form inside a
  \texttt{<\textbf{lu}>} element, which in turn can be contained inside a
  \texttt{<\textbf{mlu}>} element when the output is a multiword made
  of two or more LFs. The blanks or superblanks
  (\texttt{<\textbf{b}>}) between consecutive LFs are also sent. You
  can find an example of its use in Figures \ref{fig:case} and
  \ref{fig:regla2}.
5395
5396 \begin{figure}
5397 \begin{small}
5398 \begin{alltt}
5399     <!-- ... -->
5400     <\textbf{out}>
5401       <\textbf{lu}>
5402          <\textbf{clip} pos="1" side="tl" part="whole"/>
5403       <\textbf{/lu}>
5404       <\textbf{lu}>
5405          <\textbf{clip} pos="2" side="tl" part="whole"/>
5406       <\textbf{/lu}>
    <\textbf{/out}>
  <\textbf{/action}>
5410 <\textbf{/rule}>
5411 \end{alltt}
5412 \end{small}
\caption{Extract from a rule (continued from Fig. \ref{fig:regla}). At
  the end of the rule, and after different actions, the resulting data
  are sent by means of the value \texttt{whole} of the attribute
  \texttt{part}, which contains the lemma and the grammatical symbols
  of each LF (positions 1 and 2 in the pattern).}
5418 \label{fig:regla2}
5419 \end{figure}
5420
5421
5422 \item \textbf{Use in \texttt{chunker} in advanced mode}
5423
5424   The output of this module is expected to be a sequence of one or
5425   more chunks (sent inside a \texttt{<\textbf{chunk}>} element)
5426   separated by blanks \texttt{<\textbf{b}>}. Lexical forms and
5427   multiforms, as well as the blanks between them, are sent inside
  chunks. Figure \ref{fig:chunkintrachunk} shows an example of its
  use.
5430
5431
5432 \begin{figure}
5433 \begin{small}
5434 \begin{alltt}
5435 <\textbf{out}>
5436   <\textbf{chunk} name="pr" case="caseFirstWord">
5437     <\textbf{tags}>
5438       <\textbf{tag}><\textbf{lit-tag} v="PREP"/><\textbf{/tag}>
5439     <\textbf{/tags}>
5440     <\textbf{lu}>
5441       <\textbf{clip} pos="1" side="tl" part="whole"/>
5442     <\textbf{/lu}>
5443   <\textbf{/chunk}>
5444   <\textbf{b} pos="1"/>
5445   <\textbf{chunk} name="probj" case="caseOtherWord">
5446     <\textbf{tags}>
5447       <\textbf{tag}><\textbf{lit-tag} v="NP"/><\textbf{/tag}>
5448       <\textbf{tag}><\textbf{lit-tag} v="tn"/><\textbf{/tag}>
5449       <\textbf{tag}><\textbf{clip} pos="2" side="tl" part="pers"/><\textbf{/tag}>
5450       <\textbf{tag}><\textbf{clip} pos="2" side="tl" part="gen"/><\textbf{/tag}>
5451       <\textbf{tag}><\textbf{clip} pos="2" side="tl" part="nbr"/><\textbf{/tag}>
5452     <\textbf{/tags}>
5453     <\textbf{lu}>
5454       <\textbf{clip} pos="2" side="tl" part="lem"/>
5455       <\textbf{lit-tag} v="prn"/>
5456       <\textbf{lit-tag} v="2"/>
5457       <\textbf{clip} pos="2" side="tl" part="pers"/>
5458       <\textbf{clip} pos="2" side="tl" part="gen" link-to="4"/>
5459       <\textbf{clip} pos="2" side="tl" part="nbr" link-to="5"/>
5460     <\textbf{/lu}>
5461   <\textbf{/chunk}>
5462 <\textbf{/out}>
5463 \end{alltt}
5464 \end{small}
5465 \caption{Output instruction that sends two chunks separated by a
5466   blank. The printed sequence is a preposition followed by a noun
5467   phrase ("NP"). The tags that are linked from the second chunk to the outside are
5468   pronoun type ("tn"), gender and number of the noun phrase
5469   (pronoun). The \texttt{<\textbf{tag}>} elements are used to specify
5470   the tags of the chunk, and the value of the attributes \texttt{name}
5471   and \texttt{case} is used to specify the pseudolemma of the chunk.}
5472 \label{fig:chunkintrachunk}
5473 \end{figure}
5474
5475
5476 \item \textbf{Use in \texttt{interchunk}}
5477
5478   In this module, lexical forms (words) are inaccessible, since it is
5479   only possible to operate with chunks and, therefore, inside an
5480   \texttt{<\textbf{out}>} element you can only put
5481   \texttt{<\textbf{chunk}>} elements or blanks \texttt{<\textbf{b}>}.
5482   The information on lemma and tags specified here in a \texttt{<\textbf{chunk}>}
5483   element refers exclusively to the lemma (pseudolemma) and the tags of
5484   the chunk.
5485
5486 An example of its use can be found in Figure
5487 \ref{fig:chunkinterchunk}.
5488
5489 \begin{figure}
5490 \begin{small}
5491 \begin{alltt}
5492 <\textbf{out}>
5493   <\textbf{b} pos="1"/>
5494   <\textbf{chunk}>
5495     <\textbf{clip} pos="2" part="lem"/>
5496     <\textbf{clip} pos="2" part="tags"/>
5497     <\textbf{clip} pos="2" part="chcontent"/>
5498   <\textbf{/chunk}>
5499 <\textbf{/out}>
5500 \end{alltt}
5501
5502 \end{small}
\caption{The aim of this rule output is to discard the first chunk of
  the matched pattern (pronoun drop). The three
  \texttt{<\textbf{clip}>} elements have been included here for
  illustrative purposes, since they could have been replaced by a
  single \texttt{<\textbf{clip}>} with \texttt{part="whole"}, which
  would group them.}
5509 \label{fig:chunkinterchunk}
5510 \end{figure}
5511
5512
5513
5514 \end{enumerate}
5515
5516
5517
5518
5519 \subsubsection{Element for lexical unit \texttt{<lu>}}
5520
\label{ss:lu} This is the element by means of which each TL lexical
form is sent out at the end of a rule, inside an
\texttt{<\textbf{out}>} element. With this element, one can send the
whole lexical form, using the value \texttt{whole} of the attribute
\texttt{part} of a \texttt{<\textbf{clip}>}, or, if required, specify
its parts separately (lemma plus tags, indicated by means of
\texttt{<\textbf{clip}>} elements, literal strings
\texttt{<\textbf{lit}>}, tags \texttt{<\textbf{lit-tag}>},
variables \texttt{<\textbf{var}>}, and case information
[\texttt{<\textbf{get-case-from}>}, \texttt{<\textbf{case-of}>}]).
5530
5531
5532
Please note that, as has been explained before, in the case of
multiwords with \emph{split lemma} it is necessary to put the lemma
queue back \emph{after} the grammatical symbols of the inflected word
(or lemma head), because the \texttt{pretransfer} module has moved the
queue to put it before the grammatical symbols of the head.  This
repositioning is done here, inside the \texttt{<\textbf{lu}>} element,
using the values \texttt{lemh} and \texttt{lemq} of the attribute
\texttt{part} in a \texttt{<\textbf{clip}>}. The value \texttt{lemh}
refers to the lemma head, and \texttt{lemq} to the lemma
queue. As can be seen in Figure \ref{fig:case}, the \texttt{lemq}
part of a \texttt{<\textbf{clip}>} is placed after the lemma head and
all the grammatical symbols that follow it.  This rule would be
5545 suitable, for example, for the Spanish form \emph{eché de menos} ("I
5546 missed"), which has to be translated into Catalan as \emph{vaig trobar
5547 a faltar}. The attribute \texttt{a\_verb} which comes after
5548 \texttt{lemh} contains the grammatical symbol that describes the verb
5549 category (\emph{vblex}, \emph{vbser}, \emph{vbhaver} or \emph{vbmod}
5550 as applicable). Therefore, the last lexical form sent by this rule, in
5551 the case of \emph{vaig trobar a faltar}, would be, in the data stream:
5552 \begin{alltt} ^trobar<vblex><inf># a faltar\$ \end{alltt}
5553
5554 \noindent The number sign \texttt{\#} in the data stream corresponds
5555 to the \texttt{<\textbf{g}>} element in dictionaries, used to signal
5556 the position of the invariable part in a split lemma multiword.
5557
It is important to note that the attributes included in
\texttt{<\textbf{lu}>} may be empty. So, a verb matched by the rule in
Fig. \ref{fig:case} which is not a split lemma multiword will be sent
with an empty \texttt{lemq} attribute, since the verb does not have a
lemma queue. This way it is not necessary to define different rules
for lexical forms with and without queue. You can find another example
of this on page \pageref{regla_verbo1}, where the rule for verbs sends
in a \texttt{<\textbf{lu}>} the attributes \texttt{gen}
(\emph{gender}) and \texttt{nbr} (\emph{number}). This way, it
covers participles (with gender and number) and the rest of verb
forms (which will have these attributes empty).

On the same page you can see a rule for a verb followed by an enclitic
5571 pronoun. Here, the lemma queue is placed after the enclitic pronoun;
5572 so, for a split lemma multiword joined to an enclitic pronoun, such as
5573 \emph{echándote de menos}, the output in the data stream would be,
5574 when translating into Catalan:
5575
5576 \begin{alltt} ^trobar<vblex><ger>+et<prn><enc><p2><mf><sg># a faltar\$
5577 \end{alltt}
5578
5579 Of course, this rule works also for verbs which do not belong to this
5580 multiword type; so, the form \emph{explicándote} ("explaining to you")
5581 would be output, when translating from Spanish to Catalan:
5582
5583
5584 \begin{alltt} ^explicar<vblex><ger>+et<prn><enc><p2><mf><sg>\$
5585 \end{alltt}
5586
As for the value \texttt{whole} of the attribute \texttt{part} of a
\texttt{<\textbf{clip}>}, it
must be taken into account that it can be used to send the whole
lexical form only in the case that the sent word cannot be a
multiword, that is, cannot contain a split lemma.  Compare Figures
\ref{fig:case} and \ref{fig:regla2}. The \texttt{whole} value can
be used in the second example because it contains the lemma
\texttt{lem} plus all the morphological tags of the lexical forms in
positions 1 and 2 (determiner and noun). \nota{but nouns can also be mw now!}In contrast, in the first
example, the lexical form in \texttt{<\textbf{lu}>} is sent in parts,
with a \texttt{lemh} (lemma head) and a \texttt{lemq} (lemma queue),
since it may occur that the verb matched in the pattern is a multiword
with split lemma. In practice, in our system this means that the
\texttt{whole} value can be used to send any kind of lexical form
except verbs and nouns, because we defined multiwords with inner
inflection only for verbs and nouns.
5602
5603 \subsubsection{Element for lexical unit \texttt{<mlu>}}
5604 \label{ss:mlu}
5605
5606 Its name derives from \emph{multilexical unit}; it is used inside the
5607 \texttt{<\textbf{out}>} element to output multiwords consisting of
5608 more than one lexical form. Each lexical form in a
5609 \texttt{<\textbf{mlu}>} is sent inside a \texttt{<\textbf{lu}>}
5610 element. On the output of the module, lexical forms contained in this
5611 element will be joined to each other by the symbol '+' in the data
5612 stream. This means that they will become a multiword made of different
5613 lexical forms, which will be treated as a single unit by the
5614 subsequent modules; therefore, the generation dictionary will have to
5615 contain an entry for this multiword in order for it to be generated.
5616
5617 In our system, this element is used to join enclitic pronouns to
5618 conjugated verbs.
5619
5620 \subsubsection{Element for chunk encapsulation \texttt{<chunk>}}
5621
5622 This is the element in which chunks are sent, in an
5623 \texttt{<\textbf{out}>} element, on the output of the module.  It is
5624 only used in the \texttt{chunker} module in advanced mode, and in the
5625 \texttt{interchunk} module.  It is not used in the \texttt{postchunk}
module because its output does not contain any chunk. Nor is it
used in the \texttt{chunker} module in shallow-transfer mode, because
its output does not contain chunks but individual lexical units and
blanks.
5630
5631 \begin{enumerate}
5632
5633 \item \textbf{Use in \texttt{chunker} in advanced mode}
5634
5635
5636 In this mode, the \texttt{<\textbf{chunk}>} element must have an
5637 attribute \texttt{name}, which is the lemma of the chunk, or an
5638 attribute \texttt{namefrom} which refers to a variable previously
defined, whose value will be used as the lemma of the chunk. In
addition, it can include the attribute \texttt{case} to specify the
variable from which the case policy is taken (for example, a value
obtained with the instruction \texttt{<\textbf{case-from}>}).
5643
5644 An example of its use can be found in Figure
5645 \ref{fig:chunkintrachunk}.
5646
5647
5648 \item \textbf{Use in \texttt{interchunk}}
5649
  In this module, the \texttt{<\textbf{chunk}>} element does not
specify any attribute; it is used just as the \texttt{<\textbf{lu}>}
element is used in shallow-transfer mode or in the \texttt{postchunk}
module, to delimit the lexical forms.  The elements it sends are
(generally by means of a \texttt{<\textbf{clip}>} instruction): the
lemma of the chunk (\texttt{lem}), its tags (\texttt{tags}) and the
chunk content (\texttt{chcontent}, which contains lexical forms plus
blanks), which is an invariable part since it cannot be accessed from
the \texttt{interchunk} module.  The invariable part of the chunk is
sent at the end.  You can also use the \texttt{whole} attribute to
send the whole chunk (lemma, tags and invariable content).
5661
5662   An example of its use can be found in Figure
5663   \ref{fig:chunkinterchunk}.
5664
5665 \end{enumerate}
5666
5667 \subsubsection{Element for tag links section \texttt{<tags>}}
5668
5669 \textit{Only in chunker in advanced mode}.
5670
5671 This element is used to specify a list of tags, or
5672 \texttt{<\textbf{tag}>} elements, which will become the pseudotags of
the chunk. It does not have attributes, and must be included as the
first item inside the \texttt{<\textbf{chunk}>} element. See Figure
5675 \ref{fig:chunkintrachunk}.
5676
5677
5678 \subsubsection{Element for tag link \texttt{<tag>}}
5679
5680 \textit{Only in chunker in advanced mode}.
5681
5682 The \texttt{<\textbf{tag}>} element must contain a morphological tag,
5683 which can be specified by means of a \texttt{<\textbf{clip}>}
5684 instruction or a literal tag \texttt{<\textbf{lit-tag}>}. It does not
5685 have attributes.
5686
5687 The tag or tags specified this way in a chunk will become the
5688 grammatical symbols of the chunk; the next module,
5689 \texttt{interchunk}, will be able to use them to test and modify the
5690 characteristics of the chunks.
5691
5692
5693 \subsubsection{Element for blank \texttt{<b>}}
5694
5695 The \texttt{<\textbf{b}>} element refers to [super]blanks and is
5696 indexed by the attribute \texttt{pos}. For example, a
5697 \texttt{<\textbf{b}>} with \texttt{pos="2"} refers to the
5698 [super]blanks (including format data encapsulated by the de-formatter)
between the 2nd SLLF and the 3rd SLLF. The explicit management of
[super]blanks enables the correct placement of format when the result
of the structural transfer has more or fewer elements than the
original, or when it has been reordered in some way.
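
As a minimal sketch of this explicit management, the following
hypothetical \texttt{<\textbf{out}>} fragment swaps two words while
keeping the [super]blank that originally separated them between the
two output words. It assumes a two-word pattern in which neither word
is a multiword with split lemma, so that \texttt{whole} can be used:

\begin{alltt}
<out>
  <lu><clip pos="2" side="tl" part="whole"/></lu>
  <b pos="1"/>
  <lu><clip pos="1" side="tl" part="whole"/></lu>
</out>
\end{alltt}

Any format data encapsulated in the blank at position 1 thus stays
between the two words even though their order has changed.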
5703
5704
5705 \subsection{Specification of the three modules that build an advanced
5706 transfer system}
5707 \label{noutransfer}
5708
5709 In the following lines we describe the differences between the rule
5710 format in the three modules of an advanced transfer system. When
5711 Apertium works as a shallow-transfer system, the only module to be run
5712 is the first one, called \texttt{chunker}, which communicates directly
5713 with the generation module.
5714
5715
5716 \subsubsection{\texttt{Chunker} module}
5717 \label{ss:chunker}
5718
5719
5720 This module can be used alone as a shallow-transfer system, or in
5721 combination with the other two transfer modules to build an advanced
5722 transfer system. An attribute of the \texttt{<transfer>} element
5723 controls its run mode.
5724
5725 \paragraph{Input/output}
5726
5727 \begin{itemize}
5728 \item Input: data in the \texttt{pretransfer} output format, that is,
5729 with invariable queues of multiwords moved to the position right
5730 before the first grammatical symbol.
5731
5732 \item Output:
5733 \begin{itemize}
5734 \item[-] in advanced mode (in an advanced transfer system): chunks,
5735 that will be detected and processed by the next module
5736 \item[-] in shallow-transfer mode (in a shallow-transfer system):
5737 lexical forms, that will be the input of the generation module.
5738 \end{itemize}
5739
5740 \end{itemize}
5741
5742
5743 \paragraph{Data files}
5744
\nota{Explain the matter of the single configuration file better}

This program uses a single configuration file and a precompiled
pattern-detection file calculated from it. The configuration file
(the rules file) has the extension \texttt{.t1x}.  Since the
\texttt{chunker} is the program that looks up the bilingual
dictionary, the compiled dictionary also has to be provided to the
program.

\nota{It might be a good idea to mention the section that explains the
compiler referred to here}
5756
5757 The DTD of this data file is specified in Appendix
5758 \ref{ss:dtdtransfer}, and the elements used to create the rules in the
5759 file are described in Section \ref{formatotransfer}.
5760
5761 \paragraph{Pattern matching}
5762
The rule matching system in this module is the one described in
\ref{functransfer}, since it is the same in advanced transfer mode and
in shallow-transfer mode. The \texttt{a\-per\-tium-pre\-trans\-fer}
program \nota{Terminological inconsistency: \texttt{pretransfer}.}  is
needed to adapt the tagger output format to the input format required
by the transfer module.  In later versions of Apertium, the
\textit{part-of-speech tagger} may be modified so that it does the
work of \texttt{apertium-pretransfer}.
\nota{We also need to unify the terminology of other modules:
\emph{desambiguador categorial} (part-of-speech disambiguator),
\emph{etiquetador} (tagger); as the paragraph is written, one might
think that they are two different things.}
5774
5775
5776 \paragraph{How it works}
5777
5778 The module works similarly in shallow-transfer mode and in advanced
5779 mode, with these differences:
5780
5781 \begin{itemize}
\item For the module to work as the first module in an advanced
transfer system, the value \texttt{chunk} must be specified in the
optional attribute \texttt{default} of the root element
\texttt{<transfer>}. The default value is \texttt{lu}, which implies
that the \texttt{chunker} works in shallow-transfer mode (as a single
module).
5788
5789 \item Chunk generation in the output: the \texttt{<chunk>} tag is an
5790 element one level higher than \texttt{<lu>} (\textit{lexical unit}),
5791 which generates chunks with the characteristics described in
5792 \ref{sec:format}; it has the following attributes:
5793
5794   \begin{itemize}
5795   \item \texttt{name} (optional): pseudolemma of the chunk. It
5796   contains a string that is identified as the pseudolemma of the
5797   chunk.
5798
  \item \texttt{namefrom} (optional): pseudolemma of the chunk,
  obtained from a variable. Either \texttt{name} or \texttt{namefrom}
  must be specified.

  \item \texttt{case} (optional): variable from which the information
  on case is obtained, to be applied to the lemma specified in
  \texttt{name} or in \texttt{namefrom}.
5806   \end{itemize}
5807
5808 \item Each chunk begins with a \texttt{<tags>} instruction, which does
5809 not allow any attribute, and which contains one or more individual
5810 instructions \texttt{<tag>}.
5811 \item Instructions \texttt{<tag>} do not have attributes. They can
5812 include any instruction that returns a string as a value:
5813 \texttt{<lit>}, \texttt{<var>} \nota{clip, lit-tag}.
\item Instructions \texttt{<clip>} have an optional attribute,
\texttt{link-to}, which is used to output a tag \verb!<!\textit{value
of link-to}\verb!>! in place of \nota{Spanish: ``una etiqueta en
lugar de'' (instead of) or ``additionally''?. Explain new aspects of
link-to} the information specified by the rest of the attributes of
the \texttt{<clip>}.\nota{Not very understandable} This
information is dispensable but can be useful as a record of the
origin of the linguistic decision.
5822 \end{itemize}
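
The first point above can be sketched as follows; the contents of the
rules file are elided, and only the value of the \texttt{default}
attribute changes between the two modes:

\begin{alltt}
<!-- first module of an advanced transfer system -->
<transfer default="chunk">
  ...
</transfer>

<!-- single module of a shallow-transfer system
     (equivalent to omitting the attribute) -->
<transfer default="lu">
  ...
</transfer>
\end{alltt}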
5823
The following is an example of the use of the \texttt{<chunk>} element:
5825
5826 \begin{alltt}
5827 <out>
5828   <chunk name="adj-noun" case="variableCase">
5829     <tags>
5830       <tag><lit-tag v="NP"/></tag>
5831       <tag><clip pos="2" side="tl" part="gen"/></tag>
5832       <tag><clip pos="2" side="tl" part="nbr"/></tag>
5833     </tags>
5834     <lu>
5835       <clip pos="2" side="tl" part="lemh"/>
5836       <clip pos="2" side="tl" part="a_noun"/>
5837       <clip pos="2" side="tl" part="gen" link-to="2"/>
5838       <clip pos="2" side="tl" part="nbr" link-to="3"/>
5839     </lu>
5840     <b pos="1"/>
5841     <lu>
5842       <var n="adjectiu"/>
5843       <clip pos="1" side="tl" part="lem"/>
5844       <clip pos="1" side="tl" part="a_adj"/>
5845       <clip pos="2" side="tl" part="gen" link-to="2"/>
5846       <clip pos="2" side="tl" part="nbr" link-to="3"/>
5847     </lu>
5848   </chunk>
5849 </out>
5850 \end{alltt}
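
As an illustration only, suppose that this rule is applied to an
adjective followed by a noun, that the target-language words are
\emph{blanco} and \emph{casa}, that the attributes \texttt{a\_noun}
and \texttt{a\_adj} select the tags \texttt{<n>} and \texttt{<adj>},
that the noun is feminine singular, and that the variable
\texttt{adjectiu} is empty. Under these assumptions, the chunk written
to the output would look roughly like this:

\begin{alltt}
^adj-noun<NP><f><sg>\verb!{!^casa<n><2><3>$ ^blanco<adj><2><3>$\verb!}!$
\end{alltt}

Here \texttt{<2>} and \texttt{<3>} are the references produced by
\texttt{link-to}: they point to the second and third tags of the chunk
(gender and number) instead of repeating their values inside each
lexical unit.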
5851
5852
5853 \paragraph{Default action}
5854
Isolated \textit{superblanks} that are not detected by any pattern in
this module are written in the same order in which they arrive.
5857
5858 The default action for words not matched by any pattern
5859 is different depending on the transfer mode (that is, on the value of the
5860 optional attribute \texttt{default} of the root element \texttt{<transfer>}):
5861
5862 \begin{itemize}
5863 \item if the value is \texttt{chunk} (i.e. the module works in advanced
5864   mode): it will generate trivial chunks with the words not matched by
5865   any rule, so that in the output there are no words not included in a
5866   chunk.  The new chunk will be created with the translation of the
5867   word by the bilingual dictionary.  The fixed lemma of these
5868   implicitly created chunks is \texttt{default}.
\item if the value is \texttt{lu} (default value; i.e. the module works as the single
module in a shallow-transfer system): it will not create chunks for
words not matched by rules; they will just be translated using the
bilingual dictionary.
5873
5874 \end{itemize}
5875
5876 The following is an automatically generated chunk for a lexical form
5877 not matched by any rule in the \texttt{chunker} module when the
5878 \texttt{default} attribute has the value \texttt{chunk}:
5879
5880
5881 \begin{alltt}
5882 ^default\verb!{!^that<cnjsub>$\verb!}!$
5883 \end{alltt}
5884
\nota{Does it go without tags between \texttt{default} and \texttt{\{}?
Shouldn't this be stated explicitly?}
5887
5888
5889 \subsubsection{\texttt{Interchunk} module}
5890 \label{ss:interchunk}
5891
5892
5893 \nota{\texttt{apertium-interchunk} or simply \texttt{interchunk}?}
5894
The \texttt{interchunk} module processes chunks; it may reorder them
and change their morphosyntactic information. This is done by
detecting patterns of chunks (sequences of chunks).  The instructions
that control how it works are, with minor differences, the same as
those used by the \texttt{chunker} module; they are written, however,
in a different file. Chunks are processed here in much the same way as
words are processed in the \texttt{chunker} of Apertium.  \nota{Check
the names of the programs}
5903
5904 \paragraph{Input/output}
5905
5906 \begin{itemize}
5907 \item Input: chunks from the \texttt{chunker}.
\item Output: chunks possibly reordered and with the data on their
pseudolemmas (lexical pseudoforms) possibly changed.
5910 \end{itemize}
5911
5912 \paragraph{Data files}
5913
This module uses two data files: a specification file for the
\texttt{in\-ter\-chunk} program, with extension \texttt{.t2x} by
analogy with the previous module, and a file of precalculated patterns
that speeds up the analysis of the input.  The binary file of the
bilingual dictionary is not included because it is not used.
\nota{Cite the compiler?}
5920
5921 The syntax of the specification file is very similar to that of the
5922 \texttt{chunker}. Its DTD is specified in Appendix
5923 \ref{ss:dtdinterchunk}, and the elements used to create the rules in
5924 the file are described in Section \ref{formatotransfer}.
5925
5926
5927 \paragraph{Pattern matching}
5928
Rules detect patterns defined by sequences of lexical
pseudoforms. These lexical pseudoforms have a format based on the
format of lexical forms for words. In practice, for the purposes of
pattern matching, a lexical pseudoform is treated here \nota{mlforcada:
The alternation between \emph{pseudolema} and \emph{pseudoparaula}
must be resolved. MG: I have translated everything as 'lexical
pseudoform'; I think that was the intended meaning.} just as a lexical
form is treated in the \texttt{chunker}.  Pattern matching is
therefore based on the attributes defined for lexical pseudoforms, not
on those of the lexical forms (words) of the original pattern.
5939
5940 \paragraph{How it works}
5941
5942 With regard to the set of instructions used in \texttt{chunker}, the
5943 changes on the set of instructions for this module are the following:
5944
5945 \begin{itemize}
5946 \item The root element is called \texttt{<interchunk>} and does not
5947 have any attribute.
\item The attribute \texttt{side} disappears: this module does not use
bilingual dictionaries; therefore, the attribute used to indicate
whether the consulted side is SL or TL loses its meaning. This
attribute was basically used in the \texttt{<out>} instructions.
5952 \item The \texttt{<chunk>} tag is used here without attributes, simply
5953 inside an \texttt{<out>} to delimit the output of chunks.
5954 \item The predefined attribute \texttt{lem} refers to the pseudolemma
5955 of the chunk. In the same way, the predefined attribute \texttt{tags}
5956 refers to the grammatical symbols or tags of the chunk. The chunk
5957 content becomes something like a queue which can be printed with the
implicit attribute \texttt{chcontent}.\nota{Only printed, or can it
also be referenced?}  \nota{State which element these attributes
belong to}
5961 \item All the values of \texttt{part}, except \texttt{chname}, access
5962 the pseudolemma and the tags of the chunk (not of individual words).
\item Unlike in the \texttt{chunker} module, the \texttt{<out>}
instructions in the rules of this module may output nothing other than
\texttt{<chunk>}s; isolated words are never
allowed.\nota{MG: and blanks too, right?}
5967 \end{itemize}
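
A minimal sketch of an \texttt{<out>} instruction in this module,
under the assumption that a rule has matched two chunks and simply
swaps them, could look as follows. Note the absence of the
\texttt{side} attribute and of any attribute in \texttt{<chunk>}; the
use of \texttt{<b>} between the two chunks is an assumption of this
example:

\begin{alltt}
<out>
  <chunk>
    <clip pos="2" part="lem"/>
    <clip pos="2" part="tags"/>
    <clip pos="2" part="chcontent"/>
  </chunk>
  <b pos="1"/>
  <chunk>
    <clip pos="1" part="lem"/>
    <clip pos="1" part="tags"/>
    <clip pos="1" part="chcontent"/>
  </chunk>
</out>
\end{alltt}

In each chunk, the invariable content (\texttt{chcontent}) is sent
last, after the pseudolemma and the tags.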
5968
5969
5970 \paragraph{Default action}
5971
As in the previous module, a default action has been defined, which
writes without modifications the chunks not matched by any pattern of
the specification file. This default action writes exactly what it
reads, be it chunks or blanks.  \nota{Watch the inconsistency between
\emph{rule} and \emph{action} in the rest of the document. I had
always thought that \emph{rule}=\emph{pattern}+\emph{action}.}
5978
5979
5980 \subsubsection{\texttt{Postchunk} module}
5981 \label{ss:postchunk}
5982
The \texttt{postchunk} module detects single chunks and, for each of
them, performs the specified actions. Detection is based on the lemma
of the chunk, not on patterns of tags; detection in this module is
therefore specific to each chunk ``name''.\nota{Once the terminology
is settled, we must make sure that the wording of this part is
adequate.}
5989
5990
Moreover, processing in rules relies on the fact that references to
parameters are resolved right after detection, that is, the tags
\texttt{<1>}, \texttt{<2>}, etc. are automatically replaced by the
value of the parameters before the processing begins. Positions
(attribute \texttt{pos}) specified in instructions such as
\texttt{<clip>} refer to the position of the words inside the chunk.

The case policy is also applied automatically (see Section
\ref{ss:majuscules}) from the pseudolemma of the chunk to the words
inside the chunk.
6002
6003
6004
6005 \paragraph{Input/output}
6006
6007 \begin{itemize}
6008 \item Input: chunks from the \texttt{in\-ter\-chunk}.
6009 \item Output: valid input for the morphological generator of Apertium.
6010 \end{itemize}
6011
6012 \paragraph{Data files}
6013
This program has its own specification file, which has the
extension \texttt{.t3x}. Its syntax is likewise based on that of the
\texttt{chunker} and the \texttt{in\-ter\-chunk}.  \nota{Explain that
it does not need to read any compiled pattern file because it uses
names rather than patterns?}
6019
6020 \paragraph{Pattern matching}
6021
6022 Chunk matching is based on the name of the chunk. Unmatched chunks
6023 receive the default processing.
6024
6025 \paragraph{How it works}
6026
6027 The differences with regard to the \texttt{in\-ter\-chunk} module are
6028 the following:
6029
6030 \begin{itemize}
\item Chunks (\texttt{<chunk>}) may not be written to the
  output: only lexical units (\texttt{<lu>} or \texttt{<mlu>}) and
  blanks can be written.  \nota{Check this item, because it was
  incomplete and mlf completed it}
\item New detection attribute \texttt{name} in \texttt{<cat-item>},
which is used on its own in the \texttt{<pattern>} part of rules to
force chunk detection based on its name.  \nota{mlf: What does
``de manera aïllada'' (in isolation) mean? It seems to mean ``now and
then''. MG: the attribute 'name' is used in the pattern part of rules?
is this correct?}
\item The attribute \texttt{side} is not used here either, as in the
\texttt{in\-ter\-chunk}, for the same reason: the bilingual dictionary
is not looked up.  \nota{MG: but then this is not a difference with
respect to \texttt{interchunk}, is it?}
6045 \end{itemize}
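
As a hedged sketch of the second point, a chunk whose name is
\texttt{adj-noun} could be detected and processed as shown below. The
chunk name, the category name \texttt{adj\_noun} and the action itself
are hypothetical; the element names are those described in Section
\ref{formatotransfer}, and the example assumes that neither word in
the chunk has a split lemma, so that \texttt{whole} can be used:

\begin{alltt}
<def-cat n="adj_noun">
  <cat-item name="adj-noun"/>
</def-cat>
...
<rule>
  <pattern>
    <pattern-item n="adj_noun"/>
  </pattern>
  <action>
    <out>
      <lu><clip pos="1" part="whole"/></lu>
      <b pos="1"/>
      <lu><clip pos="2" part="whole"/></lu>
    </out>
  </action>
</rule>
\end{alltt}

Here \texttt{pos} refers to the position of the words inside the
matched chunk, and no \texttt{side} attribute is given.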
6046
6047 \paragraph{Default action}
6048
6049
In this module, the default action is to write the words contained in
the chunks, replacing the references with the parameters of the
chunk. It will be applied to most chunks, since this module is
expected to perform non-default actions only in specific cases
requiring some special processing.

The case policy is also applied by default (see Section
\ref{ss:majuscules}).

In any case, blanks outside chunks are copied in the same order as
they are read, since chunk matching is done individually (this module
does not group chunks).
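
The default action can be pictured with a worked example. Suppose,
purely for illustration, that the module receives the following chunk,
in which the tags \texttt{<2>} and \texttt{<3>} are references to the
second and third tags of the chunk:

\begin{alltt}
^adj-noun<NP><f><sg>\verb!{!^casa<n><2><3>$ ^blanco<adj><2><3>$\verb!}!$
\end{alltt}

If no rule matches the name \texttt{adj-noun}, the words are written
with each reference replaced by the corresponding parameter (here
\texttt{<f>} and \texttt{<sg>}), which yields valid input for the
morphological generator:

\begin{alltt}
^casa<n><f><sg>$ ^blanco<adj><f><sg>$
\end{alltt}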
6062
6063
6064
6065
6066 \subsection{Preprocessing of the structural transfer module}
6067 \label{ss:preproceso_transfer}
6068
Specification files for the structural transfer modules, also called
\emph{transfer rules files}, are pre-processed by the program
\textit{apertium-preprocess-transfer}, which calculates the patterns
needed to match the preconditions of the rules and indexes the rules
to speed up their processing at run time.  This information is saved
in a binary file which is read together with the bilingual dictionary
and the rules file itself, because the structural transfer and lexical
transfer modules are executed together.
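
For instance, assuming a rules file called \texttt{rules.t1x} (both
file names here are hypothetical), the binary file would typically be
produced with a call of the following form, in which the first
argument is the rules file and the second is the binary file to be
written:

\begin{alltt}
apertium-preprocess-transfer rules.t1x rules.t1x.bin
\end{alltt}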
6077
6078
6079 \section{De-formatter and re-formatter}
6080 \label{se:desformat}
6081
6082
6083 \subsection{Format processing}
6084 \label{ss:formato}
6085
6086 This section describes how the de-formatter and re-formatter process
6087 the format of the documents. These two modules are created from a set
6088 of format specification rules in XML, which are described in Section
6089 \ref{ss:reglasformato}.
6090
6091
6092 Apertium can process documents in XML, HTML, RTF and plain text. For
6093 all these document types, format is \textit{encapsulated} as explained
6094 in the following lines.
6095
6096 Text strings that are identified as part of the format ---from now on
6097 referred to as \textit{blocks of format} or \textit{superblanks}---
6098 are encapsulated between delimiters that depend on the specification
6099 of the data flow between modules (which is described in detail in
6100 Section~\ref{se:flujodatos}); so, in the flow format (sections
6101 \ref{se:noxml1} and \ref{se:noxml2}), \emph{superblanks} are put
6102 between brackets '\texttt{[}' and '\texttt{]}'.  Each of these
encapsulated strings will be treated as if it were a blank
6104 \texttt{<\textbf{b}/>} (page~\pageref{s3:b}) ---that is why they are
6105 called \textit{superblanks}--- and will be restored in the correct
6106 order in the translator's output.
6107
6108 As has been explained in Section \ref{se:flujodatos}, when the blocks
6109 of format are large (as is sometimes the case in HTML with Javascript
6110 code fragments, or in RTF with bitmap images), these blocks will be
6111 saved as temporary files so that they can be removed from the data
6112 flow of the translation.
6113
6114 Sometimes, the format in a document can implicitly indicate the
6115 division of the text into sentences (see page \pageref{finfrase} in
6116 Section \ref{se:flujodatos}). For example, section or document titles
6117 can be a sentence without full stop.  If we know that a format mark is
6118 indicating this division, we have to take advantage of this
6119 information in order to do a better translation.  Some examples of
6120 format that give us data about the end of a sentence are: two
6121 consecutive line breaks in plain text format, a \texttt{</h1>} tag in
6122 HTML, etc. The de-formatter generates in such cases a mark of sentence
6123 end that is equivalent to a full stop.
6124
6125 \subsubsection{Format encapsulation method}
6126
6127 The types of blocks of format or \emph{superblanks} that can be
6128 generated as a result of the format processing are the following:
6129
6130 \begin{itemize}
6131 \item \textit{Non-empty blocks of format or superblanks}.  They
6132 contain exclusively format marks of the source document. In the data
flow described in Section~\ref{se:flujodatos}, they begin with a left
6134 square bracket '\texttt{[}' and end with a right square bracket
6135 '\texttt{]}'.
6136 \item \textit{Blocks of format with reference to an external file} or
6137 \textit{extensive superblanks}.  They encapsulate long format fragments
6138 in a way that improves the translator's performance. In the data flow
6139 described in Section~\ref{se:flujodatos}, they begin with the
6140 characters '\texttt{[@}', then there is the name of the file where the
6141 format fragment extracted from the source text is saved, and finally
6142 they end with a right square bracket '\texttt{]}'.
6143 \item \textit{Empty blocks of format}. They contain artificial
6144 information on text division obtained from the format data.  Before
6145 the empty block of format, the system places the appropriate
6146 artificial punctuation mark.  When the original format is restored in
6147 the document at the end of the process, the presence of a block of
6148 format like this will cause the deletion of the character right before
6149 the block in the data flow.
6150 \end{itemize}
6151
6152 %% [movido al apéndice]
6153 %% Dentro de los bloques de formato, los caracteres '\texttt{[}', '\texttt{]}',
6154 %% '\texttt{@}' y '\verb!\!' se escapan mediante las secuencias de escape
6155 %% '\verb!\[!', '\verb!\]!', '\verb!\@!' y '\verb!\\!', respectivamente.  Esto
6156 %% hay que tenerlo en cuenta para encapsular y desencapsular.  En el exterior de
6157 %% los bloques de formato es necesario también escapar los corchetes de apertura
6158 %% y cierre.
6159
6160 The general criteria applied to the creation of blocks of format are
6161 the following:
6162
6163 \label{pg:criteri}
6164 \begin{itemize}
6165 \item Everything that is considered not to be part of the text to be
6166 translated, has to be encapsulated in blocks of format.
\item There cannot be two or more strictly consecutive non-empty
6168 blocks of format.  Two consecutive blocks of format must be merged
6169 into a single block.
6170 \item Empty blocks of format must precede a non-empty block of format
6171 or the end of the file.
6172 \end{itemize}
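
As a small illustration of the second criterion, using hypothetical
HTML tags in the flow format not based on XML, the de-formatter would
not produce the first line below but the second, in which the two
adjacent blocks have been merged into one:

\begin{alltt}
some[<b>][<i>]text     (two consecutive non-empty blocks: not generated)
some[<b><i>]text       (a single merged block)
\end{alltt}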
6173
Figure~\ref{fg:ejemplopelado} shows an example document whose format
must be processed before translation; the encapsulation corresponds to
the flow format not based on XML. Figure~\ref{fg:ejemploencapsulado}
displays the result of processing this document.
6179
6180
6181
6182 \begin{figure}[htbp]
6183 \begin{small}
6184 \begin{alltt}
6185 <html>
6186 <head>
6187 <title>This is the title</title>
6188 <script>
6189 <!-- ... (an extensive code block) -->
6190 </script>
6191 </head>
6192 <body>
6193 <p>This
6194 is a paragraph in two lines</p>
6195 </body>
6196 </html>
6197 \end{alltt}
6198 \end{small}
6199 \caption{Example of HTML document}
6200 \label{fg:ejemplopelado}
6201 \end{figure}
6202
6203 \begin{figure}[htbp]
6204 \begin{small}
6205 \begin{alltt}
6206 \textbf{[<html>
6207 <head>
6208 <title>]}This is the title\textbf{.[][@/tmp/temp35345]}This\textbf{[
6209 ]}is a paragraph in two lines\textbf{.[][</p>
6210 </body>
6211 </html>]}
6212 \end{alltt}
6213 \end{small}
6214 \caption{Example of HTML document where the blocks of format have been
6215   encapsulated by the de-formatter}\nota{repeteix coses capítol format
6216   -revisar -Gema}
6217 \label{fg:ejemploencapsulado}
6218 \end{figure}
6219
The following points from this example are worth emphasizing:
6221 \begin{itemize}
6222 \item The system does not generate consecutive blocks of format with
6223 content (non-empty).
6224 \item Tags like \texttt{</\textbf{title}>} or \texttt{</\textbf{p}>}
6225 cause the insertion of an artificial punctuation mark; this insertion
6226 is done systematically, even when it is not necessary, because it does
6227 not interfere and is efficient.
\item Extensive superblanks are literally removed from the translation
process. In this case, the temporary file \texttt{temp35345} contains
the tags from \texttt{</\textbf{title}>} to \texttt{<\textbf{p}>}.
\item Simple blanks between words are not encapsulated.  The system
does, however, encapsulate multiple blanks (two or more consecutive
blanks), tabs, etc. Line breaks are also encapsulated.
6234 \end{itemize}
6235
6236
6237
6238
6239
6240
6241 \subsection{Data: format specification rules}
6242 \label{ss:reglasformato} This section describes how the de-formatter
6243 and re-formatter are generated from a format specification in XML.
6244
6245
Rules for format, like linguistic data, are specified in XML, and they
contain regular expressions with \texttt{flex} syntax.  The
specification is divided into three parts (see its DTD in Appendix
\ref{ss:dtd_formato}):
6250
6251 \begin{itemize}
6252 \item \textbf{Configuration options}. Here one specifies the value for
6253 the maximum length of a non-extensive superblank, the input and output
6254 encodings, whether case must be considered, and the regular expressions for
6255 escape characters and space characters.
6256
\item \textbf{Format rules}. These describe the set of tags belonging
to a specific format which have to be included in a block of format by
the de-formatter. These tags may, optionally, indicate a sentence end,
in which case the de-formatter will insert an artificial punctuation
mark (followed by an empty block of format, as explained in the
previous section). The priority of application of the rules has to be
specified, although, when this is not relevant, it is possible to give
the same priority to all the rules by assigning them the same value
(any number).
6266
6267   Everything that is not specified as format will be left without
6268   encapsulation and, therefore, will be considered as translatable
6269   text.
6270
\item \textbf{Replacement rules}. These allow special characters in
the text to be replaced. A regular expression will recognize \nota{MG:
HELP: in Spanish, "recogerá", I don't know how to translate this:
include/detect/group/recognize???} a set of special characters, which
will be replaced with the specified characters.  For example, in HTML,
the characters specified in hexadecimal have to be replaced with the
corresponding entity or ASCII character; for instance,
\texttt{cami\&oacute;n} corresponds to \texttt{camión}.
6279 \end{itemize}
6280
6281 Rules are described in more detail next.
6282 \begin{itemize}
6283 \item Root of the specification file. The attribute \texttt{name}
6284 contains the name of the format.
6285 \begin{small}
6286 \begin{alltt}
6287 <?xml version="1.0" encoding="ISO-8859-1"?>
6288 <format name="html">
6289   <options>
6290   ...
6291   </options>
6292
6293   <rules>
6294   ...
6295   </rules>
6296 </format>
6297 \end{alltt}
6298 \end{small}
6299
6300 \end{itemize}
6301
6302 It has to include the options and rules, an example of which is
6303 presented next:
6304
6305 \begin{itemize}
6306
6307 \item Options.
6308 \begin{small}
6309 \begin{alltt}
6310   <options>
6311     <largeblocks size="8192"/>
6312     <input encoding="ISO-8859-1"/>
6313     <output encoding="ISO-8859-1"/>
6314     <escape-chars regexp='[\verb!\![\verb!\!]^\$\verb!\!\verb!\!]'/>
6315     <space-chars regexp='[ \verb!\!n\verb!\!t\verb!\!r]'/>
6316     <case-sensitive value="no"/>
6317   </options>
6318 \end{alltt}
6319 \end{small}
6320
6321 \end{itemize}
6322
6323 The element \texttt{<largeblocks>} specifies the maximum length of a
6324 non-extensive superblank, through the value of the attribute
6325 \texttt{size}.  The elements \texttt{<input>} and \texttt{<output>}
6326 specify the input and output encoding of the text, through the
6327 attribute \texttt{encoding}.
6328
The element \texttt{<escape-chars>} specifies, by means of a regular
expression declared in the value of the attribute \texttt{regexp},
which characters must be escaped with a backslash.  The element
\texttt{<space-chars>} specifies the set of characters that must be
considered as blanks.

Finally, the element \texttt{<case-sensitive>} specifies whether case
is relevant in the specifications of format attributes in which
regular expressions are contained.
6338
6339
6340 \begin{itemize}
6341 \item Rules. There are format rules and replacement rules.
6342 \begin{small}
6343 \begin{alltt}
6344   <rules>
6345     <format-rule ... >
6346       ...
6347     </format-rule>
6348     ...
6349
6350     <replacement-rule>
6351       ...
6352     </replacement-rule>
6353     ...
6354   </rules>
6355 \end{alltt}
6356 \end{small} The two types are described in the following points.
6357
\item Format rules. The de-formatter encapsulates in format blocks
the tags matched by the \texttt{regexp} attribute of these rules. If
a begin tag and an end tag delimit a region that is entirely format,
a \texttt{regexp} has to be specified both for a \texttt{<begin>}
element and for an \texttt{<end>} element:
6363 \begin{small}
6364 \begin{alltt}
6365     <format-rule eos="no" priority="1">
6366       <begin regexp='"\verb!\!\&lt;!--"'/>
6367       <end regexp='"--\verb!\!\&gt;"'/>
6368     </format-rule>
6369 \end{alltt}
\end{small} Otherwise, a single \texttt{<begin-end>} element is used:
6371 \begin{small}
6372 \begin{alltt}
6373     <format-rule eos="yes" priority="3">
6374       <begin-end regexp='"\&lt;"[/]?"li"[^\&gt;]*"\&gt;"'/>
6375     </format-rule>
6376 \end{alltt}
6377 \end{small}
6378
6379
In addition, the attribute \texttt{priority} tells the system in
which order the format rules must be applied (the absolute value is
not relevant, only the order resulting from the values). The
attribute \texttt{eos} indicates, with \texttt{yes} or \texttt{no},
whether the format block that contains the detected pattern must be
preceded by an artificial punctuation mark.\footnote{In all these
cases, the standard entities \texttt{\&lt;} and \texttt{\&gt;} are
used to represent the characters \texttt{<} and \texttt{>},
respectively.}
6389
\item Replacement rules. These are used to replace special characters
in the text. The regular expression in the attribute \texttt{regexp}
recognizes a set of special characters, which are replaced with the
specified characters in the text to be translated.  The
correspondence between original and replacement characters is stated
in the attributes \texttt{source} and \texttt{target} of the
\texttt{<replace>} elements, of which there can be several:
6398 \begin{small}
6399 \begin{alltt}
6400     <replacement-rule regexp='"\&amp;"[^;]+;'>
6401       <replace source="\&amp;Agrave;" target="À"/>
6402       <replace source="\&amp;#192;" target="À"/>
6403       <replace source="\&amp;#xC0;" target="À"/>
6404       <replace source="\&amp;#xc0;" target="À"/>
6405       <replace source="\&amp;Aacute;" target="Á"/>
6406       <replace source="\&amp;#193;" target="Á"/>
6407       <replace source="\&amp;#xC1;" target="Á"/>
6408       <replace source="\&amp;#xc1;" target="Á"/>
6409       ...
6410     </replacement-rule>
6411 \end{alltt}
6412 \end{small}
6413 \item Regular expressions of \texttt{regexp} attributes. They have the
6414 syntax used in \texttt{flex} \cite{lesk75tr}.
6415
6416 \end{itemize}
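
The interplay between \texttt{priority} and \texttt{eos} can be
pictured with the following Python sketch. It is only an approximation
under stated assumptions: the patterns are transcribed by hand into
the syntax of Python's \texttt{re} module instead of \texttt{flex}
syntax, each rule is reduced to a single pattern for a whole tag, and
the names \texttt{RULES} and \texttt{classify} are hypothetical. Among
the rules that match, the one with the smallest priority value is
chosen.

\begin{small}
\begin{verbatim}
import re

# Hypothetical rule table: (priority, eos, pattern for a whole tag).
# Matching ignores case, as with <case-sensitive value="no"/>.
RULES = [
    (4, False, re.compile(r"<[a-zA-Z][^>]*>")),
    (1, False, re.compile(r"<!--.*?-->", re.DOTALL)),
    (3, True, re.compile(r"</?li[^>]*>", re.IGNORECASE)),
]


def classify(tag):
    """Return (priority, eos) of the first matching rule, or None.

    Rules are tried in increasing order of their priority value.
    """
    for priority, eos, pattern in sorted(RULES, key=lambda rule: rule[0]):
        if pattern.fullmatch(tag):
            return priority, eos
    return None
\end{verbatim}
\end{small}

In this sketch an \texttt{<li>} tag is assigned to the rule of
priority 3 (with \texttt{eos="yes"}) even though the general rule of
priority 4 also matches it.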
6417
% DTD moved to the Appendix
6419
6420
As an example of a format specification, we give the one for HTML. The
explanation in the following paragraphs refers to
Figure~\ref{fg:formato-html}.
6424
6425
First, there is the format rule that covers HTML tags in a general
way: it treats as an HTML tag everything that begins with the sign
\textbf{\texttt{<}} and ends with the sign \textbf{\texttt{>}}. This
rule has the lowest priority (4), so that the more specific,
higher-priority rules are tried before it. In the case of HTML, the
highest priority (1) is for comments \texttt{<!-- ... -->}.  The
begin and end marks \texttt{<script> </script>} and \texttt{<style>
</style>}, where everything enclosed between them is considered to be
format, have priority 2.  Priority 3 is for tags that indicate end of
sentence (artificial punctuation), such as \texttt{<br>},
\texttt{<hr>}, \texttt{</p>}, etc.
6439
Last come the replacement rules, which apply to all the codes that
begin with \texttt{\&} and end with \texttt{;}, as specified in the
regular expression. Each replacement is then defined:
\texttt{\&Agrave;}, as well as \texttt{\&\#192;}, \texttt{\&\#xC0;}
and \texttt{\&\#xc0;}, are replaced with \texttt{À}. The remaining
special characters are declared in the same way.
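
The effect of such a replacement rule can be sketched in Python as
follows. This is an illustration, not the code generated by the
system: the names \texttt{ENTITY}, \texttt{REPLACEMENTS} and
\texttt{apply\_replacements} are hypothetical, the pattern is a
transcription of the \texttt{regexp} attribute into the syntax of
Python's \texttt{re} module, and the choice to leave undeclared codes
untouched is an assumption of the sketch.

\begin{small}
\begin{verbatim}
import re

# "&", then one or more characters other than ";", then ";".
ENTITY = re.compile(r"&[^;]+;")

# source -> target pairs, as in the <replace> elements.
# "\u00c0" is the capital A with grave accent.
REPLACEMENTS = {
    "&Agrave;": "\u00c0",
    "&#192;": "\u00c0",
    "&#xC0;": "\u00c0",
    "&#xc0;": "\u00c0",
}


def apply_replacements(text):
    """Replace each declared code; leave undeclared codes untouched."""
    return ENTITY.sub(
        lambda m: REPLACEMENTS.get(m.group(0), m.group(0)), text)
\end{verbatim}
\end{small}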
6446
6447
6448
6449 \begin{figure}[htbp]
6450 \begin{small}
6451 \begin{alltt}
<?xml version="1.0" encoding="ISO-8859-1"?>
<format name="html">
  <options>
    <largeblocks size="8192"/>
    <input encoding="ISO-8859-1"/>
    <output encoding="ISO-8859-1"/>
    <escape-chars regexp='[\verb!\![\verb!\!]^\$\verb!\!\verb!\!]'/>
    <space-chars regexp='[ \verb!\!n\verb!\!t\verb!\!r]'/>
    <case-sensitive value="no"/>
  </options>

  <rules>
    <format-rule eos="no" priority="1">
      <begin regexp='"\&lt;!--"'/>
      <end regexp='"--\&gt;"'/>
    </format-rule>

    <format-rule eos="no" priority="2">
      <begin regexp='"\&lt;script"[^\&gt;]*"\&gt;"'/>
      <end regexp='"\&lt;/script"[^\&gt;]*"\&gt;"'/>
    </format-rule>
    <format-rule eos="no" priority="2">
      <begin regexp='"\&lt;style"[^\&gt;]*"\&gt;"'/>
      <end regexp='"\&lt;/style"[^\&gt;]*"\&gt;"'/>
    </format-rule>

    <format-rule eos="yes" priority="3">
      <begin-end regexp='"\&lt;"[/]?"br"[^\&gt;]*"\&gt;"'/>
    </format-rule>
    <!-- Here come more declarations of format-rule eos="yes" -->
    <!-- ...                                                  -->

    <format-rule eos="no" priority="4">
      <begin-end regexp='"\&lt;"[a-zA-Z][^\&gt;]*"\&gt;"'/>
    </format-rule>

    <replacement-rule regexp='"\&amp;"[^;]+;'>
      <replace source="\&amp;Agrave;" target="À"/>
      <replace source="\&amp;#192;" target="À"/>
      <replace source="\&amp;#xC0;" target="À"/>
      <replace source="\&amp;#xc0;" target="À"/>
      <!-- Here come more replace elements -->
      <!-- ...                             -->
    </replacement-rule>
  </rules>
</format>
6498 \end{alltt}
6499 \end{small}
6500 \caption{Part of the rules definition for HTML format}
6501 \label{fg:formato-html}
6502 \end{figure}
6503
6504
6505 \subsection{Generation of de-formatters and re-formatters}
6506 \label{se:gendeformat}
6507
To generate the de-formatter and re-formatter for a given format, an
XSLT style sheet is applied to the XML file that declares the
format. This transformation produces a \texttt{lex} \cite{lesk75tr}
file that, once compiled, yields the executable de-formatter and
re-formatter for the specified format.
6514
6515 Thanks to the general specification of formats described in this
6516 chapter, it has been possible to define specifications for HTML, RTF
6517 and plain text.  These specifications are in the \texttt{apertium}
6518 package, in the respective files \texttt{html-format.xml},
6519 \texttt{rtf-format.xml}, \texttt{txt-format.xml}.  In particular, it
6520 is quite simple to define de-formatters and re-formatters for any XML
6521 format.
6522
6523 \chapter{Installing and running the system}
6524 \label{se:instalacion}
6525
6526
6527 \section{System requirements}
6528
6529 The system where you want to install and run Apertium must have the
6530 following programs installed:
6531
6532 \begin{itemize}
6533 \item \texttt{libxml2} version 2.6.17 or later (on Ubuntu you may need
6534 to install \texttt{libxml2-dev} too)
6535
\item \texttt{xmllint} tool (usually comes with \texttt{libxml2}, but
may be an independent package on your system, e.g.\ on Debian GNU/Linux)
6538
6539 \item \texttt{xsltproc} tool (non-PowerPC users); also comes with
6540 \texttt{libxml2} but may also be an independent package in your
6541 system, as happens with the \texttt{xmllint} tool
6542
6543 \item \texttt{sabcmd} tool (PowerPC users), provided by package
6544 \texttt{sablotron}
6545
\item \texttt{flex} 2.5.4 or earlier (in some distributions, the \texttt{flex-old} package)
6547 \item GNU \texttt{make}, \texttt{gcc} (\texttt{g++}), \texttt{bash}
6548 shell
6549
6550 \end{itemize}
6551
6552 \section{Installing program packages}
6553
To install the programs and libraries of the Apertium machine
translation system, you first need to download (from
\url{http://sourceforge.net/projects/apertium}), compile and install
the latest version of the following packages, in the specified order:
6558
6559 \begin{enumerate}
6560 \item \texttt{lttoolbox}
6561 \item \texttt{apertium}
6562 \end{enumerate}
6563
6564 The simplest way to compile each package is:
6565
6566 \begin{enumerate}
6567 \item Go to the directory containing the package's source code and
6568 type \texttt{./configure} to configure the package for your system.
6569 If you're using csh on an old version of System V, you might need to
6570 type \texttt{sh ./configure} instead to prevent \texttt{csh} (the
6571 default shell in old System V) from trying to execute
6572 \texttt{configure} itself. Running \texttt{configure} takes a
6573 while. While running, it prints some messages telling which features
6574 it is checking for.
6575
6576 \item Type \texttt{make} to compile the package
6577
6578 \item Type \texttt{make install} (possibly with root privileges) to
6579 install the programs and any data files and documentation.
6580
\item You can remove the program binaries and object files from the
  source code directory by typing \texttt{make clean}. To also remove
  the files that \texttt{configure} created (so you can compile the
  package for a different kind of computer), type \texttt{make
  distclean}. There is also a \texttt{make maintainer-clean} target in
  the Makefile, but that is intended mainly for the package's
  developers. If you use it, you may have to get all sorts of other
  programs in order to regenerate files that came with the
  distribution.
6590 \end{enumerate}
6591
If you don't have root privileges to install the programs in your
system, you can use the \texttt{-{}-prefix} flag of the
\texttt{configure} script to install them under your user account. For
example:
6595
6596 \begin{small}
6597 \begin{alltt}
6598   \verb!$! pwd
6599   /home/me/lttoolbox-0.9.1
6600   \verb!$! ./configure --prefix=/home/me/myinstall
6601 \end{alltt}
6602 \end{small}
6603
Libraries will be installed in the \texttt{LIBDIR=\$prefix/lib}
directory. If no \texttt{-{}-prefix} flag is passed to the
\texttt{configure} script, \texttt{LIBDIR} will be
\texttt{/usr/local/lib}.


If you get errors when linking against installed libraries in a given
directory \texttt{LIBDIR}, you must either use \texttt{libtool} and
specify the full pathname of the library, or use the
\texttt{-LLIBDIR} flag during linking and do at least one of the
following:
6613
6614 \begin{itemize}
6615
6616 \item add \verb!LIBDIR! to the \verb!LD_LIBRARY_PATH! environment
6617 variable during execution
6618
6619 \item add \verb!LIBDIR! to the \verb!LD_RUN_PATH! environment variable
6620 during linking
6621
\item use the \texttt{-Wl,-{}-rpath -Wl,LIBDIR} linker flag
6624
6625 \item have your system administrator add \texttt{LIBDIR} to
6626 \texttt{/etc/ld.so.conf} and run \texttt{ldconfig}
6627
6628 \end{itemize}
6629
6630 See any operating system documentation about shared libraries for more
6631 information, such as the \texttt{ld(1)} and \texttt{ld.so(8)} manual
6632 pages.
6633
6634 \section{Installing data packages}
6635
6636 To install the linguistic data packages, follow these steps:
6637
6638 \begin{enumerate}
6639
6640 \item Download a data package
6641 (\texttt{apertium-}$LANG_1$\texttt{-}$LANG_2$\texttt{-}$VERSION$\texttt{.tar.gz})
6642 from Apertium's website in Sourceforge
6643 (\url{http://apertium.sourceforge.net/}). For example, to get version
6644 0.9 of the linguistic data for the Spanish--Catalan translator, you
6645 need to download the package \texttt{apertium-es-ca-0.9.tar.gz}.
6646
6647 \item Unpack the tarball in any directory, go to this directory and
6648 type \texttt{make} in the terminal. Wait while linguistic data are
6649 compiled.
6650
6651
6652 \end{enumerate}
6653
6654
6655 \section{Using the translator}
6656
Apertium versions exist for both Linux (always the more up-to-date
ones) and Windows.  The information in this section is intended for
Linux users.
6660
6661
6662 To run the translator, you have to use the
6663 \texttt{apertium-translator} tool referring to the directory where
6664 linguistic data are saved, and specifying the translation direction
6665 (\texttt{es-ca}, \texttt{ca-es}, \texttt{es-gl}, etc.), the file
6666 format (\texttt{txt}, \texttt{html}, \texttt{rtf}), the name of the
6667 file to be translated and the name of the output file. So, the command
6668 structure is as follows:
6669
6670
6671 \begin{small}
6672 \begin{alltt}
\$ apertium-translator <directory> <translation> <format> \verb!\!
                           < input_file > output_file
6675 \end{alltt}
6676 \end{small}
6677
6678
6679 For example, if your directory is \texttt{/home/maria/apertium-es-ca},
6680 you have to type the following to translate a file in \texttt{txt}
6681 format from Spanish to Catalan:
6682
6683 \begin{small}
6684 \begin{alltt}
\$ apertium-translator /home/maria/apertium-es-ca es-ca txt \verb!\!
                           < file_sp > file_ca
6686 \end{alltt}
6687 \end{small}
6688
6689 It is recommended to go to the directory where linguistic data are
6690 saved, because this way you only need to type a dot to refer to the
6691 current directory:
6692
6693 \begin{small}
6694 \begin{alltt}
6695 \$ apertium-translator . es-ca txt <file_sp >file_ca
6696 \end{alltt}
6697 \end{small}
6698
If no format is specified, the default format is \texttt{txt}. When
working with the \texttt{txt}, \texttt{html} and \texttt{rtf} formats,
unknown words are marked with an asterisk (*) and errors with a symbol
(@, \# or /); if you want neither unknown words nor errors to be
marked, append a \texttt{u} to the format name. Therefore, the format
options are the following:
6705
6706 \begin{itemize}
6707 \item \texttt{txt} : Default option, text with marks for unknown words
6708 and errors
6709
6710 \item \texttt{txtu} : text without marks for unknown words and errors
6711
6712 \item \texttt{html} : HTML with marks for unknown words and errors
6713
6714 \item \texttt{htmlu} : HTML without marks for unknown words and errors
6715
6716 \item \texttt{rtf} : RTF with marks for unknown words and errors
6717
6718 \item \texttt{rtfu} : RTF without marks for unknown words and errors
6719
6720 \end{itemize}
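
The naming scheme of the format options can be summarized with a
small Python sketch that builds the argument list for the
\texttt{apertium-translator} command described above. The function
\texttt{translator\_argv} and its parameter names are hypothetical
helpers introduced only for this illustration; the sketch builds the
command without running it.

\begin{small}
\begin{verbatim}
BASE_FORMATS = ("txt", "html", "rtf")


def translator_argv(data_dir, direction, fmt="txt", mark_unknown=True):
    """Build the argument list for apertium-translator.

    A trailing "u" on the format name switches off the marks for
    unknown words and errors.
    """
    if fmt not in BASE_FORMATS:
        raise ValueError("unsupported format: " + fmt)
    format_name = fmt if mark_unknown else fmt + "u"
    return ["apertium-translator", data_dir, direction, format_name]
\end{verbatim}
\end{small}

Such a list could then be handed to \texttt{subprocess.run}, with the
standard input and output attached to the file to be translated and to
the output file.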
6721
If you do not wish to translate a file but just a sentence or a
paragraph typed on the screen, you can run the
\texttt{apertium-translator} tool without specifying any file
name. If you are in the directory where linguistic data are saved, the
command is the following:
6726
6727 \begin{small}
6728 \begin{alltt}
6729 \$ apertium-translator . es-ca
6730 \end{alltt}
6731 \end{small}
6732
6733 Then, you have to type or paste the text you wish to translate (it can
6734 contain line breaks). To get the translated version, press Ctrl +
6735 D. The translation will be displayed on the screen.
6736
6737 A third way of translating with Apertium is using the \texttt{echo}
6738 command to send text through the translator:
6739
6740 \begin{small}
6741 \begin{alltt}
6742 \$ echo "text to be translated" | apertium-translator . es-ca
6743 \end{alltt}
6744 \end{small}
6745
6746
6747
6748 \chapter{Maintaining linguistic data}
6749 \label{se:datosling}
6750
6751 \notavisible{Perhaps one could integrate material from Fran Tyers' howto as found in Apertium Wiki}
6752 \section[Description of current data]{Description of linguistic data
6753 currently available}
6754
 At present, Apertium has linguistic data for two language pairs
 \nota{MG: This is old, needs UPDATING}: Spanish--Catalan and
 Spanish--Galician. The files containing the linguistic data for each
 pair are saved in a single directory: \texttt{apertium-es-ca} for the
 pair Spanish--Catalan and \texttt{apertium-es-gl} for the pair
 Spanish--Galician. The names of the files in this directory have the
 following structure:
6762
6763 \begin{itemize}\setlength{\itemsep}{-\parsep}
    \item \texttt{apertium-PAIR.LANG.dix} : monolingual dictionary for
    \texttt{LANG}.
    \item \texttt{apertium-PAIR.LANG1-LANG2.dix} :
    \texttt{LANG1-LANG2} bilingual dictionary.
    \item \texttt{apertium-PAIR.trules-LANG1-LANG2.xml} : structural
    transfer rules for the translation from \texttt{LANG1} to
    \texttt{LANG2}.
    \item \texttt{apertium-PAIR.LANG.tsx} : tagger definition file for
    \texttt{LANG}.
    \item \texttt{apertium-PAIR.post-LANG.dix} : post-generation
    dictionary for \texttt{LANG} (applies when translating into
    \texttt{LANG}).
    \item directory \texttt{LANG-tagger-data} : contains data needed
    for the \texttt{LANG} tagger (corpora, etc.).
6778
6779 \end{itemize}
6780
\texttt{apertium-PAIR} identifies the language pair of the
translator. Its two possible values at the moment are
\texttt{apertium-es-ca} and \\ \texttt{apertium-es-gl}. According to
this structure, the Catalan monolingual dictionary is called
\texttt{apertium-es-ca.ca.dix}, the Spanish--Galician bilingual
dictionary is called \texttt{apertium-es-gl.es-gl.dix} and the
structural transfer rules file for the translation from Catalan into
Spanish is called \texttt{apertium-es-ca.trules-ca-es.xml}.
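
The naming scheme can also be written down as a short Python sketch
that derives the file names for a given language pair. The function
\texttt{data\_file\_names} and the keys of the dictionary it returns
are hypothetical; only the file name patterns come from the list
above.

\begin{small}
\begin{verbatim}
def data_file_names(lang1, lang2):
    """Return the expected data file names for the pair lang1-lang2."""
    pair = "apertium-{}-{}".format(lang1, lang2)
    names = {"bidix": "{}.{}-{}.dix".format(pair, lang1, lang2)}
    for lang in (lang1, lang2):
        names["monodix-" + lang] = "{}.{}.dix".format(pair, lang)
        names["tagger-" + lang] = "{}.{}.tsx".format(pair, lang)
        names["postdix-" + lang] = "{}.post-{}.dix".format(pair, lang)
    for src, tgt in ((lang1, lang2), (lang2, lang1)):
        key = "trules-{}-{}".format(src, tgt)
        names[key] = "{}.trules-{}-{}.xml".format(pair, src, tgt)
    return names
\end{verbatim}
\end{small}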
6789
6790
The linguistic data available (as of January 2006) for the different
language pairs are summarized in the following table.
\begin{small}
\begin{center}
\begin{tabular}{|p{8cm}|p{5cm}|} \hline
\multicolumn{2}{|c|}{\textbf{Translator Apertium-es-ca}} \\ \hline
Spanish monolingual dictionary & 11,800 entries \\
Catalan monolingual dictionary & 11,800 entries \\
Spanish--Catalan bilingual dictionary & 12,800 entries (correspondences \texttt{es-ca}) \\
Structural transfer rules from Spanish into Catalan & 44 rules \\
Structural transfer rules from Catalan into Spanish & 58 rules \\
Spanish post-generation dictionary & 25 entries and 5 paradigms \\
Catalan post-generation dictionary & 16 entries and 57 paradigms \\ \hline
\multicolumn{2}{|c|}{\textbf{Translator Apertium-es-gl}} \\ \hline
Spanish monolingual dictionary & 9,000 entries \\
Galician monolingual dictionary & 8,600 entries \\
Spanish--Galician bilingual dictionary & 8,500 entries (correspondences \texttt{es-gl}) \\
Structural transfer rules from Spanish into Galician & 46 rules \\
Structural transfer rules from Galician into Spanish & 38 rules \\
Spanish post-generation dictionary & 36 entries and 12 paradigms \\
Galician post-generation dictionary & 74 entries and 48 paradigms \\ \hline
\end{tabular}
\end{center}
\end{small}
6815
6816
6817 \section[Adding words to dictionaries]{Adding words to monolingual and
6818 bilingual dictionaries}
6819
6820
When extending or adapting Apertium, the operation you will most
likely perform is extending its dictionaries; this is far more common
than adding transfer or post-generation rules.
6824
We describe next the most important things to take into account when
adding new words to the translator. This information is more general
than the description of dictionaries in section
\ref{ss:diccionarios}, but it includes practical details that may be
very useful to users who decide to make changes to the translator.

IMPORTANT: Every time any of the dictionaries is modified, the
modules have to be recompiled. Type \texttt{make} in the directory
where the linguistic data are saved (\texttt{apertium-es-ca},
\texttt{apertium-es-gl} or whichever applies) so that the system
generates the new binary files.
6837
If you want to add a new word to Apertium, you need to add three
entries to the dictionaries. Suppose you are working with the
Spanish--Catalan pair.  In this case, you have to add:

\begin{enumerate}
\item an entry in the Spanish monolingual dictionary: so that the
translator can analyze (``understand'') the word when it finds it in a
text, and generate it when translating this word into Spanish.

\item an entry in the bilingual dictionary: so that you can tell
Apertium how to translate this word from one language to the other.

\item an entry in the Catalan monolingual dictionary: so that the
translator can analyze (``understand'') the word when it finds it in a
text, and generate it when translating this word into Catalan.
\end{enumerate}
6854
You will need to go to the directory containing the XML dictionaries
(for the Spanish--Catalan pair, this is \texttt{apertium-es-ca}) and
open the three dictionary files mentioned, with a text editor or a
specialized XML editor: \texttt{apertium-es-ca.es.dix},
\texttt{apertium-es-ca.es-ca.dix} and
\texttt{apertium-es-ca.ca.dix}. The entries you need to create in
these three dictionaries share a common structure.  \\
6862
6863 \textbf{Monolingual dictionary (Spanish)}
6864
6865
6866 You may want, for example, to add the Spanish adjective
6867 \emph{cósmico}, whose equivalent in Catalan is \emph{còsmic}. The
6868 first step is to add this word to the Spanish monolingual dictionary.
6869
You will see that a monolingual dictionary has basically two types of
data: \textbf{paradigms} (in the \texttt{<pardefs>} section of the
dictionary, each paradigm inside a \texttt{<pardef>} element) and
\textbf{word entries} (in the main \texttt{<section>} of the
dictionary, each one inside an \texttt{<e>} element). Word entries
consist of a lemma (that is, the word as you would find it in a
typical paper dictionary) plus grammatical information; paradigms
contain the inflection data of all lemmas in the dictionary. You can
find a particular word by searching for the string
\texttt{lm="word"} (\texttt{lm} meaning \emph{lemma}).  Bear in mind,
however, that the attribute \texttt{lm} is optional and some other
dictionaries may not contain it.
6882
6883 Look at the word entries in the Spanish monolingual dictionary, for
6884 example at the entry for the adjective \emph{bonito}. You can find it
6885 by searching \texttt{lm="bonito"}:
6886
6887 \begin{small}
6888 \begin{alltt}
6889 <\textbf{e} \textsl{lm}="bonito">
6890   <\textbf{i}>bonit</\textbf{i}>
6891   <\textbf{par} \textsl{n}="absolut/o__adj"/>
6892 </\textbf{e}>
6893 \end{alltt}
6894 \end{small}
6895
6896 To add a word, you will have to create an entry with the same
6897 structure. The part between \texttt{<i>} and \texttt{</i>} contains
6898 the prefix of the word that is common to all inflected forms, and the
6899 element \texttt{<par>} refers to the inflection paradigm of this
6900 word. Therefore, this entry means that the adjective \emph{bonito}
6901 inflects like the adjective \emph{absoluto} and has the same
6902 morphological analysis: the forms \emph{bonit\textbf{o}},
6903 \emph{bonit\textbf{a}}, \emph{bonit\textbf{os}},
6904 \emph{bonit\textbf{as}} are equivalent to the forms
6905 \emph{absolut\textbf{o}}, \emph{absolut\textbf{a}},
6906 \emph{absolut\textbf{os}}, \emph{absolut\textbf{as}} and have the
6907 morphological analysis: \texttt{adj m sg}, \texttt{adj f sg},
6908 \texttt{adj m pl} and \texttt{adj f pl} respectively.
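
How an entry and its paradigm combine can be sketched in Python as
follows. The table \texttt{PARADIGMS} is a hypothetical in-memory view
of the paradigm \texttt{absolut/o\_\_adj}, reduced to the four endings
and analyses listed above; the function \texttt{expand} is likewise
only illustrative and does not reflect how the dictionaries are
actually compiled.

\begin{small}
\begin{verbatim}
# Hypothetical view of one paradigm: (ending, morphological analysis).
PARADIGMS = {
    "absolut/o__adj": [
        ("o", ("adj", "m", "sg")),
        ("a", ("adj", "f", "sg")),
        ("os", ("adj", "m", "pl")),
        ("as", ("adj", "f", "pl")),
    ],
}


def expand(stem, paradigm):
    """Return (surface form, analysis) pairs for stem under paradigm."""
    return [(stem + ending, analysis)
            for ending, analysis in PARADIGMS[paradigm]]


for form, analysis in expand("bonit", "absolut/o__adj"):
    print(form, " ".join(analysis))
\end{verbatim}
\end{small}

The stem is the content of the \texttt{<i>} element, and the paradigm
name is the value of the attribute \texttt{n} of \texttt{<par>}.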
6909
Now you have to decide what the lexical category of the word you
want to add is: the word \emph{cósmico} is an adjective, like
\emph{bonito}. Next, you have to find the appropriate paradigm for
this adjective. Is it the same as the one for \emph{bonito} and
\emph{absoluto}?  Can you say \emph{cósmic\textbf{o}},
\emph{cósmic\textbf{a}}, \emph{cósmic\textbf{os}},
\emph{cósmic\textbf{as}}? The answer is yes, and, with all this
information, you can now create the correct entry:
6918
6919 \begin{small}
6920 \begin{alltt}
6921 <\textbf{e} \textsl{lm}="cósmico">
  <\textbf{i}>cósmic</\textbf{i}>
6923   <\textbf{par} \textsl{n}="absolut/o__adj"/>
6924 </\textbf{e}>
6925 \end{alltt}
6926 \end{small}
6927
6928
If the word you want to add has a different paradigm, you have to find
it in the dictionary and assign it to the entry. There are two ways to
find the appropriate paradigm: looking in the \texttt{<pardefs>}
section of the dictionary, where each paradigm is defined inside
a \texttt{<pardef>} element, or finding another word that is likely
to be in the dictionary already and that has the same inflection
paradigm as the word to be added. For example, if you want to add the
word \emph{genoma}, you need to find an appropriate paradigm for a
\textbf{noun} whose gender is masculine and which forms the plural by
adding an \textbf{-s}. In our present dictionaries, this is the
paradigm ``\texttt{abismo\_\_n}''. Therefore, the entry for this new
word would be:
6941
6942 \begin{small}
6943 \begin{alltt}
6944 <\textbf{e} \textsl{lm}="genoma">
  <\textbf{i}>genoma</\textbf{i}>
6946   <\textbf{par} \textsl{n}="abismo__n"/>
6947 </\textbf{e}>
6948 \end{alltt}
6949 \end{small}
6950
6951 In exceptional cases you will need to create a new paradigm for a
6952 certain word. You can look at the structure of other paradigms and
6953 create one accordingly. For a more detailed description of paradigms
6954 and word entries in the dictionaries, refer to section
6955 \ref{ss:diccionarios}.  \\
6956
6957 \textbf{Monolingual dictionary (Catalan)}
6958
6959 Once you have added the word to one monolingual dictionary, you have
6960 to do the same to the other monolingual dictionary of the translation
6961 pair (in our example, the Catalan monolingual dictionary) using the
6962 same structure. The result would be:
6963
6964 \begin{small}
6965 \begin{alltt}
6966 <\textbf{e} \textsl{lm}="còsmic">
  <\textbf{i}>còsmi</\textbf{i}>
6968   <\textbf{par} \textsl{n}="acadèmi/c__adj"/>
6969 </\textbf{e}>
6970 \end{alltt}
6971 \end{small}
6972
6973 \textbf{Monolingual dictionary (Galician)}
6974
If you are improving the XML dictionaries for the
Spanish--Galician pair, you will need to go to the directory
\texttt{apertium-es-gl} and open the three dictionary files
\texttt{apertium-es-gl.es.dix}, \texttt{apertium-es-gl.es-gl.dix} and
\texttt{apertium-es-gl.gl.dix} with a text editor or a specialized
XML editor. In that case, once you have added the
new Spanish word \emph{genoma} to the Spanish monolingual dictionary
(\texttt{apertium-es-gl.es.dix}), you have to add the equivalent
Galician word \emph{xenoma} to the Galician monolingual dictionary
(\texttt{apertium-es-gl.gl.dix}), that is:
6985
6986 \begin{small}
6987 \begin{alltt}
6988 <\textbf{e} \textsl{lm}="xenoma">
  <\textbf{i}>xenoma</\textbf{i}>
6990   <\textbf{par} \textsl{n}="Xulio__n"/>
6991 </\textbf{e}>
6992 \end{alltt}
6993 \end{small}
6994
6995 \textbf{Bilingual dictionary}
6996
6997 The last step is to add the translation to the bilingual dictionary.
6998
A bilingual dictionary does not usually have paradigms, only
lemmas. An entry contains only the lemma in both languages and the
first grammatical symbol (the lexical category) of each one. Entries
have a left side (\texttt{<l>}) and a right side (\texttt{<r>}), and
each language must always be on the same side: in our system, by
convention, Spanish occupies the left side, and Catalan, Galician and
Portuguese the right side.
7006
7007
7008 With the addition of the lemma of both words, the system will
7009 translate all their inflected forms (the grammatical symbols are
7010 copied from the source language word to the target language
7011 word). This will only work if the source language word and the target
7012 language word are grammatically equivalent, that is, if they share
7013 exactly the same grammatical symbols for all of their inflected
7014 forms. This is the case with our example; therefore, the entry you
7015 have to add to the bilingual dictionary is:
7016
7017
7018 \begin{small}
7019 \begin{alltt}
7020 <\textbf{e}>
7021   <\textbf{p}>
7022     <\textbf{l}>cósmico<\textbf{s} \textsl{n}="adj"/></\textbf{l}>
7023     <\textbf{r}>còsmic<\textbf{s} \textsl{n}="adj"/></\textbf{r}>
7024   </\textbf{p}>
7025 </\textbf{e}>
7026 \end{alltt}
7027 \end{small}
7028
7029 This entry will translate all the inflected forms, that is,
7030 \texttt{adj m sg}, \texttt{adj f sg}, \texttt{adj m pl} and
7031 \texttt{adj f pl}. It works for the translation in both directions:
7032 from Spanish to Catalan and from Catalan to Spanish.
7033
7034 In the case of the Spanish-Galician pair, the following bilingual
7035 entry in the Spanish-Galician bilingual dictionary
7036 (\texttt{apertium-es-gl.es-gl.dix}) will translate all the inflected
7037 forms for the equivalent words \emph{genoma}/\emph{xenoma} in both
7038 directions:
7039
7040 \begin{small}
7041 \begin{alltt}
7042 <\textbf{e}>
7043   <\textbf{p}>
7044     <\textbf{l}>genoma<\textbf{s} \textsl{n}="n"/></\textbf{l}>
7045     <\textbf{r}>xenoma<\textbf{s} \textsl{n}="n"/></\textbf{r}>
7046   </\textbf{p}>
7047 </\textbf{e}>
7048 \end{alltt}
7049 \end{small}
7050
What if the two words are not grammatically equivalent (that is, their
grammatical symbols are not exactly the same)? In that case, you need
to specify all the grammatical symbols (in the same order as they are
specified in the monolingual dictionaries) up to and including the
symbol that differs between the source language word and the target
language word. For example, the Spanish noun \emph{limón} has
masculine gender and its equivalent in Catalan, \emph{llimona}, has
feminine gender. The entry in the bilingual dictionary must be as
follows:
7059
7060 \begin{small}
7061 \begin{alltt}
7062 <\textbf{e}>
7063   <\textbf{p}>
7064     <\textbf{l}>limón<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="m"/></\textbf{l}>
7065     <\textbf{r}>llimona<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="f"/></\textbf{r}>
7066   </\textbf{p}>
7067 </\textbf{e}>
7068 \end{alltt}
7069 \end{small}
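
The way the symbols listed in a bilingual entry interact with the
remaining symbols of a word can be illustrated with the following
Python sketch. It is a simplification under stated assumptions: the
list \texttt{BIDIX} holds only the two entries discussed above, the
function \texttt{translate\_lr} is hypothetical, and the lookup is a
plain linear search rather than the mechanism used by the system. The
symbols given in the entry are replaced; the symbols that follow them
are copied unchanged.

\begin{small}
\begin{verbatim}
# Hypothetical view of two bilingual entries:
# ((left lemma, left symbols), (right lemma, right symbols)).
BIDIX = [
    (("cósmico", ("adj",)), ("còsmic", ("adj",))),
    (("limón", ("n", "m")), ("llimona", ("n", "f"))),
]


def translate_lr(lemma, tags):
    """Translate left to right; trailing symbols are copied as they are."""
    tags = tuple(tags)
    for (l_lemma, l_tags), (r_lemma, r_tags) in BIDIX:
        if lemma == l_lemma and tags[:len(l_tags)] == l_tags:
            return r_lemma, r_tags + tags[len(l_tags):]
    return None
\end{verbatim}
\end{small}

With these entries, \emph{limón} analysed as \texttt{n m pl} becomes
\emph{llimona} with \texttt{n f pl}: the gender comes from the entry
and the number is copied from the source word.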
7070
7071
7072 A more difficult problem arises when two words have different
7073 grammatical symbols and the grammatical information of the source
7074 language word is not enough to determine the gender (masculine or
7075 feminine) or the number (singular or plural) of the target language
7076 word. Take for example the Spanish adjective \emph{canadiense}. Its
7077 gender is masculine--feminine since it is invariable in gender, that
7078 is, it can go both with masculine and feminine nouns (\emph{hombre
7079 canadiense}, \emph{mujer canadiense}). In Catalan, on the other hand,
7080 the adjective has a different inflection for the masculine and the
7081 feminine (\emph{home canadenc}, \emph{dona canadenca}). Therefore,
7082 when translating from Spanish to Catalan it is not possible to know,
7083 without looking at the accompanying noun, whether the Spanish
7084 adjective (\emph{mf}) has to be translated as a feminine or a
7085 masculine adjective in Catalan. In that case, the symbol \texttt{GD}
7086 (for "gender to be determined") is used instead of the gender
7087 symbol. \label{GDND} The word's gender will be determined by the
7088 structural transfer module, by means of a transfer rule (a rule that
7089 detects the gender of the preceding noun in this particular
7090 case). Therefore, \texttt{GD} must be used only when translating from
7091 Spanish to Catalan, but not when translating from Catalan to Spanish,
7092 as in Spanish the gender will always be \texttt{mf} regardless of the
7093 gender of the original word.  In the bilingual dictionary you will
7094 need to add, in this case, more than one entry with direction
7095 indications, as you must specify in which translation direction the
7096 gender remains undetermined. The entries for this adjective should be
7097 as follows:
7098
7099 \begin{small}
7100 \begin{alltt}
7101 <\textbf{e} \textsl{r}="LR">
7102   <\textbf{p}>
7103     <\textbf{l}>canadiense<\textbf{s} \textsl{n}="adj"/><\textbf{s} \textsl{n}="mf"/></\textbf{l}>
7104     <\textbf{r}>canadenc<\textbf{s} \textsl{n}="adj"/><\textbf{s} \textsl{n}="GD"/></\textbf{r}>
7105   </\textbf{p}>
7106 </\textbf{e}>
7107 <\textbf{e} \textsl{r}="RL">
7108   <\textbf{p}>
7109     <\textbf{l}>canadiense<\textbf{s} \textsl{n}="adj"/><\textbf{s} \textsl{n}="mf"/></\textbf{l}>
7110     <\textbf{r}>canadenc<\textbf{s} \textsl{n}="adj"/><\textbf{s} \textsl{n}="f"/></\textbf{r}>
7111   </\textbf{p}>
7112 </\textbf{e}>
7113 <\textbf{e} \textsl{r}="RL">
7114   <\textbf{p}>
7115     <\textbf{l}>canadiense<\textbf{s} \textsl{n}="adj"/><\textbf{s} \textsl{n}="mf"/></\textbf{l}>
7116     <\textbf{r}>canadenc<\textbf{s} \textsl{n}="adj"/><\textbf{s} \textsl{n}="m"/></\textbf{r}>
7117   </\textbf{p}>
7118 </\textbf{e}>
7119 \end{alltt}
7120 \end{small}
7121
7122 "\texttt{LR}" means \emph{left to right} and "\texttt{RL}",
7123 \emph{right to left}. Since Spanish is on the left and Catalan on the
7124 right, the adjective will be \texttt{GD} only when translating from
7125 Spanish to Catalan (\texttt{LR}). For the translation \texttt{RL} you
7126 need to create two entries, one for the adjective in feminine and
another one for the adjective in masculine.\footnote{You could also
group them using a small paradigm.}
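
Using an arrow notation to show the direction in which each entry is
read, the three entries above amount to the following translation
directions (a summary sketch; the symbols in parentheses are those
carried by each side of the entry):

\begin{small}
\begin{alltt}
    canadiense (adj, mf) --> canadenc (adj, GD)
    canadiense (adj, mf) <-- canadenc (adj, f)
    canadiense (adj, mf) <-- canadenc (adj, m)
\end{alltt}
\end{small}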
7129
7130 The same principle applies when it is not possible to determine the
7131 number of the target word for the same reasons mentioned above. For
7132 example, the Spanish noun \emph{rascacielos} ("skyscraper") is
7133 invariable in number, that is, it can be singular as well as plural
7134 (\emph{un rascacielos}, \emph{dos rascacielos}). In Catalan, on the
7135 other hand, the noun has a different inflection for the singular and
7136 for the plural (\emph{un gratacel}, \emph{dos gratacels}).  In this
7137 case the symbol used is "\texttt{ND}" ("number to be determined") and
7138 the entries should be like this:
7139
7140
7141 \begin{small}
7142 \begin{alltt}
7143 <\textbf{e} \textsl{r}="LR">
7144   <\textbf{p}>
7145     <\textbf{l}>rascacielos<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="m"/><\textbf{s} \textsl{n}="sp"/></\textbf{l}>
7146     <\textbf{r}>gratacel<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="m"/><\textbf{s} \textsl{n}="ND"/></\textbf{r}>
7147   </\textbf{p}>
7148 </\textbf{e}>
7149 <\textbf{e} \textsl{r}="RL">
7150   <\textbf{p}>
7151     <\textbf{l}>rascacielos<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="m"/><\textbf{s} \textsl{n}="sp"/></\textbf{l}>
7152     <\textbf{r}>gratacel<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="m"/><\textbf{s} \textsl{n}="pl"/></\textbf{r}>
7153   </\textbf{p}>
7154 </\textbf{e}>
7155 <\textbf{e} \textsl{r}="RL">
7156   <\textbf{p}>
7157     <\textbf{l}>rascacielos<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="m"/><\textbf{s} \textsl{n}="sp"/></\textbf{l}>
7158     <\textbf{r}>gratacel<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="m"/><\textbf{s} \textsl{n}="sg"/></\textbf{r}>
7159   </\textbf{p}>
7160 </\textbf{e}>
7161 \end{alltt}
7162 \end{small}
7163
For a more detailed description of this kind of entry, refer to
Section~\ref{ss:bil}.
7166
7167
7168
7169 \subsection{Adding direction restrictions}
7170
7171 In the previous example we have already seen the use of direction
7172 restrictions for entries with undetermined gender or number
7173 (\texttt{GD} or \texttt{ND}). These restrictions can also be used in
7174 other cases.
7175
7176 It is important to note that the current version of Apertium can give
7177 only a single equivalent for each source-language lexical form
7178 \nota{NEEDS UPDATING, reference to lextor} (a lexical form is the
7179 lemma plus its grammatical information), that is, no word-sense
7180 disambiguation is performed.\footnote{The system performs only
7181 part-of-speech disambiguation for homograph words, that is, for
7182 ambiguous words that can be analyzed as more than one lexical form,
7183 like \emph{vino} in Spanish, that can mean both "wine" and "he/she
7184 came". This type of disambiguation is performed by the tagger.} When a
7185 lexical form can be translated in two or more different ways, one has
7186 to be chosen (the most general, the most frequent, etc.).  You can
7187 tell Apertium that a certain word has to be analyzed ("understood")
7188 but not generated, as it is not the translation of any word in the
7189 other language.
7190
7191 Let's see this with an example. The Spanish noun \emph{muñeca} can be
7192 translated in two different ways in Catalan depending on its meaning:
7193 \emph{canell} ("wrist") or \emph{nina} ("doll"). The context decides
which translation is the correct one, but in its present state
Apertium cannot make such a decision.\footnote{See Section
\ref{multi} on multiword units for ways to circumvent this problem.}
7197 Therefore, you have to decide which word you want as an equivalent
7198 when translating from Spanish to Catalan.  From Catalan to Spanish,
7199 both words can be translated as \emph{muñeca} without any problem. You
7200 have to specify all these circumstances in the dictionary entries
7201 using direction restrictions (\texttt{LR} meaning "left to right",
7202 that is, \texttt{es}--\texttt{ca}, and \texttt{RL} meaning "right to
7203 left", that is, \texttt{ca}--\texttt{es}). If you decide to translate
7204 \emph{muñeca} as \emph{canell} in all cases, the entries in the
7205 bilingual dictionary shall be:
7206
7207
7208 \begin{small}
7209 \begin{alltt}
7210 <\textbf{e}>
7211   <\textbf{p}>
7212     <\textbf{l}>muñeca<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="f"/></\textbf{l}>
7213     <\textbf{r}>canell<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="m"/></\textbf{r}>
7214   </\textbf{p}>
7215 </\textbf{e}>
7216
7217 <\textbf{e} \textsl{r}="RL">
7218   <\textbf{p}>
7219     <\textbf{l}>muñeca<\textbf{s} \textsl{n}="n"/></\textbf{l}>
7220     <\textbf{r}>nina<\textbf{s} \textsl{n}="n"/></\textbf{r}>
7221   </\textbf{p}>
7222 </\textbf{e}>
7223 \end{alltt}
7224 \end{small}
7225
7226 This means that translation directions will be:
7227 \begin{small}
7228 \begin{alltt}
7229     muñeca --> canell
7230     muñeca <-- canell
7231     muñeca <-- nina
7232 \end{alltt}
7233 \end{small}
7234
(Note that there is also a gender change in the case of
\emph{muñeca} (feminine) and \emph{canell} (masculine).)
7237
7238 It should be emphasized that a lemma can not have two translations in
7239 the target language, because the system would give an error when
7240 translating that lemma (see Section \ref{errores} "Detecting errors"
7241 to see how to find and correct these and other types of errors). When
7242 a word can be translated in two different ways in the target language
7243 in all contexts, you need to choose one as the translation equivalent
7244 and leave the other one as a lemma that can be analyzed but not
7245 generated, using direction restrictions like in the previous
7246 example. For example, the Catalan lemmas \emph{mot} and \emph{paraula}
7247 can be both translated into Spanish as \emph{palabra} ("word") and the
7248 entries in the bilingual dictionary should look like this:
7249
7250 \begin{small}
7251 \begin{alltt}
7252 <\textbf{e}>
7253   <\textbf{p}>
7254     <\textbf{l}>palabra<\textbf{s} \textsl{n}="n"/></\textbf{l}>
7255     <\textbf{r}>paraula<\textbf{s} \textsl{n}="n"/></\textbf{r}>
7256   </\textbf{p}>
7257 </\textbf{e}>
7258
7259 <\textbf{e} \textsl{r}="RL">
7260   <\textbf{p}>
7261     <\textbf{l}>palabra<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="f"/></\textbf{l}>
7262     <\textbf{r}>mot<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="m"/></\textbf{r}>
7263   </\textbf{p}>
7264 </\textbf{e}>
7265 \end{alltt}
7266 \end{small}
7267
Therefore, for these lemmas the translation directions will be:
7269 \begin{small}
7270 \begin{alltt}
7271     palabra --> paraula
7272     palabra <-- paraula
7273     palabra <-- mot
7274 \end{alltt}
7275 \end{small}
7276
7277 One may have to specify restrictions regarding translation direction
7278 also in monolingual dictionaries. For example, both Spanish forms
7279 \emph{cantaran} and \emph{cantasen} should be analyzed as lemma
7280 \emph{cantar}, verb, subjunctive imperfect, 3rd person plural, but
7281 when generating Spanish text, one has to decide which one will be
generated. Monolingual dictionaries are read in two directions
depending on their purpose: for the analysis, the reading direction is
7284 left to right; for the generation, right to left. Therefore, a word
7285 that must be analyzed but not generated must have the restriction
7286 \texttt{LR}, and a word that must be generated but not analyzed must
7287 have the restriction \texttt{RL}.
7288
7289
7290 The case of \emph{cantaran} or \emph{cantasen} must have already been
7291 taken care of in inflection paradigms and it is unlikely to be a
7292 problem for most people extending a dictionary. In some other cases it
7293 can be necessary to introduce a restriction in the word entries of
7294 monolingual dictionaries.
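
As a sketch of what such a restriction looks like, the following pair
of entries inside a Spanish verb paradigm would let both endings be
analysed while only \emph{-aran} is generated. The choice of
\emph{-aran} as the generated form, and the symbols \texttt{pis}
(standing here for the imperfect subjunctive) and \texttt{p3}, are
assumptions made for this illustration; check the actual paradigm in
your dictionary before relying on them:

\begin{small}
\begin{alltt}
<\textbf{e}>
  <\textbf{p}>
    <\textbf{l}>aran</\textbf{l}>
    <\textbf{r}>ar<\textbf{s} \textsl{n}="vblex"/><\textbf{s} \textsl{n}="pis"/><\textbf{s} \textsl{n}="p3"/><\textbf{s} \textsl{n}="pl"/></\textbf{r}>
  </\textbf{p}>
</\textbf{e}>
<\textbf{e} \textsl{r}="LR">
  <\textbf{p}>
    <\textbf{l}>asen</\textbf{l}>
    <\textbf{r}>ar<\textbf{s} \textsl{n}="vblex"/><\textbf{s} \textsl{n}="pis"/><\textbf{s} \textsl{n}="p3"/><\textbf{s} \textsl{n}="pl"/></\textbf{r}>
  </\textbf{p}>
</\textbf{e}>
\end{alltt}
\end{small}

The first entry has no restriction, so it is used both for analysis
and for generation; the second one, marked \texttt{LR}, is used only
for analysis.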
7295
7296 \subsection{Adding multiwords}
7297 \label{multi}
7298
It is possible to create entries consisting of two or more words, if
these words are considered to form a single "translation unit".
These multiword units can also be useful when it comes to selecting the
correct equivalent for a word inside a fixed expression. For example,
7303 the Spanish word \emph{dirección} may be translated into two Catalan
7304 words: \emph{direcció} ("direction, management, directorate,
7305 steering", etc.) and \emph{adreça} ("address"); including, for
7306 example, frequent multiword units such as \emph{dirección general}
7307 \(\to\) \emph{direcció general} ("general directorate") and
7308 \emph{dirección postal} \(\to\) \emph{adreça postal} ("postal
7309 address") may help get improved translations in some situations.
7310
7311 Multiword units can be classified basically into two categories:
7312 multiwords with inner inflection and multiwords without inner
7313 inflection.
7314
7315 \subsubsection{Multiwords without inner inflection}
7316
7317 They are just like the normal one-word entries, with the only
7318 difference that you need to insert the element \texttt{<b>} (which
7319 represents a blank) between the individual words that make up the
7320 unit. Therefore, if you want to add, for example, the Spanish
7321 multiword \emph{hoy en día} ("nowadays"), whose equivalent in Catalan
7322 is \emph{avui dia}, the entries you need to add to the different
7323 dictionaries are:
7324
7325 \begin{itemize}
7326
7327 \item Spanish monolingual dictionary:
7328 \begin{small}
7329 \begin{alltt}
7330 <\textbf{e} \textsl{lm}="hoy en día">
7331   <\textbf{i}>hoy<\textbf{b}/>en<\textbf{b}/>día</\textbf{i}>
7332   <\textbf{par} \textsl{n}="ahora__adv"/>
7333 </\textbf{e}>
7334 \end{alltt}
7335 \end{small}
7336
7337 \item Catalan monolingual dictionary:
7338 \begin{small}
7339 \begin{alltt}
7340 <\textbf{e} \textsl{lm}="avui dia">
7341   <\textbf{i}>avui<\textbf{b}/>dia</\textbf{i}>
7342   <\textbf{par} \textsl{n}="ahir__adv"/>
7343 </\textbf{e}>
7344 \end{alltt}
7345 \end{small}
7346
7347 \item Spanish-Catalan bilingual dictionary:
7348 \begin{small}
7349 \begin{alltt}
7350 <\textbf{e}>
7351   <\textbf{p}>
7352     <\textbf{l}>hoy<\textbf{b}/>en<\textbf{b}/>día<\textbf{s} \textsl{n}="adv"/></\textbf{l}>
7353     <\textbf{r}>avui<\textbf{b}/>dia<\textbf{s} \textsl{n}="adv"/></\textbf{r}>
7354   </\textbf{p}>
7355 </\textbf{e}>
7356 \end{alltt}
7357 \end{small}
7358
7359 \end{itemize}
7360
For the Spanish--Galician pair, if you want to add, for example, the
7362 Spanish multiword \emph{manga por hombro} ("disarranged"), whose
7363 equivalent in Galician is \emph{sen xeito nin modo}, the entries you
7364 need to add are:
7365
7366 \begin{itemize}
7367
7368 \item Spanish monolingual dictionary:
7369 \begin{small}
7370 \begin{alltt}
7371 <\textbf{e} \textsl{lm}="manga por hombro">
7372   <\textbf{i}>manga<\textbf{b}/>por<\textbf{b}/>hombro</\textbf{i}>
7373   <\textbf{par} \textsl{n}="ahora__adv"/>
7374 </\textbf{e}>
7375 \end{alltt}
7376 \end{small}
7377
7378 \item Galician monolingual dictionary:
7379 \begin{small}
7380 \begin{alltt}
7381 <\textbf{e} \textsl{lm}="sen xeito nin modo">
7382   <\textbf{i}>sen<\textbf{b}/>xeito<\textbf{b}/>nin<\textbf{b}/>modo</\textbf{i}>
7383   <\textbf{par} \textsl{n}="Deo_gratias__adv"/>
7384 </\textbf{e}>
7385 \end{alltt}
7386 \end{small}
7387
7388 \item Spanish-Galician bilingual dictionary:
7389 \begin{small}
7390 \begin{alltt}
7391 <\textbf{e}>
7392   <\textbf{p}>
7393     <\textbf{l}>manga<\textbf{b}/>por<\textbf{b}/>hombro<\textbf{s} \textsl{n}="adv"/></\textbf{l}>
7394     <\textbf{r}>sen<\textbf{b}/>xeito<\textbf{b}/>nin<\textbf{b}/>modo<\textbf{s} \textsl{n}="adv"/></\textbf{r}>
7395   </\textbf{p}>
7396 </\textbf{e}>
7397 \end{alltt}
7398 \end{small}
7399
7400 \end{itemize}
7401
7402 \subsubsection{Brief introduction to paradigms}
7403
Since adverbs do not inflect, the paradigms used in the previous
examples contain only the grammatical symbol of the lexical form, as
this example shows:
7407
7408 \begin{small}
7409 \begin{alltt}
7410 <\textbf{pardef} \textsl{n}="ahora__adv">
7411   <\textbf{e}>
7412     <\textbf{p}>
7413       <\textbf{l}/>
7414       <\textbf{r}><\textbf{s} \textsl{n}="adv"/></\textbf{r}>
7415     </\textbf{p}>
7416   </\textbf{e}>
7417 </\textbf{pardef}>
7418 \end{alltt}
7419 \end{small}
7420
Paradigms are built like lexical entries. So far we have seen lexical
entries where the common part of the lemma is put between \texttt{<i>}
and \texttt{</i>}:
7424
7425 \begin{small}
7426 \begin{alltt}
7427 <\textbf{e} \textsl{lm}="cósmico">
7428   <\textbf{i}>cósmic</\textbf{i}>
7429   <\textbf{par} \textsl{n}="absolut/o__adj"/>
7430 </\textbf{e}>
7431 \end{alltt}
7432 \end{small}
7433
7434
7435 But you can also express the same with a pair of strings: a left
7436 string \texttt{<l>} and a right string \texttt{<r>} inside a
7437 \texttt{<p>} element:
7438
7439 \begin{small}
7440 \begin{alltt}
7441 <\textbf{e} \textsl{lm}="cósmico">
7442   <\textbf{p}>
7443     <\textbf{l}>cósmic</\textbf{l}>
7444     <\textbf{r}>cósmic</\textbf{r}>
7445   </\textbf{p}>
7446   <\textbf{par} \textsl{n}="absolut/o__adj"/>
7447 </\textbf{e}>
7448 \end{alltt}
7449 \end{small}
7450
7451
These two entries are equivalent. The \texttt{<i>} element
makes entries simpler and more compact, and you can use it whenever the
left side and the right side of the string pair are identical. As has
been explained before, monolingual dictionaries are read \texttt{LR}
for the analysis of a text and \texttt{RL} for the
generation. Therefore, when the analysed string and the generated
string differ (which is not very usual), the entry
cannot be written using the \texttt{<i>} element. This is what
happens in paradigms, where the left and right strings are never
identical, since the right side must contain the grammatical symbols
that will go through all the modules of the system.
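
To make the relation between an entry and its paradigm concrete, the
entry for \emph{hoy en día} shown earlier, once combined with the
paradigm \texttt{ahora\_\_adv}, should behave like the following
single string pair. This is an illustrative expansion written by hand
to show the effect of the paradigm, not an entry you need to add to a
dictionary:

\begin{small}
\begin{alltt}
<\textbf{e} \textsl{lm}="hoy en día">
  <\textbf{p}>
    <\textbf{l}>hoy<\textbf{b}/>en<\textbf{b}/>día</\textbf{l}>
    <\textbf{r}>hoy<\textbf{b}/>en<\textbf{b}/>día<\textbf{s} \textsl{n}="adv"/></\textbf{r}>
  </\textbf{p}>
</\textbf{e}>
\end{alltt}
\end{small}

The left side is the form found in the text and the right side is the
lexical form (lemma plus grammatical symbol) that the rest of the
modules work with.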
7463
7464 \subsubsection{Multiwords with inner inflection}
7465
7466
7467 They consist of a word that can inflect (typically a verb) followed by
7468 one or more invariable words. For these entries you need to specify
7469 the inflection paradigm just after the word that inflects. The
invariable part must be marked with the element \texttt{<g>} (for
\emph{group}) on the right side. The blanks between words are
indicated, as in the previous case, with the element
\texttt{<b>}. Look at the following example for the Spanish multiword
\emph{echar de menos} ("to miss"), translated into Catalan as
7475 \emph{trobar a faltar}:
7476
7477 \begin{itemize}
7478
7479 \item Spanish monolingual dictionary:
7480 \begin{small}
7481 \begin{alltt}
7482 <\textbf{e} \textsl{lm}="echar de menos">
7483     <\textbf{i}>ech</\textbf{i}>
7484     <\textbf{par} \textsl{n}="aspir/ar__vblex"/>
7485     <\textbf{p}>
7486       <\textbf{l}><\textbf{b}/>de<\textbf{b}/>menos</\textbf{l}>
7487       <\textbf{r}><\textbf{g}><\textbf{b}/>de<\textbf{b}/>menos</\textbf{g}></\textbf{r}>
7488     </\textbf{p}>
7489 </\textbf{e}>
7490 \end{alltt}
7491 \end{small}
7492
7493 \item Catalan monolingual dictionary:
7494 \begin{small}
7495 \begin{alltt}
7496 <\textbf{e} \textsl{lm}="trobar a faltar">
7497     <\textbf{i}>trob</\textbf{i}>
7498     <\textbf{par} \textsl{n}="abander/ar__vblex"/>
7499     <\textbf{p}>
7500       <\textbf{l}><\textbf{b}/>a<\textbf{b}/>faltar</\textbf{l}>
7501       <\textbf{r}><\textbf{g}><\textbf{b}/>a<\textbf{b}/>faltar</\textbf{g}></\textbf{r}>
7502     </\textbf{p}>
7503 </\textbf{e}>
7504 \end{alltt}
7505 \end{small}
7506
7507 \item Spanish-Catalan bilingual dictionary:
7508 \begin{small}
7509 \begin{alltt}
7510 <\textbf{e}>
7511   <\textbf{p}>
7512     <\textbf{l}>echar<\textbf{g}><\textbf{b}/>de<\textbf{b}/>menos</\textbf{g}><\textbf{s} \textsl{n}="vblex"/></\textbf{l}>
7513     <\textbf{r}>trobar<\textbf{g}><\textbf{b}/>a<\textbf{b}/>faltar</\textbf{g}><\textbf{s} \textsl{n}="vblex"/></\textbf{r}>
7514   </\textbf{p}>
7515 </\textbf{e}>
7516 \end{alltt}
7517 \end{small}
7518
7519 \end{itemize}
7520
7521
Note that the grammatical symbol is appended at the end, after the
group marked with the \texttt{<g>} element.
7524
7525 It can be the case that a lemma is a multiword of this kind in one
7526 language and a single word in the other language. In that case, in the
7527 bilingual dictionary, the multiword will contain the \texttt{<g>}
7528 element and the single word will not. In the monolingual dictionaries,
7529 each entry will be created according to its type.  Look at the
following example for the Spanish multiword \emph{darse cuenta} ("to
realize"), translated into Catalan as the verb
7532 \emph{adonar-se}:\footnote{The verb \emph{adonar-se} is considered a
7533 simple word, since the incorporation of enclitic pronouns (such as
7534 "-se") is treated inside the inflection paradigms of verbs (for all
7535 the Romance languages of \emph{Apertium}); therefore, it is not
7536 necessary to specify them in lexical entries. The correct placement of
7537 clitic pronouns is one of the main reasons for using the
7538 \texttt{<g>}... \texttt{</g>} labels around the invariable part of
7539 multi-word verbs.}
7540
7541 \begin{itemize}
7542
7543 \item Spanish monolingual dictionary:
7544 \begin{small}
7545 \begin{alltt}
7546 <\textbf{e} \textsl{lm}="darse cuenta">
7547     <\textbf{i}>d</\textbf{i}>
7548     <\textbf{par} \textsl{n}="d/ar__vblex"/>
7549     <\textbf{p}>
7550       <\textbf{l}><\textbf{b}/>cuenta</\textbf{l}>
7551       <\textbf{r}><\textbf{g}><\textbf{b}/>cuenta</\textbf{g}></\textbf{r}>
7552     </\textbf{p}>
7553 </\textbf{e}>
7554 \end{alltt}
7555 \end{small}
7556
7557 \item Catalan monolingual dictionary:
7558 \begin{small}
7559 \begin{alltt}
7560 <\textbf{e} \textsl{lm}="adonar-se">
7561     <\textbf{i}>adon</\textbf{i}>
7562     <\textbf{par} \textsl{n}="abander/ar__vblex"/>
7563 </\textbf{e}>
7564 \end{alltt}
7565 \end{small}
7566
7567 \item Spanish-Catalan bilingual dictionary:
7568 \begin{small}
7569 \begin{alltt}
7570 <\textbf{e}>
7571   <\textbf{p}>
7572     <\textbf{l}>dar<\textbf{g}><\textbf{b}/>cuenta</\textbf{g}><\textbf{s} \textsl{n}="vblex"/></\textbf{l}>
7573     <\textbf{r}>adonar<\textbf{s} \textsl{n}="vblex"/></\textbf{r}>
7574   </\textbf{p}>
7575 </\textbf{e}>
7576 \end{alltt}
7577 \end{small}
7578
7579 \end{itemize}
7580
7581 The same principles and actions described for basic entries (gender
7582 and number change, direction restrictions, etc.) apply to all kinds of
7583 multiwords. For a more detailed description of multiword units, refer
7584 to section~\ref{ss:multipalabras}.
7585
7586 \subsection{Consider contributing your improved lexical data}
7587
7588 If you have successfully added general-purpose lexical data to any of
7589 the Apertium language pairs, please consider contributing it to the
7590 project so that we can offer a better toolbox to the community.  You
7591 can e-mail your data (in three XML files, one for each monolingual
7592 dictionary and another one for the bilingual dictionary) to the
7593 following addresses: \\
7594
7595 \begin{tabular}{ll}
7596 Spanish-Catalan data & Mireia Ginestí: \texttt{mginesti@dlsi.ua.es}\\
7597 Spanish-Portuguese data & Carme Armentano: \texttt{carmentano@dlsi.ua.es}\footnote{The group at the
7598 Universitat d'Alacant has also developed data for this language pair
7599 outside the present project.}\\
7600 Spanish-Galician data & Xavier Gómez-Guinovart: \texttt{xgg@uvigo.es}\\\\
7601
7602 \end{tabular}
7603
7604
7605 If you believe you are going to contribute more heavily to the
7606 project, you can join the development team through
\url{www.sourceforge.net}. If you do not have a SourceForge account, please
7608 create one; then write to Mikel L. Forcada (\texttt{mlf@ua.es}) or
7609 Sergio Ortiz (\texttt{sortiz@dlsi.ua.es}), or to Xavier Gómez
7610 Guinovart if you are interested in the Spanish-Galician language pair,
7611 explaining briefly your motivations and background to join the
7612 project.  The usual way to contribute is to use CVS; as a project
7613 member, you will be able to commit your changes to dictionaries
7614 directly.
7615
7616 The addition of simple lexical contributions will soon be made simpler
7617 by means of web forms in
7618 \url{http://xixona.dlsi.ua.es/prototype/webform/}, so that
7619 contributors do not have to deal directly with XML.
7620
7621
7622 You should be aware that the data you contribute to the project, once
7623 added, will be freely distributed under the current license (GNU
7624 General Public License or Creative Commons 2.5
7625 attribution-sharealike-noncommercial, as indicated). Make sure the
7626 data you contribute is not affected by any kind of license which may
7627 be incompatible with the licenses used in this project. No kind of
7628 agreement or contract is created between you and the developers. If
7629 you have any doubt, or you plan to make a massive contribution,
7630 contact Mikel L. Forcada.
7631
7632
7633 \section[Adding structural transfer rules]{Adding structural transfer
7634 (grammar) rules}
7635
The content of this section partially repeats information already
presented in the description of the structural transfer module
(Section \ref{ss:transfer}), although rules are described here in a
more general and practical way, as a first introduction for those
who are new to them.
7641
Structural transfer rules carry out transformations on the analysed
and disambiguated text, which are needed because of grammatical,
7644 syntactical and lexical divergences between the two languages involved
7645 (gender and number changes to ensure agreement in the target language,
7646 word reorderings, changes in prepositions, etc.). The rules detect
7647 patterns (sequences) of source text lexical forms and apply to them
7648 the corresponding transformations.  The module detects the patterns in
7649 a left-to-right, longest-match way; for example, the phrase \emph{the
7650 big cat} will be detected and processed by the rule for
7651 \emph{determiner}--\emph{adjective}--\emph{noun} and not by the rule
7652 for \emph{determiner}--\emph{adjective}, since the first pattern is
longer. If two patterns have the same length, the rule that applies is
the one defined first.
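
As a sketch of the situation just described, the two competing rules
could have the following patterns (the syntax of rules is explained
later in this section). The category names \texttt{det}, \texttt{adj}
and \texttt{nom} are assumed to be defined elsewhere in the rules
file, and the actions are omitted:

\begin{small}
\begin{alltt}
<\textbf{rule}>  <!-- determiner-adjective-noun: matches "the big cat" -->
  <\textbf{pattern}>
    <\textbf{pattern-item} \textsl{n}="det"/>
    <\textbf{pattern-item} \textsl{n}="adj"/>
    <\textbf{pattern-item} \textsl{n}="nom"/>
  </\textbf{pattern}>
  <\textbf{action}>
    <!-- actions omitted -->
  </\textbf{action}>
</\textbf{rule}>
<\textbf{rule}>  <!-- determiner-adjective: would cover only "the big" -->
  <\textbf{pattern}>
    <\textbf{pattern-item} \textsl{n}="det"/>
    <\textbf{pattern-item} \textsl{n}="adj"/>
  </\textbf{pattern}>
  <\textbf{action}>
    <!-- actions omitted -->
  </\textbf{action}>
</\textbf{rule}>
\end{alltt}
\end{small}

Both patterns match at the position of \emph{the}, but the first one
covers three lexical forms instead of two, so it is the one that is
applied.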
7655
7656 The structural transfer module (generated from the structural transfer
7657 rules file) calls the lexical transfer module (generated from the
7658 bilingual dictionary) all through the process to determine the target
7659 language equivalents of the source language lexical forms.
7660
The structural transfer rules are contained in an XML file, one for
7662 each translation direction (for example, for the translation from
7663 Spanish to Catalan, the file is
7664 \texttt{apertium-es-ca.trules-es-ca.xml}). You need to edit this file
7665 if you want to add or change transfer rules.
7666
7667 Rules have a \textbf{pattern} and an \textbf{action} part. The pattern
7668 specifies which sequences of lexical forms have to be detected and
7669 processed. The action describes the verifications and transformations
7670 that need to be done on its constituents. Usual transformation
7671 operations (such as gender and number agreement) are defined inside a
7672 macroinstruction which is called inside the rule.  At the end of the
7673 action part of the rule, the resulting lexical forms in the target
7674 language are sent out so that they are processed by the next modules
7675 in the translation system.
7676
7677 A transfer rules file contains four sections with definitions of
7678 elements used in the rules, and a fifth section where the actual rules
7679 are defined. The sections are the following:
7680
7681 \begin{itemize}
7682
7683 \item \texttt{<section-def-cats>}: This section contains the
7684   definition of the categories which are to be used in the rule
7685   patterns (that is, the type of lexical forms that will be detected
7686   by a certain rule). For the rule presented below, the categories
7687   \texttt{det} and \texttt{nom} (determiner and noun) need to be
7688   defined here. Categories are defined specifying the grammatical
7689   symbols that the lexical forms have. An asterisk indicates that one
7690   or more grammatical symbols follow the ones specified. The following
7691   is the definition of the category \texttt{det}, which groups
7692   determiners and predeterminers\footnote{such as in Spanish
7693   \emph{todo}, \emph{toda}, \emph{todos}, \emph{todas}} in the same
7694   category since they play the same role for transfer purposes:
7695
7696 \begin{small}
7697 \begin{alltt}
7698 <\textbf{def-cat} \textsl{n}="det">
7699     <\textbf{cat-item} \textsl{tags}="det.*"/>
7700     <\textbf{cat-item} \textsl{tags}="predet.*"/>
7701 </\textbf{def-cat}>
7702 \end{alltt}
7703 \end{small}
7704
7705 It is also possible to define as a category a certain lemma, like the
7706 following for the preposition \texttt{en}:
7707
7708 \begin{small}
7709 \begin{alltt}
7710 <\textbf{def-cat} \textsl{n}="en">
7711     <\textbf{cat-item} \textsl{lemma}="en" \textsl{tags}="pr"/>
7712 </\textbf{def-cat}>
7713 \end{alltt}
7714 \end{small}
7715
7716
7717 \item \texttt{<section-def-attrs>}: This section contains the
7718 definition of the attributes that will be used inside of the rules, in
7719 the action part. You need attributes for all the categories defined in
7720 the previous section, if they are to be used in the action part of the
7721 rule (to make verifications on them or to send them out at the end of
7722 the rule), as well as for other attributes needed in the rule (such as
gender or number). Attributes have to be defined using their
corresponding grammatical symbols and cannot contain asterisks; each
attribute name must be unique. The following are the definitions for the attributes
7726 \texttt{a\_det} (for determiners) and \texttt{gen} (gender):
7727
7728 \begin{small}
7729 \begin{alltt}
7730 <\textbf{def-attr} \textsl{n}="a_det">
7731     <\textbf{attr-item} \textsl{tags}="det.def"/>
7732     <\textbf{attr-item} \textsl{tags}="det.ind"/>
7733     <\textbf{attr-item} \textsl{tags}="det.dem"/>
7734     <\textbf{attr-item} \textsl{tags}="det.pos"/>
7735     <\textbf{attr-item} \textsl{tags}="predet"/>
7736 </\textbf{def-attr}>
7737
7738 <\textbf{def-attr} \textsl{n}="gen">
7739     <\textbf{attr-item} \textsl{tags}="m"/>
7740     <\textbf{attr-item} \textsl{tags}="f"/>
7741     <\textbf{attr-item} \textsl{tags}="mf"/>
7742     <\textbf{attr-item} \textsl{tags}="nt"/>
7743     <\textbf{attr-item} \textsl{tags}="GD"/>
7744 </\textbf{def-attr}>
7745
7746 \end{alltt}
7747 \end{small}
7748
7749 \item \texttt{<section-def-vars>}: This section contains the
7750 definition of the variables used in the rules.
7751
7752 \begin{small}
7753 \begin{alltt}
7754   <\textbf{def-var} \textsl{n}="interrogativa"/>
7755 \end{alltt}
7756 \end{small}
7757
7758 \item \texttt{<section-def-macros>}: Here the macroinstructions are
7759 defined, which contain sequences of code that are frequently used in
7760 the rules; this way, linguists do not need to write the same actions
7761 repeatedly. There are, for example, macroinstructions for gender and
7762 number agreement operations.
7763
\item \texttt{<section-rules>}: This is the section where the
structural transfer rules are written.
7766
7767 \end{itemize}
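
The rule presented below also uses the category \texttt{nom}, and
agreement operations need an attribute for number in addition to
\texttt{gen}. Their definitions are not reproduced above; a minimal
sketch could look as follows. The attribute name \texttt{nbr} and the
exact lists of symbols are assumptions made for illustration; check
the definitions in the rules file of your language pair:

\begin{small}
\begin{alltt}
<\textbf{def-cat} \textsl{n}="nom">
    <\textbf{cat-item} \textsl{tags}="n.*"/>
</\textbf{def-cat}>

<\textbf{def-attr} \textsl{n}="nbr">
    <\textbf{attr-item} \textsl{tags}="sg"/>
    <\textbf{attr-item} \textsl{tags}="pl"/>
    <\textbf{attr-item} \textsl{tags}="sp"/>
    <\textbf{attr-item} \textsl{tags}="ND"/>
</\textbf{def-attr}>
\end{alltt}
\end{small}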
7768
7769 The following is an example of a rule which detects the sequence
7770 \emph{determiner--noun}:
7771
7772 \begin{small}
7773 \begin{alltt}
<\textbf{rule}>
  <\textbf{pattern}>
    <\textbf{pattern-item} \textsl{n}="det"/>
    <\textbf{pattern-item} \textsl{n}="nom"/>
  </\textbf{pattern}>
  <\textbf{action}>
    <\textbf{call-macro} \textsl{n}="f_concord2">
      <\textbf{with-param} \textsl{pos}="2"/>
      <\textbf{with-param} \textsl{pos}="1"/>
    </\textbf{call-macro}>
    <\textbf{out}>
      <\textbf{lu}>
        <\textbf{clip} \textsl{pos}="1" \textsl{side}="tl" \textsl{part}="whole"/>
      </\textbf{lu}>
      <\textbf{b} \textsl{pos}="1"/>
      <\textbf{lu}>
        <\textbf{clip} \textsl{pos}="2" \textsl{side}="tl" \textsl{part}="whole"/>
      </\textbf{lu}>
    </\textbf{out}>
  </\textbf{action}>
</\textbf{rule}>
7795 \end{alltt}
7796 \end{small}
7797
7798 Part of the action performed on this pattern is specified inside the
7799 macroinstruction \texttt{f\_concord2}, which is defined in the
7800 \texttt{<section-def-macros>}. It performs gender and number agreement
7801 operations: if there is a gender or number change between the source
7802 language and the target language (in the noun), the determiner changes
7803 its gender or number accordingly; furthermore, if gender or number are
undetermined (\texttt{GD} or \texttt{ND}\footnote{See pages
\pageref{pg:GD} and \pageref{GDND}.}), the noun receives the correct
7806 gender or number values from the preceding determiner. In the Apertium
7807 es--ca, es--gl and es--pt systems, there are agreement
7808 macroinstructions defined for one, two, three or four lexical units
7809 (\texttt{f\_concord1}, \texttt{f\_concord2}, \texttt{f\_concord3},
7810 \texttt{f\_concord4}). When calling the macroinstructions in a rule,
7811 it must be specified which is the main lexical unit (the one which
7812 most heavily determines the gender or number of the other lexical
7813 units) and which other lexical units of the pattern have to be
7814 included in the agreement operations, in order of importance. This is
done with the \texttt{<with-param pos=""/>} element. In the rule shown
above, the main lexical unit is the noun (position \texttt{2} in the pattern)
and the second one is the determiner (position \texttt{1} in the pattern).
7818
7819 After the pertinent actions, the resulting lexical forms are sent out,
7820 inside the \texttt{<out>} element. Each lexical unit is defined with a
7821 \texttt{<clip>}. Its attributes mean the following:
7822
7823 \begin{itemize}
7824
7825 \item [-]\texttt{pos}: refers to the position of the lexical form in
7826 the pattern; \texttt{1} is the first lexical form (the determiner) and
7827 \texttt{2} the second one (the noun).
7828
\item [-]\texttt{side}: indicates whether the lexical form is in the source
language (\texttt{sl}) or in the target language (\texttt{tl}).  Of
course, words are always sent out in the target language; source
language lexical forms may be needed inside a rule, when testing
their attributes or characteristics.
7834
7835 \item [-]\texttt{part}: indicates which part of the lexical form is
7836 referred to in the \texttt{clip}. You can use some predefined values:
7837
7838 \begin{itemize}
7839
7840 \item [-]\texttt{whole}: the whole lexical form (lemma and grammatical
7841 symbols). Used only when sending out the lexical unit (inside an
7842 \texttt{<out>} element).
7843
7844 \item [-]\texttt{lem}: the lemma of the lexical unit
7845
7846 \item [-]\texttt{lemh}: the head of the lemma of a multiword with
7847 inner inflection (see Section \ref{multi} in this chapter, or
7848 Section~\ref{ss:multipalabras} if you wish a more detailed
7849 description)
7850
7851 \item [-]\texttt{lemq}: the queue of a lemma of a multiword with inner
7852 inflection
7853
7854
7855 \end{itemize}
7856
7857 Apart from these predefined values, you can use any of the attributes
7858 defined in \texttt{<section-def-attrs>} (for example \texttt{gen} or
7859 \texttt{a\_det}).
7860
7861 The values \texttt{lemh} and \texttt{lemq} are used when sending out
7862 multiwords with inner inflection in order to place the head and the
7863 queue of the lemma in the right position, since the previous module
7864 moved the queue just after the lemma head for various reasons. In
7865 practice, in our system, this means that you must use these values
7866 instead of \texttt{whole} when sending out verbs. This is because, in
7867 our dictionaries, multiwords with inner inflection are always verbs
7868 \nota{NEEDS UPDATING}and, if you use the value \texttt{whole} when
7869 sending them out, the multiword would not be well formed (the head and
7870 the queue of the lemma would not have the correct position and the
7871 multiword could not be generated by the generator).
7872
7873 \end{itemize}
7874
7875
Therefore, a rule that has a verb in its pattern must send out the lexical
forms as in the following two examples:
7878
7879 \label{regla_verbo1}
7880 \begin{small}
7881 \begin{alltt}
7882 <\textbf{rule}>
7883   <\textbf{pattern}>
7884     <\textbf{pattern-item} \textsl{n}="verb"/>
7885   <\textbf{/pattern}>
7886   <\textbf{action}>
7887     <\textbf{out}>
7888       <\textbf{lu}>
7889         <\textbf{clip} \textsl{pos}="1" \textsl{side}="tl" \textsl{part}="lemh"/>
7890         <\textbf{clip} \textsl{pos}="1" \textsl{side}="tl" \textsl{part}="a_verb"/>
7891         <\textbf{clip} \textsl{pos}="1" \textsl{side}="tl" \textsl{part}="temps"/>
7892         <\textbf{clip} \textsl{pos}="1" \textsl{side}="tl" \textsl{part}="persona"/>
7893         <\textbf{clip} \textsl{pos}="1" \textsl{side}="tl" \textsl{part}="gen"/>
7894         <\textbf{clip} \textsl{pos}="1" \textsl{side}="tl" \textsl{part}="nbr"/>
7895         <\textbf{clip} \textsl{pos}="1" \textsl{side}="tl" \textsl{part}="lemq"/>
7896       </\textbf{lu}>
7897     </\textbf{out}>
7898   </\textbf{action}>
7899 </\textbf{rule}>
7900 \end{alltt}
7901 \end{small}
7902
7903
7904 \label{regla_verbo2}
7905 \begin{small}
7906 \begin{alltt}
7907 <\textbf{rule}>
7908   <\textbf{pattern}>
7909     <\textbf{pattern-item} \textsl{n}="verb"/>
7910     <\textbf{pattern-item} \textsl{n}="prnenc"/>
7911   <\textbf{/pattern}>
7912   <\textbf{action}>
7913     <\textbf{out}>
7914       <\textbf{mlu}>
7915         <\textbf{lu}>
7916           <\textbf{clip} \textsl{pos}="1" \textsl{side}="tl" \textsl{part}="lemh"/>
7917           <\textbf{clip} \textsl{pos}="1" \textsl{side}="tl" \textsl{part}="a_verb"/>
7918           <\textbf{clip} \textsl{pos}="1" \textsl{side}="tl" \textsl{part}="temps"/>
7919           <\textbf{clip} \textsl{pos}="1" \textsl{side}="tl" \textsl{part}="persona"/>
7920           <\textbf{clip} \textsl{pos}="1" \textsl{side}="tl" \textsl{part}="nbr"/>
7921         </\textbf{lu}>
7922         <\textbf{lu}>
7923           <\textbf{clip} \textsl{pos}="2" \textsl{side}="tl" \textsl{part}="lem"/>
7924           <\textbf{clip} \textsl{pos}="2" \textsl{side}="tl" \textsl{part}="a_prnenc"/>
7925           <\textbf{clip} \textsl{pos}="2" \textsl{side}="tl" \textsl{part}="persona"/>
7926           <\textbf{clip} \textsl{pos}="2" \textsl{side}="tl" \textsl{part}="gen"/>
7927           <\textbf{clip} \textsl{pos}="2" \textsl{side}="tl" \textsl{part}="nbr"/>
7928           <\textbf{clip} \textsl{pos}="1" \textsl{side}="tl" \textsl{part}="lemq"/>
7929         </\textbf{lu}>
7930       </\textbf{mlu}>
7931     </\textbf{out}>
7932   </\textbf{action}>
7933 </\textbf{rule}>
7934 \end{alltt}
7935 \end{small}
7936
7937
7938 The first rule detects a verb and places the queue in the correct
7939 place, after all the grammatical symbols. The lexical unit is sent
7940 specifying the attributes separately: lemma head, lexical category
7941 (verb), tense, person, gender (for the participles), number and lemma
7942 queue.
7943
7944 The second rule detects a verb followed by an enclitic pronoun and
7945 sends the two lexical forms specifying also the attributes separately;
7946 the first lexical unit consists of: lemma head, lexical category
7947 (verb), tense, person and number; the second lexical unit consists of:
7948 lemma, lexical category (enclitic pronoun), person, gender, number and
7949 lemma queue (of the first lexical form). This way, the queue of the
7950 lemma is placed after the enclitic pronoun. The two lexical units
7951 (verb and enclitic pronoun) are sent inside a \texttt{<mlu>} element,
7952 since they have to reach the morphological generator as a multilexical
7953 unit (multiword).
7954
7955
7956 Taking into account what we have explained here, if you want to
7957 \textbf{add a new transfer rule} you have to follow these steps:
7958
7959 \begin{enumerate}
7960
\item Specify which pattern you want to detect. Bear in mind that
each word is processed by only one rule, and that rules are applied
left to right, choosing the longest match.  For example, imagine
that your transfer rules file contains only two rules, one for the
pattern \emph{determiner--noun} and one for the pattern
\emph{noun--adjective}.  The Spanish phrase \emph{el valle verde}
(``the green valley'') would be detected and processed by the first one,
not by the second: once \emph{el valle} has been matched, \emph{valle}
is no longer available to start a \emph{noun--adjective} match. You
will need to add a rule for the pattern
\emph{determiner--noun--adjective} if you want the three
lexical units to be processed in the same pattern.
7971
7972 \item Describe the operations you want to perform on the pattern. In
7973 the Apertium \texttt{es-ca}, \texttt{es-gl} and \texttt{es-pt}
7974 systems, simple agreement operations (gender and number agreement) are
7975 easy to perform in a rule by means of a macroinstruction. To perform
7976 other operations, you will need to use more complicated elements; for
7977 a more detailed description of the language used to create rules,
refer to Section~\ref{formatotransfer}.
7979
7980 \item Send the lexical units of the pattern in the target language
7981 inside an \texttt{<out>} element. Each lexical unit must be included
7982 in a \texttt{<lu>} element. If two or more lexical units must be
7983 generated as a multilexical unit (only for enclitic pronouns in the
present language pairs), they must be grouped inside a \texttt{<mlu>}
7985 element.
7986
7987 All the words that are detected by a rule (that are part of a pattern)
7988 must be sent out at the end of the rule so that the next module (the
7989 generator) receives them. If a lexical unit is detected by a pattern
7990 and is not included in the \texttt{<out>} element, it will not be
7991 generated.
7992
7993
7994 \end{enumerate}
7995
7996
7997 \section[Adding data for the part-of-speech tagger]{Adding data for
7998 the lexical categorial disambiguator (part-of-speech tagger)}
7999
8000 The lexical categorial disambiguator takes the linguistic information
8001 needed to disambiguate a text basically from two sources: a tagset
8002 definition file and corpora. The tagset definition file is contained
8003 in the linguistic data directory and its name has the structure
8004 \texttt{apertium-PAIR.LANG.tsx}, whereas corpora information is
8005 contained in the \texttt{LANG-tagger-data} directory included in the
8006 previous directory.
8007
8008 The \emph{tagset definition file} contains the definition of the
8009 coarse tags (or categories) used by the tagger when being trained and
8010 when disambiguating a text, as well as tag co-occurrence restrictions
8011 that help obtain better tag probabilities. In Section \ref{ss:tagger}
8012 you can find a detailed description of its characteristics.
8013
8014 The \emph{corpora} that need to be in the \texttt{LANG-tagger-data}
8015 directory are different depending on whether the tagger is trained in
8016 a supervised way (with manually disambiguated text) or unsupervised
8017 (without manually disambiguated text):
8018
8019 \begin{itemize}
8020
\item to train the tagger in a supervised way you need the files
(examples from \texttt{es-tagger-data}): \texttt{es.tagged.txt},
\texttt{es.untagged}, \texttt{es.tagged}, \texttt{es.dic};

\item to train the tagger in an unsupervised way you need the files
(examples from \texttt{es-tagger-data}): \texttt{es.crp.txt}, \texttt{es.crp},
\texttt{es.dic}.
8028
8029 \end{itemize}
8030
8031 These files have the following characteristics:
8032
8033 \begin{itemize}
8034
8035 \item \texttt{es.tagged.txt}: A Spanish corpus in plain text format.
8036 \item \texttt{es.untagged}: The corpus \texttt{es.tagged.txt}
8037 morphologically analysed, which means, processed by the de-formatter
8038 and the morphological analyser (automatically generated corpus).
8039 \item \texttt{es.tagged}: The preceding corpus manually disambiguated.
8040 \item \texttt{es.crp.txt}: A large corpus (hundreds of thousands of
8041 words) used when training the tagger in an unsupervised way with
8042 Baum-Welch reestimation.
8043 \item \texttt{es.crp}: The preceding corpus processed consecutively by
8044 the de-formatter and the morphological analyser (automatically
8045 generated corpus).
8046 \item \texttt{es.dic}: File created from the Spanish monolingual
8047 dictionary \texttt{*.es.dix}, by means of the \texttt{lt-expand} and
8048 \texttt{aper\-tium\--fil\-ter\--am\-biguity} tools, which expand the
8049 dictionary and filter the ambiguity classes, so that the file contains
8050 all the forms identified as different ambiguity classes by the tagger
8051 defined with \texttt{*.es.tsx}; that is, which lexical categories can
8052 be homographs (automatically generated corpus).
8053 \end{itemize}
8054
When you download Apertium from SourceForge
(\url{http://apertium.sourceforge.net/}), if the tagger has been
trained in a supervised way, you will probably get the files
8058 needed for this kind of training, \texttt{es.tagged} and
8059 \texttt{es.tagged.txt} (for Spanish). The other required files are
8060 automatically generated when running the training.  If the tagger has
8061 been trained in an unsupervised way, you will not get any corpus in
8062 the download since the files required for this kind of training are
8063 huge. If you wish to train the tagger with this method, you will need
8064 to collect a large corpus and name it \texttt{es.crp.txt}. The other
8065 required files are automatically generated when running the training.
8066
In any case, Apertium comes with all the data the tagger needs to
perform well, so you do not need to train the tagger in
order to use Apertium. Retraining may be required if
you have made really extensive changes to the dictionaries or have
modified the tagset definition file.
8072
8073 Therefore, the tagger data can be modified in two ways:
8074
8075 \begin{enumerate}
8076
8077 \item Change the tagset definition file. You can add, change or delete
8078 the coarse tags used by the tagger, if you think that a new category
8079 could be useful for the disambiguation or that a certain category
8080 should be modified to obtain better results. You can also add
8081 restrictions (for example, you can forbid the sequence
8082 determiner--determiner if this is an impossible combination in a given
8083 language and can help in the disambiguation of certain homograph
8084 words).
8085
8086 \item Modify the corpora used to train the tagger.  You can modify the
8087 manually disambiguated text (\texttt{es.tagged} for Spanish) if you
8088 think that certain tags have been wrongly selected. You can also add
8089 sentences to this text (and to \texttt{es.tagged.txt}, used to automatically
8090 generate the corpus \texttt{es.untagged}) in order to
8091 add information to the tagger, since it is possible that certain
8092 combinations are incorrectly disambiguated because the tagger has not
8093 found them in the training corpora.
8094
8095
8096 \end{enumerate}
8097
8098 There are two commands to run the training:
8099
8100 \begin{itemize}
8101
8102 \item to train in a supervised way, type, in the directory containing
8103 the linguistic data (example for \emph{es}--\emph{ca}): \texttt{make
8104 -f es-ca-supervised.make}
8105
8106
8107 \item to train in an unsupervised way, type, in the directory
8108 containing the linguistic data (example for \emph{es}--\emph{ca}):
8109 \texttt{make -f es-ca-unsupervised.make}
8110
8111
8112 \end{itemize}
8113
In both cases, the required files will be generated automatically.
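
Before launching either \texttt{make} command, it can help to check
that the corpora you must supply yourself are in place. The following
shell sketch is only an illustration: the function name
\texttt{missing\_training\_files} is hypothetical (it is not part of
Apertium), and it assumes, following the description above, that
supervised training needs \texttt{LANG.tagged.txt} and
\texttt{LANG.tagged}, that unsupervised training needs
\texttt{LANG.crp.txt}, and that the remaining files are generated by
the training itself.

```shell
#!/bin/sh
# Hypothetical helper (not part of Apertium): list the hand-supplied
# corpus files that are missing for a given training mode.
# Usage: missing_training_files MODE LANG [DATA_DIR]
missing_training_files() {
    mode=$1
    lang=$2
    data_dir=${3:-$lang-tagger-data}   # default: LANG-tagger-data
    case $mode in
        supervised)   needed="$lang.tagged.txt $lang.tagged" ;;
        unsupervised) needed="$lang.crp.txt" ;;
        *) echo "unknown mode: $mode" >&2; return 2 ;;
    esac
    status=0
    for f in $needed; do
        if [ ! -f "$data_dir/$f" ]; then
            printf '%s\n' "$data_dir/$f"   # report each missing file
            status=1
        fi
    done
    return $status
}

# Example: train only if nothing is missing.
# missing_training_files supervised es && make -f es-ca-supervised.make
```

The function prints one missing file per line and returns a non-zero
status if any file is absent, so it can guard the training command.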
8115
8116
8117 \section{Detecting errors}
8118 \label{errores}
8119
8120
8121 It is easy to make errors when adding new words or transfer rules to
8122 the Apertium system.
8123
8124 On the one hand, it is possible that, when compiling the new files,
8125 the system displays an error message. In this case, this is a formal
8126 error (a missing XML tag, a tag that is not allowed in a certain
8127 context, etc.).  You just have to go to the line number indicated by
8128 the error message, correct the error and compile again. On the other
8129 hand, there are other types of errors not detected when compiling, but
8130 which can make the system mistranslate a word or give an
8131 incomprehensible text string.  These are linguistic errors, which can
8132 be detected and corrected with the tips given in this chapter. The
following information is for Linux users, since Apertium currently
works only on this operating system.\footnote{Experimental packages
for Windows, with fixed linguistic data (non-modifiable binary files),
are available at \url{http://apertium.org}.}
8137
8138 \subsection{Adjusting error symbols}
8139 \label{subsec:marcaserror}
8140
When the system encounters a problem translating a word of a source
language text, in the default mode it outputs the problematic
word together with a symbol indicating that an error has occurred.
The meaning of the different symbols is the following:
8145
8146
8147
8148 \begin{itemize}
8149
8150
\item '\verb!@!': The problem is in the lexical transfer module, which
cannot translate the lexical form (the bilingual dictionary does not
contain it).

\item '\verb!#!': The problem has occurred in the generator, which
cannot generate the surface form from the input lexical form (the
morphological dictionary does not contain it in the generation
direction).
8159
8160 \item '\verb!/!': This symbol separates two or more surface forms
8161 delivered by the generator. The problem, therefore, is in the target
8162 language monolingual dictionary, which has, in the generation
8163 direction, two surface forms for a single lexical form, when it should
8164 have only one.
8165
8166
8167 \end{itemize}
8168
8169
8170 The generation module has three modes, which enable us to decide how
8171 errors will be displayed in the final output.  The three possible
8172 parameters are:
8173
8174 \begin{itemize}
8175
\item \texttt{-n}: neither the error symbols nor the unknown-word
symbol are displayed, nor are any grammatical symbols.

\item \texttt{-g}: error symbols and the unknown-word symbol are
displayed (default mode).

\item \texttt{-d}: error symbols and the unknown-word symbol are
displayed, together with the grammatical symbols of the lexical forms
producing the error.
8185
8186
8187 \end{itemize}
8188
8189
The preferable mode depends on the type of user and on the purpose of the
translation. The first option is the most suitable when the user does not
want extraneous symbols to interfere with reading the
translation. The second option is useful when the user wants the
8194 system to show where there has been a problem in the translation
8195 (errors or unknown words) in order to be able to post-edit it
8196 easily. The third option is ideal for linguistic developers of
8197 Apertium, since it displays all the linguistic information of the
8198 forms that produced an error.
8199
8200 Taking advantage of the error symbols output by the system, it is
8201 possible to carry out a thorough test of the dictionaries of a certain
8202 language pair. This will enable you to detect and correct all its
8203 errors. To learn how to do it, see Section \ref{integridad}.
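
As a rough illustration of how the error symbols can be exploited, the
following shell sketch counts how many output lines contain each
symbol. The function name \texttt{count\_error\_marks} is hypothetical,
and the sketch assumes that the translator output has been saved to a
file with one word per line, so that a \texttt{/} in the file is always
an error symbol and never part of ordinary text.

```shell
#!/bin/sh
# Hypothetical helper: for each error symbol, count the lines of FILE
# (assumed to hold one translated word per line) that contain it.
count_error_marks() {
    for mark in '@' '#' '/'; do
        # grep -c prints the number of matching lines (0 if none);
        # -F treats the symbol as a fixed string, not a pattern.
        printf '%s %s\n' "$mark" "$(grep -c -F -e "$mark" "$1")"
    done
}
```

A line containing two different symbols is counted once for each of
them.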
8204
8205 \subsection{Output of the different Apertium modules}
8206
8207 Sometimes it is difficult to find the origin of an error. In such
cases, it is useful to see the output of each of the modules.  Since all
the data processed by the system, from the original text to the
translated text, circulate between the eight modules of the system in
text format, it is possible to stop the text stream at any point to
see the input or the output of a given module.
8213
8214 Using a pipeline structure and the \texttt{echo} or \texttt{cat}
8215 commands, you can send a text through one or more modules to analyse
8216 their output and detect the origin of the error. We describe next how
8217 to do it. You have to move to the directory where the linguistic data
8218 are saved and type the described commands.
8219
8220
8221
8222 \subsubsection{The morphological analyser output}
8223
To see how the translator analyses a word, type the following
in the terminal (example for the Catalan word \emph{sabates}):
8226
8227
8228 \begin{small}
8229 \begin{alltt}
8230 echo "sabates" | apertium-destxt | lt-proc ca-es.automorf.bin
8231 \end{alltt}
8232 \end{small}
8233
8234 You can replace \texttt{ca-es} with the translation direction you want
8235 to test.
8236
8237 The output in Apertium should be:
8238 \begin{small}
8239 \begin{alltt}
8240 ^sabates/sabata<n><f><pl>\$^./.<sent>\$[][]
8241 \end{alltt}
8242 \end{small}
8243
8244 The string structure is
8245 \verb!^!\texttt{word/lemma<}\textsl{morphological
8246 analysis}\texttt{>}\verb!$!. The \texttt{<sent>} tag is the analysis
8247 of the full stop, as every sentence end is represented as a full stop
8248 by the system, whether or not explicitly indicated in the sentence.
8249
8250 The analysis of an unknown word is (ignoring the full stop info):
8251
8252 \begin{small}
8253 \begin{alltt}
8254 ^genoma/*genoma\$
8255 \end{alltt}
8256 \end{small}
8257
8258 \noindent and the analysis of an ambiguous word:
8259
8260 \begin{small}
8261 \begin{alltt}
8262 ^casa/casa<n><f><sg>/casar<vblex><pri><p3><sg>/casar<vblex><imp><p2><sg>\$
8263 \end{alltt}
8264 \end{small}
8265
8266 Each lexical form (lemma plus morphological analysis) is presented as
8267 a possible analysis of the word \emph{casa}.
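
To make the structure of such a string easier to inspect, the analyses
can be printed one per line. The sketch below uses only standard
\texttt{sed} and \texttt{tr}; the function name
\texttt{split\_analyses} is hypothetical, and the sketch assumes one
analysed unit per input line and no \texttt{/} inside a form.

```shell
#!/bin/sh
# Hypothetical helper: split one analysed unit, e.g.
#   ^casa/casa<n><f><sg>/casar<vblex><pri><p3><sg>$
# into lines: the surface form first, then one lexical form per line.
# Assumes a single unit per input line and no "/" inside a form.
split_analyses() {
    # Strip the leading ^ and the trailing $, then break at each "/".
    sed -e 's/^\^//' -e 's/\$$//' | tr '/' '\n'
}
```

For an unknown word, the second line is the word itself preceded by an
asterisk.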
8268
8269 \subsubsection{The tagger output}
8270
8271
8272 To know the output of the tagger for a source language text, type the
8273 following in the terminal (example for the Catalan-Spanish direction):
8274
8275 \begin{small}
8276 \begin{alltt}
8277 echo "sabates" | apertium-destxt | lt-proc ca-es.automorf.bin \\|apertium-tagger -g ca-es.prob
8278 \end{alltt}
8279 \end{small}
8280
8281 The output will be:
8282 \begin{small}
8283 \begin{alltt}
^sabata<n><f><pl>\$^.<sent>\$[][]
8285 \end{alltt}
8286 \end{small}
8287
8288 The output for an ambiguous word will be like the one above, since the
8289 tagger chooses one lexical form among all the
8290 possibilities. Therefore, the output for \emph{casa} in Catalan will
8291 be, for example (depending on the context):
8292
8293 \begin{small}
8294 \begin{alltt}
8295 ^casa<n><f><sg>\$^.<sent>\$[][]
8296 \end{alltt}
8297 \end{small}
8298
8299 \subsubsection{The \texttt{pretransfer} output}
8300
This module applies some changes to multiwords (it moves the lemma queue
of a multiword with inner inflection to just after the lemma head). To
see its output, type:
8304
8305 \begin{small}
8306 \begin{alltt}
8307 echo "sabates" | apertium-destxt | lt-proc ca-es.automorf.bin \\|apertium-tagger -g ca-es.prob | apertium-pretransfer
8308 \end{alltt}
8309 \end{small}
8310
8311 Since \emph{sabates} is not a multiword, this module does not alter
8312 its input.
8313
8314 \subsubsection{The structural and lexical transfer output}
8315
8316 To know how a word, phrase or sentence is translated into the target
8317 language and processed by structural transfer rules, type the
8318 following in the terminal:
8319 \begin{small}
8320 \begin{alltt}
8321 echo "sabates" | apertium-destxt | lt-proc ca-es.automorf.bin \\|apertium-tagger -g ca-es.prob | apertium-pretransfer \\| ./ca-es.transfer ca-es.autobil.bin
8322 \end{alltt}
8323 \end{small}
8324
8325 The output for this word will be:
8326
8327 \begin{small}
8328 \begin{alltt}
8329 ^zapato<n><m><pl>\$^.<sent>\$[][]
8330 \end{alltt}
8331 \end{small}
8332
8333
8334 Analysing how a word or phrase is output by this module can help you
8335 detect errors in the bilingual dictionary or in the structural
8336 transfer rules. Typical bilingual dictionary errors are: two
8337 equivalents for the same source language lexical form, or wrong
8338 assignment of grammatical symbols. Errors due to structural transfer
8339 rules vary a lot depending on the actions performed by the rules.
8340
8341
8342 \subsubsection{The morphological generator output}
8343
8344 To know how a word is generated by the system, type the following in
8345 the terminal:
8346
8347 \begin{small}
8348 \begin{alltt}
echo "sabates" | apertium-destxt | lt-proc ca-es.automorf.bin \\|apertium-tagger -g ca-es.prob | apertium-pretransfer \\| ./ca-es.transfer ca-es.autobil.bin | lt-proc -g ca-es.autogen.bin
8350 \end{alltt}
8351 \end{small}
8352
8353 With this command you can detect generation errors due to an incorrect
8354 entry in the target language monolingual dictionary or to a divergence
8355 between the output of the bilingual dictionary (the output of the
8356 previous module) and the entry in the monolingual dictionary.
8357
8358 The correct output for the input \emph{sabates} would be:
8359
8360 \begin{small}
8361 \begin{alltt}
8362 zapatos.[][]
8363 \end{alltt}
8364 \end{small}
8365
At this step there are no grammatical symbols, and the word appears
inflected.
8368
8369 \subsubsection{The post-generator output}
8370
8371 It is not very usual to have errors due to the post-generator, because
8372 of its generally small size and the fact that it is seldom changed
8373 after adding usual combinations, but you can also test how a source
8374 language text comes out of this module, by typing:
8375
8376 \begin{small}
8377 \begin{alltt}
echo "sabates" | apertium-destxt | lt-proc ca-es.automorf.bin \\|apertium-tagger -g ca-es.prob | apertium-pretransfer \\| ./ca-es.transfer ca-es.autobil.bin | lt-proc -g ca-es.autogen.bin \\| lt-proc -p ca-es.autopgen.bin
8379 \end{alltt}
8380 \end{small}
8381
8382 \subsubsection{The Apertium output}
8383
8384 You can put all the modules of the system in the pipeline structure
8385 and see how a source language text goes through all the modules and
8386 gets translated into the target language. You just have to add the
8387 re-formatter to the previous command:
8388
8389 \begin{small}
8390 \begin{alltt}
echo "sabates" | apertium-destxt | lt-proc ca-es.automorf.bin \\|apertium-tagger -g ca-es.prob | apertium-pretransfer \\| ./ca-es.transfer ca-es.autobil.bin | lt-proc -g ca-es.autogen.bin \\| lt-proc -p ca-es.autopgen.bin | apertium-retxt
8392 \end{alltt}
8393 \end{small}
8394
8395 This is the same as using the \texttt{apertium-translator} shell
8396 script provided by the Apertium package:
8397
8398 \begin{small}
8399 \begin{alltt}
8400 echo "sabates" | apertium-translator . ca-es
8401 \end{alltt}
8402 \end{small}
8403
8404 \noindent (The dot indicates the directory where the linguistic data
8405 are saved, in this case the current directory).
8406
8407 Of course, instead of typing all the presented commands every time you
8408 need to test a translation, you can create shell scripts for every
8409 action and use them to test the output of each module.
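
One possible way to organise such scripts is sketched below: a single
function builds the pipeline up to a named stage, using the commands
shown in the previous subsections. The function name
\texttt{pipeline\_upto} and the stage names are hypothetical labels
chosen for this sketch, and it assumes that all the data files in the
current directory share the translation direction as a prefix.

```shell
#!/bin/sh
# Hypothetical helper: print the command pipeline for direction DIR
# (e.g. ca-es) up to and including stage LAST. Stage names are labels
# invented for this sketch, not Apertium terminology.
# Usage: pipeline_upto DIR LAST
pipeline_upto() {
    dir=$1
    last=$2
    cmd="apertium-destxt"
    for stage in analyser tagger pretransfer transfer \
                 generator postgenerator reformatter; do
        case $stage in
            analyser)      cmd="$cmd | lt-proc $dir.automorf.bin" ;;
            tagger)        cmd="$cmd | apertium-tagger -g $dir.prob" ;;
            pretransfer)   cmd="$cmd | apertium-pretransfer" ;;
            transfer)      cmd="$cmd | ./$dir.transfer $dir.autobil.bin" ;;
            generator)     cmd="$cmd | lt-proc -g $dir.autogen.bin" ;;
            postgenerator) cmd="$cmd | lt-proc -p $dir.autopgen.bin" ;;
            reformatter)   cmd="$cmd | apertium-retxt" ;;
        esac
        if [ "$stage" = "$last" ]; then
            printf '%s\n' "$cmd"
            return 0
        fi
    done
    echo "unknown stage: $last" >&2
    return 1
}

# Example: show the tagger output for a word.
# echo "sabates" | sh -c "$(pipeline_upto ca-es tagger)"
```

Printing the pipeline instead of running it directly makes it easy to
check which commands will be executed before any text is sent through
them.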
8410
8411
8412
8413
8414 \subsection{Error examples}
8415
8416
8417 1) We can get the following kind of output in a translation:
8418
8419 \begin{small}
8420 \begin{alltt}
8421 \$ echo "nord" | apertium-translator . ca-es
8422 \$ #norte<n><m><sg>
8423 \end{alltt}
8424 \end{small}
8425
8426 This means that the word was correctly translated by the bilingual
8427 dictionary but that the system does not find it in the Spanish
8428 morphological dictionary to generate it. The problem can be in the
8429 morphological dictionary but can also be caused by an incorrect
8430 bilingual entry, in which the grammatical symbols that the translated
8431 word is assigned do not correspond with the grammatical symbols that
8432 this word has in the morphological dictionary.
8433
8434 2) The following \texttt{es-ca} bilingual entry does not take into
8435 account the gender change between \emph{adhesiu} (masculine) and
8436 \emph{pegatina} (feminine), causing the translator to give an error:
8437
8438 \begin{small}
8439 \begin{alltt}
8440 <\textbf{e}>
8441   <\textbf{p}>
8442     <\textbf{l}>pegatina<\textbf{s} \textsl{n}="n"/></\textbf{l}>
8443     <\textbf{r}>adhesiu<\textbf{s} \textsl{n}="n"/></\textbf{r}>
8444   </\textbf{p}>
8445 </\textbf{e}>
8446 \end{alltt}
8447 \end{small}
8448
8449 \begin{small}
8450 \begin{alltt}
8451 \$ echo "adhesiu" | apertium-translator . ca-es
8452 \$ #pegatina<n><m><sg>
8453 \end{alltt}
8454 \end{small}
8455
8456 The correct entry should be:
8457
8458 \begin{small}
8459 \begin{alltt}
8460 <\textbf{e}>
8461   <\textbf{p}>
8462     <\textbf{l}>pegatina<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="f"/></\textbf{l}>
8463     <\textbf{r}>adhesiu<\textbf{s} \textsl{n}="n"/><\textbf{s} \textsl{n}="m"/></\textbf{r}>
8464   </\textbf{p}>
8465 </\textbf{e}>
8466 \end{alltt}
8467 \end{small}
8468
3) The following error is given when the source language lexical form
cannot be found in the bilingual dictionary, either because there is no entry for this
8471 lemma or because the entry does not correspond with the grammatical
8472 symbols received from the analyser:
8473
8474
8475 \begin{small}
8476 \begin{alltt}
8477 \$ echo "illot" | apertium-translator . ca-es
8478 \$ @illot<n><m><sg>
8479 \end{alltt}
8480 \end{small}
8481
8482 4) When a source language lexical form has two correspondences in the
8483 bilingual dictionary, the translator output is like the following one:
8484
8485 \begin{small}
8486 \begin{alltt}
8487 \$ echo "llavor" | apertium-translator . ca-es
8488 \$ #pepita<n>/semilla<n><m><sg>
8489 \end{alltt}
8490 \end{small}
8491
8492 The solution is to put a direction restriction in one of the bilingual
8493 entries.
8494
8495
Some errors can be due to structural transfer rules. The way to solve
a problem whose origin is unknown is to test the output of the
different modules to detect where the problem arises.
8499
8500 \subsection{Testing the integrity of the
8501 dictionaries}\label{integridad}
8502
It is highly advisable to test the integrity of the dictionaries from
time to time, especially if you have changed them significantly, or if
you have changed the transfer rules, because some errors can be due to
their application.
8507
8508 The test is carried out in one translation direction. For this reason,
8509 for a given language pair, you will have to perform two tests, one in
8510 each direction.
8511
8512 The steps you have to follow to perform the test are:
8513
8514 \begin{itemize}
8515
8516 \item expand the source language monolingual dictionary, using the
8517 \texttt{lt-expand} tool, to obtain all the lexical forms (which are
8518 the forms that appear on the right of the colon in the output file);
8519
8520 \item send these lexical forms (except those that are only generation
8521 forms, which \texttt{lt-expand} will have marked with the symbol
8522 '\texttt{<}' ) through all the system modules from pretransfer to the
8523 generator;
8524
\item search the result for the lexical forms marked with the symbols
'\texttt{\#}', '\texttt{@}' or '\texttt{/}', which are the erroneous
forms (see Section~\ref{subsec:marcaserror}).
8528
8529
8530 \end{itemize}
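
The first two steps above can be sketched as a small shell filter. The
function name \texttt{lexical\_forms} is hypothetical, and the sketch
rests on several assumptions that should be checked against the actual
\texttt{lt-expand} output: that each line has the form
\textsl{surface}\texttt{:}\textsl{lexical}, that generation-only lines
use \texttt{:<:} as separator, that analysis-only lines use
\texttt{:>:}, and that surface forms contain no colon.

```shell
#!/bin/sh
# Hypothetical helper: turn lt-expand output (read from stdin) into
# lexical units ready for the pretransfer module.
# Assumptions: lines look like  surface:lexical , analysis-only lines
# use ":>:" and generation-only lines use ":<:" as the separator, and
# surface forms contain no colon.
lexical_forms() {
    # Drop generation-only lines, normalise the separator, then keep
    # only the lexical form and wrap it as ^lexical$.
    grep -v ':<:' |
        sed -e 's/:>:/:/' -e 's/^[^:]*:\(.*\)$/^\1$/'
}

# Possible use for the ca-es direction (file names are illustrative):
# lt-expand apertium-es-ca.ca.dix | lexical_forms |
#     apertium-pretransfer | ./ca-es.transfer ca-es.autobil.bin |
#     lt-proc -d ca-es.autogen.bin | grep '[#@/]'
```

The final \texttt{grep} in the commented example corresponds to the
third step: it keeps only the forms that carry one of the error
symbols.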
8531
8532
8533
8534
8535 \section{Generating a new Apertium system from modified data}
8536
8537 If you make changes to any of the linguistic data files of Apertium
8538 (dictionaries, transfer rules or tagger definition file), the changes
8539 will not be applied until you recompile the modules. To do this, type
8540 \texttt{make} in the directory where the linguistic data are saved so
8541 that the system generates the new binary files.
8542
If changes were made to the tagger definition file or to the corpora
used to train the tagger, you will also need to retrain the tagger: in
8545 the same linguistic data directory, you have to type (example for the
8546 Spanish tagger in the es-ca translator) \texttt{make -f
8547 es-ca-unsupervised.make} for unsupervised training or \texttt{make -f
8548 es-ca-supervised.make} for supervised training.
8549
After compilation, \texttt{apertium-translator} will use the
new data.
8552
8553 \newpage
8554
8555
8556 \chapter{Data insertion web forms}
8557
8558
8559
This chapter describes the dictionary maintenance system in Apertium
8561 2.  It is organized in two sections.  Section \ref{ss:formadmin} gives
8562 the necessary information to install and adjust the web application
8563 for word insertion.  Section \ref{ss:formus} describes how to use the
8564 tool to add linguistic data.
8565
8566
8567 \section{Introduction}
8568
8569 Adding lemmas to the dictionaries of the different languages in
8570 Apertium is a slow task if you do it by manually editing the XML
8571 dictionaries; for this reason web forms have been created, which make
8572 the word insertion task considerably easier and, furthermore, allow
8573 the users to do it remotely from any computer with Internet access.
8574
The tool consists of a set of forms written in \texttt{php} which can
be used from any web browser, either locally on the same
computer where the dictionaries are saved, or remotely.
8578
8579 \section{Installing and managing}
8580 \label{ss:formadmin}
8581
8582 \subsection{Installing the tool}
8583
The tool must be installed on a Unix machine running an Apache web
server with PHP support. If PHP is not installed, install it first
and then proceed to install the form tool.


To install the tool, download the package
\textit{`apertium-lexical-webform-0.9'} from the Apertium web page at
SourceForge (\url{http://apertium.sourceforge.net/}) and unpack it in
the directory where you want to keep the tool.
8594
8595
8596 \begin{alltt}
   # cd /path/to/the/forms
   # tar -xvzf /path/apertium-lexical-webform-0.9.tar.gz
8599 \end{alltt}
8600
Keep in mind that Apache only serves pages located under its
configured root directory. Therefore, the directory where you place
the forms must be a subdirectory of the root directory of the Apache
server.
8605
8606 Next, you have to edit the configuration file, which you can find in
8607 \textit{private/config.php}, and give the appropriate values to the
8608 configuration variables:
8609
8610 \begin{itemize}
\item \texttt{\$anmor}: full path to the morphological analyser
\texttt{lt-proc}.
\item \texttt{\$dicos\_path}: path to the directory where the final
dictionaries and the compiled binaries of each dictionary are
saved. This directory must contain a subdirectory for each dictionary
with which the form can work. The subdirectory name must have the
following structure: \texttt{paradigmes-ll-rr}, where \textit{ll} and
\textit{rr} are the initials of the language pair involved. Each
subdirectory must contain the final dictionaries used by the machine
translation system and the corresponding compiled binaries.  These
subdirectories can be replaced with symbolic links if the data are
located elsewhere.
\item \texttt{\$usuaris\_professionals}: a list of the professional
users of the system, who have permission to insert words in the form
dictionaries and to validate entries pending confirmation.

\item \texttt{\$mail}: e-mail address of the administrator of the
forms. When someone wants to register as a user, an e-mail is sent to
this address.
8630 \end{itemize}
8631
Once the parameters in this file have been configured, the forms
server is ready for use.
8634
8635
8636 \subsection{Directory structure}
8637
8638 All the files required by the application are structured as follows:
8639
8640 \begin{itemize}
8641 \item \texttt{/index.php:} displays the initial insertion form.  It
8642 has a section for each language pair, where the user inserts the SL
8643 lemma and the TL lemma and chooses the appropriate part of
8644 speech. After pressing the \textit{'Go on'} button, the next page is
8645 displayed, where the user has to select the appropriate inflection
8646 paradigms for the SL lemma and the TL lemma.
8647 \item \texttt{/dics:} directory that contains the dictionaries with
8648 the entries inserted from the forms. It contains the files with the
8649 entries from non-professional users (pending validation) and the
8650 dictionaries with the \texttt{XML} entries from professional users.
\item \texttt{/private:} most modules used in the forms are saved
here. It also contains the directories with the definition of
paradigms for all the languages of the forms; these directories have
the name \texttt{paradigmes-ll-rr}, where \textit{ll} and \textit{rr}
are the initials of a given language pair. The order chosen for the
two languages, first \textit{ll} and then \textit{rr}, depends on the
order defined for entries in the bilingual dictionary. This directory
also contains the files that carry out the whole processing of the
words being inserted.  These files are:
8660 \begin{itemize}
\item \texttt{resultado.php: } This PHP module is called when two
  words for any language pair are inserted from the module
  \textit{index.php}. It establishes the language pair involved
  (\textit{\$LR} and \textit{\$RL}) and the part of speech of the
  words being inserted (\textit{\$tipus}). It then includes the
  \textit{selecc.php} module, which is the next one called in the
  insertion process. If the \textit{tipus} (\textit{type}) of the word
  being inserted is a multiword unit (\textit{Multi Word Verb}),
  \textit{multip.php} is included and called instead of
  \textit{selecc.php}. \textit{Multi Word Verb} elements consist of a
  verb that can inflect followed by an invariable queue of one or more
  words (see Section \ref{ss:multipalabras} for a detailed
  description).
8674 \item \texttt{selecc.php: } This module is in charge of the selection
8675 of paradigms for the pair of words, the SL word and the TL word. It
8676 displays a list of paradigms to be chosen from, which depends on the
8677 part of speech of the entry being inserted. When a new paradigm is
8678 selected for a lemma, it displays some examples of inflected forms of
8679 the lemma according to the chosen paradigm. If the user accepts the
8680 chosen paradigms, the module calls \textit{insertarPro.php} or
8681 \textit{insertar.php} depending on whether the user is professional or
8682 non-professional respectively.
\item \texttt{multip.php: } It has the same function as the
\textit{selecc.php} module but for multiword units. It uses the same
variables and performs the same actions, but in the examples
displayed, the verb is inflected and the words of the queue are added
after it. It works analogously to the \textit{selecc.php} module,
which is described in detail in Section \ref{ss:fitxersphp}.
8690 \item \texttt{valida.php: } This module is called when a professional
8691   user wants to validate words that are in the queue of entries
8692   pending validation. It consults the file of words to be validated
8693   reading them one by one; it takes the data of the entry in turn
8694   (\textit{LRlem, RLlem, paradigmaLR, paradigmaRL, LR, RL}, etc.) and
8695   calls \textit{selecc.php} to continue with the insertion process of
8696   that specific entry.
\item \texttt{insertarPro.php: } This module is called when the
paradigms for the SL word and the TL word have already been selected
(which was done in \textit{selecc.php}), and displays what the
resulting \texttt{XML} entries will look like for the three
dictionaries (SL monolingual, bilingual and TL monolingual). From
this screen it is possible to modify the code directly, and finally to
accept the new entry or to cancel the operation.
\item \texttt{ins\_multip.php: } It has the same function as
\textit{insertarPro.php} but is designed for multiword entries; the
entry is therefore treated differently so that the inserted
\texttt{XML} code is correct.
\item \texttt{insertar.php: } This module is equivalent to
\textit{insertarPro.php} but for non-professional users. The actions
it performs are much simpler, since the module just adds the
lemmas and the paradigms selected by the non-professional user to the
file of words to be validated; they remain in this file until a
professional user validates them.
\item \texttt{verSemi.php: } This module displays the file of entries
inserted by non-professional users which are waiting for
validation. It is useful for professional users who, before starting
to validate words, want to see which words are in the queue waiting
for validation. It can be called from a link displayed in the form
generated by \textit{selecc.php}.
\item \texttt{paradigmas.xsl:} Style sheet used to generate the
paradigm files that are used by the form modules. It is applied to the
specification of the paradigms of a language written in \texttt{XML}
format. This is explained in more detail in Section \ref{paradigm}
(\textit{Paradigm files}).
\item \texttt{creaparadigma.awk:} \texttt{awk} file also used to
generate the paradigm files mentioned above.
\item \texttt{gen\_paradig.sh:} Script that can be used to generate
the paradigm files automatically for all the language pairs installed
in the system.
8730 \end{itemize}
8731 \end{itemize}
8732
8733 In the next sections you will find a detailed description of the tasks
8734 of each module.
8735
\subsection{PHP files}
8737
8738 \subsubsection{resultado.php}
8739
Depending on the value of the variable \texttt{\$nomtrad} set by
\textit{index.php}, the module assigns the appropriate values to
\texttt{\$LR} and \texttt{\$RL} (source language and target language
respectively). Then, according to the part of speech of the word being
inserted, the variable \texttt{\$tipo} is assigned the appropriate
value, and \textit{selecc.php} or \textit{multip.php} is called
depending on whether the word is a simple unit or a multiword
unit.  \nota{MG: ``asignamos'' i ``llamamos'' no seria més aviat ``se
asigna'' y ``se llama''?}
8749
8750 \subsubsection{selecc.php}
8751 \label{ss:fitxersphp}
8752
The function of this module is the selection of a paradigm for the
words being inserted. The user has to select one paradigm for the
SL word and another for the TL word.

A group of variables is assigned values that depend on the part of
speech of the word and that are used at the end of the process
\nota{MG: "que darrerament s'utilitzaran" vol dir 'que s'utilitzaran
al final'?}; these variables are:
8761 \begin{itemize}
8762 \item \texttt{cadFich:} part of speech of the lemma.
8763 \item \texttt{show:} string displayed in the form that indicates the
8764 part of speech of the word being inserted.
8765 \item \texttt{tag:} string with the \texttt{XML} tag output by the
8766 morphological analyser for this part of speech.
8767 \item \texttt{tagout:} string with the \texttt{XML} code that shows
8768 the part of speech of the word. This string will be used when building
8769 the final \texttt{XML} entry that will be inserted in the dictionary.
8770 \item \texttt{nota:} string with possible comments to be inserted in
8771 the \texttt{XML} code of the entry.
\end{itemize}

The forms work with four kinds of dictionaries:
8773 \begin{itemize}
\item \textit{Semi-professional dictionaries}: They contain the words
inserted from the form by non-professional users and which are pending
validation. Their extension is ``\textit{semi.dic}''.
\item \textit{Form dictionaries}: They contain the words inserted from
the form by professional users, and also the ones that have been
validated from the semi-professional dictionaries. Their extension is
``\textit{webform}''.
\item \textit{Final dictionaries}: The files with all the entries
written in \texttt{XML} code. These are the files finally used by the
translator after being compiled. Their extension is ``\textit{dix}''.
\item \textit{Final compiled dictionaries}: These are the compiled
final dictionaries, which can be used directly by the binaries of the
translator. Their extension is ``\textit{bin}''.
8787 \end{itemize}
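
The four kinds of dictionaries can be told apart by file extension
alone. The following Python sketch illustrates this; it is not part of
the forms (which are written in PHP), and the function name
\texttt{dictionary\_kind} and the short kind labels are our own
assumptions.

\begin{verbatim}
```python
# Extensions are taken from the list above; the labels are our own wording.
DICTIONARY_KINDS = (
    ("semi.dic", "semi-professional"),
    ("webform", "form"),
    ("dix", "final"),
    ("bin", "compiled"),
)


def dictionary_kind(filename):
    """Return the kind of dictionary a file name denotes, or None."""
    for extension, kind in DICTIONARY_KINDS:
        if filename.endswith("." + extension):
            return kind
    return None
```
\end{verbatim}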
8788
8789 All these dictionaries are used by the forms; there are variables that
8790 contain the paths to them. Values are also assigned to variables that
8791 manage the paths to the auxiliary and the configuration files:
8792 \begin{itemize}
8793 \item \texttt{path:} path to the temporary dictionaries.
8794 \item \texttt{fich\_LR:} source language dictionary with the words
8795 inserted from the form that are not yet in the final dictionary nor in
8796 the compiled dictionary.
8797 \item \texttt{fich\_RL:} target language dictionary with the words
8798   inserted from the form that are not yet in the final dictionary nor
8799   in the compiled dictionary.  \nota{MG: I don't like speaking of SL and
8800     TL dictionaries, entries are for both directions, I think this is
8801     confusing. It should be changed in the whole chapter.}
8802 \item \texttt{fich\_LRRL:} bilingual dictionary with the words
8803 inserted from the form that are not yet in the final dictionary nor in
8804 the compiled dictionary.
8805 \item \texttt{fich-semi:} entries inserted from the form by
8806 non-professional users and which are pending validation.
8807 \item \texttt{path\_paradigmasLR:} path to the files that contain the
8808 inflection paradigms of the source language.
8809 \item \texttt{path\_paradigmasRL:} path to the files that contain the
8810 inflection paradigms of the target language.
8811 \item \texttt{anmor:} path to the morphological analyser.
8812 \item \texttt{aut\_LRRL:} path to the bilingual binary from source
8813 language to target language.\nota{MG: the original said "binario
8814 morfológico", I think it's an error, I wrote 'bilingual binary'}
8815 \item \texttt{aut\_RLLR:} path to the bilingual binary from target
8816 language to source language.\nota{MG: ídem ("bilingual").}
8817 \end{itemize}
8818
Then the module outputs the HTML code, together with the operations to
be performed depending on the selected action. The actions performed
by the module are the following, in sequential order:
8822
8823 \begin{itemize}
8824 \item Tests that the source language lemma being inserted is not
8825 already in the dictionaries containing the words inserted from the
8826 form. If \texttt{selecc.php} has been called from the word validation
8827 screen (\texttt{valida.php}), then the module tests that the lemma is
8828 not already in the file of words inserted by non-professional
8829 users. It tests this also in the final dictionary.
8830 \item Performs the same test for the target language.
\item Writes the code that lets the user select translation direction
restrictions.
\item Defines a series of functions that are used when generating the
examples for the lemmas after the appropriate paradigm has been
selected. These are:
8835   \begin{itemize}
8836   \item \texttt{esVocalFuerte}
8837   \item \texttt{esVocalDebil}
8838   \item \texttt{esVocal}
8839   \item \texttt{PosicioVocalTall}
  \end{itemize} These functions are described later in
  Section~\ref{insertarpro}.
8842 \item The paradigm file is opened to display a drop-down box with the
8843 paradigms that can be selected for the source language lemma. To do
8844 this, the program has to test sequentially the paradigms defined for
8845 the part of speech of the lemma, checking whether the paradigm can be
8846 applied to the lemma in question.
8847 \item Then the same is done with the paradigms for the target language
8848 lemma.
\item After the lemmas and the corresponding paradigms have been
  selected, examples must be generated to show how these lemmas would
  be inflected according to the selected paradigms. This requires the
  root of the lemma (\texttt{raiz\_LR} and \texttt{raiz\_RL}), as well
  as the example endings for the selected paradigm
  (\texttt{paradigma\_LR} and \texttt{paradigma\_RL}); these endings
  are obtained from the paradigm file. Finally, a string is built
  containing the generated examples (\texttt{ejemplos\_LR} and
  \texttt{ejemplos\_RL}), and these are displayed.
\item If this screen was reached while validating words
(\texttt{va\-li\-da=1}), a button is added to the form which allows
the user to delete the current entry instead of validating it.
\item If the user who reached this screen is a professional user,
a button is added to the form which allows the user to start
validating the words entered by non-professional users.
\item Finally, after one of the action buttons located at the bottom
of the form is pressed, the applicable actions are performed. If the
chosen action is \textit{``Delete''}, which can only be the case if the
user is validating entries, the current entry is deleted from the file
of entries made by non-professional users.  If the chosen action is a
confirmation (\textit{``Go on''} button), the module
\texttt{insertarPro.php} or \texttt{insertar.php} is called, depending
on whether the user is professional or non-professional respectively.
These modules are in charge of inserting the words in the
dictionaries.
\end{itemize}

After the entry has been inserted, either the page
\texttt{va\-li\-da.php} or the page \texttt{selecc.php} is displayed
again, depending on whether the user was validating entries
(\textit{valida=1}) or doing a normal insertion.
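
The choice made when an action button is pressed can be summarised in
a few lines. The sketch below is written in Python purely for
illustration and does not reproduce the PHP code of
\texttt{selecc.php}; the function name \texttt{handle\_action}, its
parameters and the string returned for a deletion are hypothetical.

\begin{verbatim}
```python
def handle_action(action, professional, validating):
    """Return the step that follows a button press on the selection form."""
    if action == "Delete":
        # Deleting is only offered while validating pending entries.
        if not validating:
            raise ValueError("Delete is only available during validation")
        return "delete pending entry"
    if action == "Go on":
        # Professional users insert directly; others go to the pending file.
        return "insertarPro.php" if professional else "insertar.php"
    raise ValueError("unknown action: " + action)
```
\end{verbatim}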
8879
8880 \subsubsection{multip.php}
8881
The code and behaviour of this module are the same as those of
\textit{selecc.php}.  The only difference is that this module is
designed for managing multiword units, whereas \textit{selecc.php}
manages all other units. Accordingly, it has two additional
variables, \texttt{\$LRcua} and \texttt{\$RLcua}, which contain the
invariable queue that comes after the variable part of a multiword
unit. When the examples are displayed, besides showing the variable
part inflected according to the selected paradigm, an editable text
box is also displayed with the invariable queue.
8891
8892 When the button to continue with the insertion of the entry in the
8893 dictionaries is pressed, the module \textit{ins\_multip.php} is called
8894 instead of \textit{insertarPro.php}.
8895
8896
8897 \subsubsection{valida.php}
8898
This module is called when a professional user presses the button
``\textit{validate pairs}''. It reads the dictionary of entries pending
validation (\texttt{\$fichSemi}) for the applicable language pair.
Then, the module enters a loop that goes through this file and reads
the entries one by one. With the information of a given entry, it
assigns values to a set of variables that are used by the modules
that perform the subsequent actions. These variables include:
8906 \begin{center}
8907 % use packages: array
8908 \begin{tabular}{ll}
8909 \$LRlem & \$RLlem \\
8910 \$paradigmaLR & \$paradigmaRL \\
8911 \$direccions & \$tipo \\
8912 \$comentarios & \$user \\
8913 \$geneLR & \$geneRL \\
8914 \$numLR & \$numRL \\
8915 \$LR & \$RL
8916 \end{tabular}
8917 \end{center}
8918
Once the appropriate values for these variables have been established,
the module \textit{selecc.php} comes into action and treats the entries
as if they were made by a professional user. After the entries have
been inserted in the dictionaries by means of
\textit{insertarPro.php}, control returns to \textit{valida.php},
which proceeds to the next entry to be validated.
8925
8926 \subsubsection{insertarPro.php}
8927 \label{insertarpro}
8928
After the lemmas have been entered and their paradigms selected in
\textit{selecc.php}, this is the module that generates the
corresponding \texttt{XML} entries and inserts them in the monolingual
dictionaries and the bilingual dictionary.

It performs many operations similar to those performed in
\textit{selecc.php}, such as generating the examples for the inflected
word. First, it gives values to \texttt{cadFich}, \texttt{show},
\texttt{tag}, \texttt{tagout} and \texttt{nota} depending on the part
of speech (\texttt{\$tipus}) of the word being inserted.  It then
assigns paths to the file location variables and, as in
\textit{selecc.php}, defines the following helper functions:
8941 \begin{itemize}
8942 \item \texttt{esVocalFuerte}: Returns \textit{true} if the vowel is
8943 strong, that is, \textit{a, e, o}.
8944 \item \texttt{esVocalDebil}: Returns \textit{true} if the vowel is
8945 weak, that is \textit{i, u}.
8946 \item \texttt{esVocal}: Returns \textit{true} if the character passed
8947 as an argument is a vowel.
8948 \item \texttt{diptongo}: Returns \textit{true} if the two letters
8949 passed as an argument make a diphthong. This will be the case when at
8950 least one of the two vowels is not strong.
8951 \item \texttt{acentuar}: It receives a text string and accentuates it
8952 according to the Spanish accentuation rules, depending on the
8953 parameter \textit{\$siguienteletra}. \nota{MG: only for Spanish?}
8954 \item \texttt{esMayuscula}: Returns \textit{true} if the character is
8955 in upper case.
8956 \item \texttt{TieneAcento}: Returns \textit{true} if the string has an
8957 accent.
\item \texttt{acentua}: Accentuates the last accentuable vowel of a
word with an open or closed accent, depending on the direction
specified in the parameter \texttt{\$sentit}.\nota{MG: then not only for
Spanish but also for Catalan or Occitan?}
\item \texttt{PonQuitaAcento}: Inserts or removes the accent of the
first string passed as an argument depending on whether the second
string passed as an argument has an accent or not.
\item \texttt{PosicioVocalTall}: Returns the position in the lemma
(\texttt{\$lema}) of the vowel (\texttt{\$vocal}) that separates the
root from the ending. The lemma is searched from the end to the
beginning, and the position of the first occurrence of
\texttt{\$vocal} found in that direction is returned.
8969 \end{itemize}
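
The simpler helpers above can be sketched as follows. This is a Python
illustration of the behaviour described in the list, not the PHP code
of \texttt{insertarPro.php}: the snake-case names are our own
renderings of the PHP names, accented vowels are deliberately ignored,
and returning $-1$ when the vowel is absent is an assumption.

\begin{verbatim}
```python
STRONG_VOWELS = "aeo"
WEAK_VOWELS = "iu"


def es_vocal_fuerte(letter):
    """True for the strong vowels a, e, o (unaccented only)."""
    return len(letter) == 1 and letter.lower() in STRONG_VOWELS


def es_vocal_debil(letter):
    """True for the weak vowels i, u (unaccented only)."""
    return len(letter) == 1 and letter.lower() in WEAK_VOWELS


def es_vocal(letter):
    """True if the character is a strong or a weak vowel."""
    return es_vocal_fuerte(letter) or es_vocal_debil(letter)


def diptongo(first, second):
    """True if two vowels form a diphthong: at least one is not strong."""
    both_vowels = es_vocal(first) and es_vocal(second)
    both_strong = es_vocal_fuerte(first) and es_vocal_fuerte(second)
    return both_vowels and not both_strong


def posicio_vocal_tall(lema, vocal):
    """Position of the vowel nearest the end of the lemma (-1 if absent)."""
    # str.rfind scans from the end of the string towards the beginning.
    return lema.rfind(vocal)
```
\end{verbatim}

For instance, \texttt{posicio\_vocal\_tall("cantar", "a")} gives 4,
the position of the second \textit{a}, so the root would be
\textit{cant}.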
8970
Next, the same operations as in \textit{selecc.php} are
performed: the module makes sure that the entry is not yet in the
dictionaries, and then generates the examples of the word inflected
according to the paradigm previously selected. After this, using the
information on the lemmas entered in \textit{selecc.php}, it builds a
text string (\texttt{\$cad\_LR}) with the \texttt{XML} code that is
going to be inserted in the source language monolingual dictionary.
This string is displayed in a text box that can be edited manually.
The same is done to generate the string for the target language
monolingual dictionary (\texttt{\$cad\_RL}) and for the bilingual
dictionary (\texttt{\$cad\_bil}). Then, any comments and the name of
the user making the entry are concatenated to these variables, if
applicable.  Finally, the form screen is completed by adding the
buttons for accepting, deleting and going back.  The code that
processes each of the possible actions is at the end of the file:
8989 \begin{itemize}
\item \texttt{Insert: } The module makes some character
  replacements so that the entry has the right format in the
  dictionaries, and inserts the strings \texttt{\$cad\_LR},
  \texttt{\$cad\_bil} and \texttt{\$cad\_RL} in the source monolingual,
  bilingual and target monolingual dictionaries respectively
  (\texttt{\$fich\_LR}, \texttt{\$fich\_LRRL} and \texttt{\$fich\_RL}).
  If an error occurs when inserting the entry, a warning message is
  displayed. If \textit{insertarPro.php} was called from a word
  validation process (\textit{\$valida=1}), the button
  ``\textit{Continue}'' is inserted to continue with the validation.
  Otherwise, a button to close the window is inserted, to allow the
  user to end the process.
9001 \item \texttt{Delete: } It deletes the entry from the file of entries
9002 pending validation.
9003 \end{itemize}
9004
9005 \subsubsection{ins\_multip.php}
9006
9007 It performs the same actions as \textit{insertarPro.php} but it is
9008 intended for multiword units. The main difference is the existence of
9009 two additional variables, \texttt{\$LRcua} and \texttt{\$RLcua}, that
9010 contain the invariable part of a multiword. When the entry is added to
9011 the dictionaries, this queue has to be inserted in the right place and
9012 the blanks have to be turned into \texttt{<b/>} tags.
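
The conversion of blanks in the invariable queue can be sketched in
Python as follows. This only illustrates the replacement; it is not
the PHP code of \texttt{ins\_multip.php}, the function name
\texttt{queue\_to\_xml} is hypothetical, and collapsing a run of
several blanks into a single tag is an assumption. Where the result is
placed inside the dictionary entry is not covered here.

\begin{verbatim}
```python
def queue_to_xml(queue):
    """Join the words of the invariable queue with <b/> tags."""
    # split() drops leading and trailing blanks and collapses runs of blanks.
    return "<b/>".join(queue.split())
```
\end{verbatim}

For example, the queue \textit{de menos} becomes
\texttt{de<b/>menos}.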
9013
9014 \subsubsection{insertar.php}
9015
The function of this module is very simple. It builds a text string
with the information provided by \textit{selecc.php}, with the fields
separated by tabs. This string contains all the information required
to generate a dictionary entry:
9020
9021 \texttt{\$LRlem.\$RLlem.\$paradigmaLR.\$direccion.\$paradigmaRL.}
9022
9023
9024 \texttt{\$tipo.\$comentarios.\$user.\$geneLR.\$geneRL.}
9025
9026
9027
This entry is saved in a file (\texttt{\$fichSemi}) that contains the
queue of entries inserted by non-professional users and waiting for
validation. When a professional user wishes to validate pending
entries, the \textit{valida.php} module reads from this file.
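
A tab-separated record of this kind can be written and read back as in
the following Python sketch. It is an illustration only, not the PHP
code of \texttt{insertar.php} or \texttt{valida.php}: the field order
follows the list shown above, and the function names
\texttt{build\_pending\_line} and \texttt{parse\_pending\_line} are
hypothetical.

\begin{verbatim}
```python
# Field order as listed above (PHP variable names without the dollar sign).
FIELDS = ("LRlem", "RLlem", "paradigmaLR", "direccion", "paradigmaRL",
          "tipo", "comentarios", "user", "geneLR", "geneRL")


def build_pending_line(entry):
    """Join the fields of a pending entry with tabs, in the order above."""
    return "\t".join(entry[name] for name in FIELDS)


def parse_pending_line(line):
    """Split a line of the pending file back into a dict of named fields."""
    return dict(zip(FIELDS, line.rstrip("\n").split("\t")))
```
\end{verbatim}

Because the separator is a tab, this sketch assumes that no field
(for example, a comment) contains a tab character itself.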
9032
9033
9034 \subsubsection{verSemi.php}
9035
It displays the file of entries waiting for validation as follows:
it reads the file containing the entries (\textit{\$fichSemi}) and
loops over all the entries of the file. For each entry,
it displays a line with the following information:
9040
9041 \texttt{\$LRlem
9042 \$paradigmaLR
9043 \$direccion
9044 \$RLlem}
9045
9046 \texttt{\$paradigmaRL
9047 \$tipo
9048 \$comentarios}
9049
9050 \subsection{Dictionary files}
9051
The files containing the entries inserted from the form are saved in
\texttt{/dics}. There are two kinds of files here:
9054
9055 \begin{itemize}
\item \texttt{apertium-ll-rr.xx.webform}: This is the file that
contains the entries in \texttt{XML} code, ready to be copied to the
final dictionaries. In the file name, \texttt{ll-rr} are the initials
of the language pair of the translator and \texttt{xx} the initials of
the language of the monolingual dictionary, or of the languages of the
bilingual dictionary, as applicable. For example, the initials of the
Spanish-Catalan translator are \texttt{es-ca}. For this translator, we
have the Spanish monolingual (\texttt{es}), the Catalan monolingual
(\texttt{ca}) and the bilingual (\texttt{es-ca})
dictionaries. Therefore, this directory will contain the following
files for the Spanish-Catalan translator:
\begin{center} \texttt{apertium-es-ca.es.webform\\
apertium-es-ca.ca.webform\\ apertium-es-ca.es-ca.webform}
\end{center}
9071
9072
\item \texttt{oo-mm.semi.dic}: This is the file containing the entries
pending validation for a given language pair, where \texttt{oo-mm} are
the initials of the pair. For example, for the Spanish-Catalan
translator this file would be \texttt{es-ca.semi.dic}.
9077
9078
9079 \end{itemize}
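
The naming scheme above is regular enough to be generated from the
pair initials. The following Python sketch shows this; the function
names \texttt{webform\_files} and \texttt{pending\_file} are
hypothetical and are not part of the forms.

\begin{verbatim}
```python
def webform_files(pair):
    """Names of the three .webform files for a pair such as "es-ca"."""
    left, right = pair.split("-")
    # Two monolingual dictionaries, then the bilingual one.
    return ["apertium-%s.%s.webform" % (pair, code)
            for code in (left, right, pair)]


def pending_file(pair):
    """Name of the file of entries pending validation for the pair."""
    return pair + ".semi.dic"
```
\end{verbatim}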
9080
9081 \subsection{Paradigm files}
9082 \label{paradigm}
9083
9084 The paradigms used for each language pair are specified in two
9085 \texttt{XML} files named \texttt{paradig.ll-rr.xx.xml}, where
9086 \texttt{xx} are the initials of the language and \texttt{ll-rr} the
9087 initials of the language pair. These files consist of a set of entries
9088 describing the paradigms or inflection models for the words of a given
9089 language. The \texttt{XML} file has the following parts:
9090 \begin{itemize}
9091 \item Head/root of the specification file.\\
9092 \begin{alltt}
9093 <?xml version="1.0" encoding="ISO-8859-1"?>
9094 <?xml-stylesheet type="text/xsl" href="paradigmas.xsl"?>
9095 <!DOCTYPE form SYSTEM "form.dtd">
9096 <form lang="oc" langpair="oc-ca">
9097 \end{alltt}
The \textit{lang} attribute states the initials of the
language for which paradigms are specified, and the \textit{langpair}
attribute states the initials of the language pair of the translator
for which the specification is made. The directory containing the
paradigm files must also contain the \texttt{form.dtd} file, which is
the DTD specifying these files. You can find this DTD in
Appendix \ref{ss:dtdparadigmes}.
\item A set of elements that define the paradigms. The following
example illustrates their format: \\
9107 \begin{alltt}
9108 <entry PoS="adj" nbr="sg_pl" gen="mf">
9109         <endings>
9110                 <stem>amable</stem>
9111                 <ending/>
9112                 <ending>s</ending>
9113         </endings>
9114         <paradigms howmany="1">
9115                 <par n="amable\_\_adj"/>
9116         </paradigms>
9117 </entry>
9118 \end{alltt}
Each paradigm is specified in an \texttt{<entry>} element.
This element can have three attributes:
9121   \begin{itemize}
9122   \item \textit{PoS}: the part of speech of the paradigm. It can take
9123   the values: acr, adj, adv, noun, pname, pr, verbo. \nota{also
9124   cnjadv?} It is mandatory for any part of speech.
9125   \item \textit{nbr}: the numbers admitted by the paradigm. It can
9126   take the values: sg, pl, sg\_pl, sp.
9127   \item \textit{gen}: the genders admitted by the paradigm. It can
9128   take the values: m, f, m f, mf.
  \end{itemize} It also has two child elements:
9130   \begin{itemize}
9131   \item \texttt{endings}: the root and the endings used to select the
9132   paradigm in the form and display the inflection examples.
9133
  \item \texttt{paradigms}: specification of the paradigm or paradigms
  that define the inflection of an entry.  It requires the attribute
  \textit{howmany}, which specifies the number of paradigms used by
  an entry. Each paradigm used is given on its own line, with the name
  of the paradigm in the dictionary, in this
  format:
9140     \begin{center}
9141 \begin{alltt}
9142 <par n="long\_\_adj"/>
9143 \end{alltt}
9144     \end{center}
9145   \end{itemize}
9146 \end{itemize}
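
To make the structure of an \texttt{<entry>} element concrete, the
following Python sketch reads the example entry shown above with the
standard \texttt{xml.etree.ElementTree} module. It is an illustration,
not the mechanism used by the forms (which rely on
\texttt{paradigmas.xsl}); the function name \texttt{read\_entry} is
hypothetical, and building each example form by appending an ending to
the stem is an assumption about how the examples are derived.

\begin{verbatim}
```python
import xml.etree.ElementTree as ET

ENTRY = """
<entry PoS="adj" nbr="sg_pl" gen="mf">
  <endings>
    <stem>amable</stem>
    <ending/>
    <ending>s</ending>
  </endings>
  <paradigms howmany="1">
    <par n="amable__adj"/>
  </paradigms>
</entry>
"""


def read_entry(xml_text):
    """Return the attributes, example forms and paradigm names of an entry."""
    entry = ET.fromstring(xml_text.strip())
    stem = entry.findtext("endings/stem")
    # An empty <ending/> has no text, so it contributes the bare stem.
    examples = [stem + (e.text or "") for e in entry.findall("endings/ending")]
    paradigms = [par.get("n") for par in entry.findall("paradigms/par")]
    return dict(entry.attrib), examples, paradigms
```
\end{verbatim}

Under these assumptions, the example entry yields the forms
\textit{amable} and \textit{amables} and the single paradigm name
\texttt{amable\_\_adj}.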
9147
9148 From the \texttt{XML} paradigm file, it is necessary to generate the
9149 files directly used by the modules of the forms. Running the script
9150 \texttt{/private/gen\_paradig.sh}, the process is automatically done
9151 for all the available language pairs:
9152 \begin{alltt}
9153    #  cd private
9154    # ./gen\_paradig.sh
9155 \end{alltt}
To add a new paradigm to the forms, add an appropriate entry
to the \texttt{XML} paradigm file and then run the
script above to update the working files.
9159
The working files can also be generated manually if you do not want to
update the files for all the installed language pairs. This is done by
applying an \texttt{XSL} style sheet with the following command:
9164 \begin{alltt}
9165    # xsltproc paradigmas.xsl paradigm\_file.xml
9166                                      | ./creaparadig.awk
9167 \end{alltt}
9168
This action generates a working file for each part of speech. The
generated files are saved in the directories
\texttt{/private/paradigmas.ll-rr}.  These directories contain the
files with the paradigms that can be used for each language pair
\texttt{ll-rr} and for each part of speech.  Each of these
directories contains the following files:
9175 \begin{itemize}
9176 \item \texttt{paradigacr\_xx}: paradigms for acronyms in the language
9177 \texttt{xx}.
9178 \item \texttt{paradigadj\_xx}: paradigms for adjectives in the
9179 language \texttt{xx}.
9180 \item \texttt{paradigadv\_xx}: paradigms for adverbs in the language
9181 \texttt{xx}.
9182 \item \texttt{paradigcnjadv\_xx}: paradigms for adverbial conjunctions
9183 in the language \texttt{xx}.
\item \texttt{paradigcnjcoo\_xx}: paradigms for coordinating
conjunctions in the language \texttt{xx}.\nota{MG: aquesta no està en
la pàgina web del formulari}
9187 \item \texttt{paradigcnjsub\_xx}: paradigms for subordinating
9188 conjunctions in the language \texttt{xx}.\nota{ídem}
9189 \item \texttt{paradignoun\_xx}: paradigms for nouns in the language
9190 \texttt{xx}.
9191 \item \texttt{paradigpname\_xx}: paradigms for proper nouns in the
9192 language \texttt{xx}.
9193 \item \texttt{paradigpr\_xx}: paradigms for prepositions in the
9194 language \texttt{xx}.
9195 \item \texttt{paradigverb\_xx}: paradigms for verbs in the language
9196 \texttt{xx}.
9197 \end{itemize}
9198
9199 The files consist of one entry per line. Each entry contains the
9200 following information:
9201
9202 \begin{center} % use packages: array
9203 \begin{tabular}{lllll}
9204 \textit{examples} & \textit{number of paradigms} & \textit{model\_paradigms} & \textit{(numbers)} &
9205 \textit{(genders)}
9206 \end{tabular}
9207 \end{center}
9208
9209
The fields of an entry are separated by tabs.
\begin{itemize}
\item \textit{Examples}: the endings that will be used to generate the
examples shown when the user chooses this paradigm as a model for the
word being inserted.
\item \textit{Number of paradigms}: the number of dictionary paradigms
needed to inflect words that follow this inflection model.
\item \textit{Model paradigms}: the name, as it appears in the
dictionary, of the paradigm or paradigms that will be used to inflect
a new entry.
\item \textit{(Numbers)}: only filled in for nouns, adjectives and
acronyms.  Refers to the grammatical number in the paradigm.
\item \textit{(Genders)}: only filled in for nouns, adjectives and
acronyms.  Refers to the grammatical gender in the paradigm.
\end{itemize}
9224
For the Spanish-Catalan translator, for example, the directory
\texttt{/private\-/paradigmas.es-ca} would contain two \texttt{XML}
files, \texttt{paradig.es-ca.es.xml} and
\texttt{paradig.es-ca.ca.xml}, which specify the paradigms used in each
language. From these files, you can generate all the paradigm files
for the language pair using the commands:
9231 \begin{alltt}
9232    #  cd private/paradigmas.es-ca
9233    #  xsltproc ../paradigmas.xsl paradig.es-ca.es.xml
9234                                     | ../creaparadig.awk
9235    #  xsltproc ../paradigmas.xsl paradig.es-ca.ca.xml
9236                                     | ../creaparadig.awk
9237 \end{alltt}
9238
9239
Alternatively, you can generate them automatically for all the
language pairs with:
9242 \begin{alltt}
9243    #  ./private/gen\_paradig.sh
9244 \end{alltt}
9245
One of the generated working files would be, for example,
\texttt{paradigverb\_ca}, which contains the available verb paradigms
for Catalan; one of its lines might be:
9249
9250 \begin{center}
9251 \texttt{abra/çar /ço /ci        1       abalan/çar\_\_vblex}
9252 \end{center}
9253
9254 that is generated from the \texttt{XML} entry:
9255
9256 \begin{alltt}
9257 <entry PoS="verb">
9258         <endings>
9259                 <stem>abra</stem>
9260                 <ending>çar</ending>
9261                 <ending>ço</ending>
9262                 <ending>ci</ending>
9263         </endings>
9264         <paradigms howmany="1">
9265                 <par n="abalan/çar\_\_vblex"/>
9266         </paradigms>
9267 </entry>
9268 \end{alltt}
9269
9270
9271
9272 \section{Using the forms}
9273 \label{ss:formus}
9274 \subsection{Introduction}
9275
9276
When a user wants to insert new entries in a
dictionary, he/she has to use a web browser to connect to the
address where the form server has been installed; for example:
\begin{center} \texttt{http://xixona.dlsi.ua.es/forms}
\end{center} The access page of the
\texttt{Opentrad\- Apertium\- Insertion\- Form} will be displayed. The
left margin contains links to get more \textit{information},
\textit{download} the programs and \textit{contact} the administrator
of the forms to request registration as a system user. To register as
a user you have to send an e-mail to the administrator.
9287
\nota{Replace \emph{registrar} with \emph{inscribir} everywhere.}  To
insert new words, you have to enter the required data in the
form and press the \textit{'Go on'} button; at this point you will
have to identify yourself as a registered user, or else you will not
be able to continue. There are two types of user registration: you can
be registered as a \emph{professional} or as a \emph{non-professional}
user. Each mode offers different functions, which are explained in
the following section.
9296
9297 \subsection{Insertion of entries}
9298 \label{insertion}
9299
9300 \subsubsection{Professional mode}
9301
9302 If you want to add a new entry to the dictionaries, you have to go to
9303 the section of the language pair you want to improve. There, you have
9304 to enter the source language lemma and the target language lemma, and
select their part of speech. Press the \textit{'Go on'} button to
9306 continue.
9307
9308 A new window is displayed, with the lemmas and some parameters used to
9309 define the entries. If the entry already exists in one of the
9310 dictionaries, a warning message is displayed and the system automatically
9311 selects one-way translation (from left to right or vice versa). If
9312 none of the dictionaries contain the entry, the entry will be entered
9313 for both directions.
9314
In this window you can perform three actions:

\notavisible{The first sentence of the following paragraph needs to be
reviewed; it seems that some material should be deleted. Also, the
forms do not currently seem to support multiple translations. This
should be stated somewhere.}
9322 \begin{itemize}
9323 \item Choose the paradigm for the SL and the TL lemmas (this is
9324   mandatory, the remaining actions are not).\footnote{Choosing the
9325   paradigm has to be done very carefully. You have to choose the
9326   paradigm that describes exactly the grammatical and inflection
9327   characteristics of the inserted word. In the case of adjectives,
9328   nouns and acronyms, you have to select a paradigm that fits the
9329   inflection of the word and the genders it may present. For example,
9330   in the case of acronyms you have to consider the gender and the
9331   number admitted by each possible paradigm; the paradigm BBC, for
9332   example, is for feminine singular acronyms, whereas SA is for
9333   feminine acronyms that may have plural form. In the case of proper
9334   nouns, you have to choose a different paradigm depending on whether
9335   the word is a proper noun of a thing (e.g. a newspaper), a person or
9336   a place.}
\item Select the translation direction of the entry if it is different
from the one suggested automatically.
\item Add comments to the entry, which will be included in the final
dictionary.
9341 \end{itemize}
9342
Once the required actions have been performed, press
\textit{'Go on'} to confirm the entry or \textit{'Close'}
to cancel the insertion.
9346
The next and last screen displays the three \texttt{XML} entries
generated for the SL monolingual, TL monolingual and
bilingual dictionaries. These entries are displayed in three text
boxes that can be edited if you want to make any changes. Once you
have checked the entries, press the \textit{'Insert'} button to
insert them in the corresponding dictionaries. You can also press the
\textit{'Go back'} button to return to the previous step.
9354
9355 \subsubsection{Non-professional mode}
9356
9357 When a user enters the insertion system as a non-professional user,
9358 the word insertion mechanism is the same as for the professional user,
9359 with the difference that the entries will not be saved in the
9360 dictionaries generated by the forms, but will be entered in a queue of
9361 entries pending validation. The words in this queue will not be
9362 inserted in the dictionaries until a professional user validates them.
9363
9364 \subsection{Validating entries}
9365
9366 Professional users have two additional links in the screen for
9367 paradigm selection:
9368 \begin{itemize}
\item \textit{See pairs to be validated}: Selecting this option
opens a screen that displays the content of the file of entries pending
validation; these are the entries inserted by non-professional
users. This screen is purely informative and can be closed by
pressing the \textit{'Close'} button.
\item \textit{Validate pairs}: This option allows a professional user
to validate, one by one, the entries waiting for validation. Selecting
this option opens the paradigm selection screen
already described in Section~\ref{insertion}. This screen shows
the data selected by the user who added the entry.  The
professional user can then modify the lemmas, delete the entry or
continue with the insertion process. If the user decides to proceed
with the insertion, the process is the same as for a normal insertion,
except that at the end, when the entry is finally added to the
dictionaries of the form, control passes to the next entry in the
queue pending validation, which is then displayed.
9385
9386 This process is repeated until all the words of the queue are
9387 validated or until the process is finished by selecting
9388 \textit{'Close'}.
9389
9390 \end{itemize}
9391
9392
9393
9394 \newpage
9395 \appendix
9396
9397 \chapter[XML DTDs]{Document Type Definitions (DTD) in XML}
9398 \label{DTDs}
9399
9400 \section{DTD for the format of dictionaries}
9401 \label{ss:dtd_dics}
9402
9403
This is the document type definition for the format of morphological,
bilingual and post-generation dictionaries in XML. It is provided
with the \texttt{apertium} package (latest version), which can be
downloaded from \url{http://www.sourceforge.net}.
9408
9409 The description of its elements can be found in Section
9410 \ref{formatodics}.
9411
9412
9413
9414 \begin{small}
9415 \begin{alltt}
9416 <!\textsl{ELEMENT} \textbf{dictionary} (alphabet?, sdefs?,
9417                       pardefs?, section+)>
9418
9419 <!\textsl{ELEMENT} \textbf{alphabet} (\textsl{#PCDATA})>
9420
9421 <!\textsl{ELEMENT} \textbf{sdefs} (sdef+)>
9422
9423 <!\textsl{ELEMENT} \textbf{sdef} \textsl{EMPTY}>
9424 <!\textsl{ATTLIST} sdef n ID \textsl{#REQUIRED}>
9425
9426 <!\textsl{ELEMENT} \textbf{pardefs} (pardef+)>
9427
9428 <!\textsl{ELEMENT} \textbf{pardef} (e+)>
9429 <!\textsl{ATTLIST} pardef n CDATA \textsl{#REQUIRED}>
9430
9431 <!\textsl{ELEMENT} \textbf{section} (e+)>
9432
9433 <!\textsl{ATTLIST} section id ID \textsl{#REQUIRED}
9434                   type (standard|inconditional|postblank) \textsl{#REQUIRED}>
9435
9436 <!\textsl{ELEMENT} \textbf{e} (i | p | par | re)+>
9437 <!\textsl{ATTLIST} e r (LR|RL) \textsl{#IMPLIED}
9438             lm CDATA \textsl{#IMPLIED}
9439             a CDATA \textsl{#IMPLIED}
            c CDATA \textsl{#IMPLIED}>
9441
9442 <!\textsl{ELEMENT} \textbf{par} \textsl{EMPTY}>
9443 <!\textsl{ATTLIST} par n CDATA \textsl{#REQUIRED}>
9444
9445 <!\textsl{ELEMENT} \textbf{i} (\textsl{#PCDATA} | b | s | g | j | a)*>
9446
9447 <!\textsl{ELEMENT} \textbf{re} (\textsl{#PCDATA})>
9448
9449 <!\textsl{ELEMENT} \textbf{p} (l, r)>
9450
9451 <!\textsl{ELEMENT} \textbf{l} (\textsl{#PCDATA} | a | b | g | j | s)*>
9452
9453 <!\textsl{ELEMENT} \textbf{r} (\textsl{#PCDATA} | a | b | g | j | s)*>
9454
9455 <!\textsl{ELEMENT} \textbf{a} \textsl{EMPTY}>
9456
9457 <!\textsl{ELEMENT} \textbf{b} \textsl{EMPTY}>
9458
9459 <!\textsl{ELEMENT} \textbf{g} (\textsl{#PCDATA} | a | b | j | s)*>
9460 <!\textsl{ATTLIST} g i CDATA \textsl{#IMPLIED}>
9461
9462 <!\textsl{ELEMENT} \textbf{j} \textsl{EMPTY}>
9463
9464 <!\textsl{ELEMENT} \textbf{s} \textsl{EMPTY}>
9465
9466 <!\textsl{ATTLIST} s n \textsl{IDREF} \textsl{#REQUIRED}>
9467
9468 \end{alltt}
9469 \end{small}
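
As an illustration of the DTD above, the following is a minimal sketch
of a monolingual dictionary that follows it. The alphabet, the symbol
names (\texttt{n}, \texttt{sg}, \texttt{pl}), the paradigm name
\texttt{house\_\_n}, the section identifier \texttt{main} and the lemma
are assumptions chosen for the example; they are not taken from any
dictionary distributed with the package.

\begin{small}
\begin{verbatim}
<dictionary>
  <alphabet>abcdefghijklmnopqrstuvwxyz</alphabet>
  <sdefs>
    <!-- every symbol referenced by an <s> element is declared here -->
    <sdef n="n"/>
    <sdef n="sg"/>
    <sdef n="pl"/>
  </sdefs>
  <pardefs>
    <!-- hypothetical paradigm: empty ending for singular, "s" for plural -->
    <pardef n="house__n">
      <e><p><l/><r><s n="n"/><s n="sg"/></r></p></e>
      <e><p><l>s</l><r><s n="n"/><s n="pl"/></r></p></e>
    </pardef>
  </pardefs>
  <section id="main" type="standard">
    <!-- the invariable part of the word followed by its paradigm -->
    <e lm="house"><i>house</i><par n="house__n"/></e>
  </section>
</dictionary>
\end{verbatim}
\end{small}

Note that the \texttt{n} attribute of \texttt{<s>} is declared as
\texttt{IDREF}, so a validating parser will reject any symbol that has
no matching \texttt{<sdef>}, whereas the \texttt{n} attribute of
\texttt{<par>} is plain \texttt{CDATA} and is not checked by the DTD.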
9470
9471
9472 \subsection{Modification of the DTD of dictionaries for lexical
9473 selection}
9474 \label{dixdtd}
9475
The DTD for the format of dictionaries has been slightly modified so
that dictionaries can be used in a system that has a lexical selection
module. The change only affects the attribute list of the
\texttt{<e>} element and is shown below.
9480
9481
9482
9483 \begin{small}
9484 \begin{alltt}
9485
9486 ...
9487 <!\textsl{ATTLIST} e
9488         r (LR|RL) \textsl{#IMPLIED}
9489         lm \textsl{CDATA #IMPLIED}
9490         a \textsl{CDATA #IMPLIED}
        c \textsl{CDATA #IMPLIED}
9492         i CDATA \textsl{#IMPLIED}
9493         slr CDATA \textsl{#IMPLIED}
9494         srl CDATA \textsl{#IMPLIED}>
9495
9496   <!-- r: restriction LR: left-to-right,
9497                       RL: right-to-left -->
9498   <!-- lm: lemma -->
9499   <!-- a: author -->
9500   <!-- c: comment -->
  <!-- i: ignore ('yes' means ignore; otherwise it is not ignored) -->
9502   <!-- slr: translation sense when translating from left to right -->
9503   <!-- srl: translation sense when translating from right to left -->
9504 ...
9505
9506 \end{alltt}
9507 \end{small}
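
The fragment below is a small sketch of how the additional attributes
might appear in the entries of a bilingual dictionary. The lemmas and
the values given to \texttt{slr} are hypothetical labels invented for
the example; the DTD only states that these attributes are
\texttt{CDATA}, and the convention actually expected for their values
is set by the lexical selection module, not by this sketch.

\begin{small}
\begin{verbatim}
<!-- two entries for the same left-side lemma, each one labelled
     with a (hypothetical) sense for left-to-right translation -->
<e slr="bank">
  <p><l>banco<s n="n"/></l><r>bank<s n="n"/></r></p>
</e>
<e slr="bench">
  <p><l>banco<s n="n"/></l><r>bench<s n="n"/></r></p>
</e>

<!-- an entry marked to be ignored -->
<e i="yes">
  <p><l>banco<s n="n"/></l><r>shoal<s n="n"/></r></p>
</e>
\end{verbatim}
\end{small}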
9508
9509
9510
9511
9512 \section[DTD for the tagger file]{DTD for the format of the tagger
9513 file}
9514 \label{ss:DTD_desambiguador}
9515
This DTD defines the format of the tagger specification file.  It is
provided with the \texttt{apertium} package (latest
version), which can be downloaded from
\url{http://www.sourceforge.net}.
9520
9521 The description of its elements can be found in
9522 Section~\ref{formatotagger}.
9523
9524   \begin{small}
9525   \begin{alltt}
9526 <!\textsl{ELEMENT} \textbf{tagger} (tagset,forbid?,enforce-rules?,preferences?)>
9527 <!\textsl{ATTLIST} tagger name \textsl{CDATA} \textsl{#REQUIRED}>
9528
9529 <!\textsl{ELEMENT} \textbf{tagset} (def-label+,def-mult*)>
9530
9531 <!\textsl{ELEMENT} \textbf{def-label} (tags-item+)>
9532 <!\textsl{ATTLIST} def-label name \textsl{CDATA} \textsl{#REQUIRED}
9533                     closed \textsl{CDATA} \textsl{#IMPLIED}>
9534
<!\textsl{ELEMENT} \textbf{tags-item} \textsl{EMPTY}>
9536 <!\textsl{ATTLIST} tags-item tags \textsl{CDATA} \textsl{#REQUIRED}
9537                     lemma \textsl{CDATA} \textsl{#IMPLIED}>
9538
9539 <!\textsl{ELEMENT} \textbf{def-mult} (sequence+)>
9540 <!\textsl{ATTLIST} def-mult name \textsl{CDATA} \textsl{#REQUIRED}
9541                    closed \textsl{CDATA} \textsl{#IMPLIED}>
9542
9543 <!\textsl{ELEMENT} \textbf{sequence} ((tags-item|label-item)+)>
9544
<!\textsl{ELEMENT} \textbf{label-item} \textsl{EMPTY}>
9546 <!\textsl{ATTLIST} label-item label \textsl{CDATA} \textsl{#REQUIRED}>
9547
9548 <!\textsl{ELEMENT} \textbf{forbid} (label-sequence+)>
9549
9550 <!\textsl{ELEMENT} \textbf{label-sequence} (label-item+)>
9551
9552 <!\textsl{ELEMENT} \textbf{enforce-rules} (enforce-after+)>
9553
9554 <!\textsl{ELEMENT} \textbf{enforce-after} (label-set)>
9555 <!\textsl{ATTLIST} enforce-after label \textsl{CDATA} \textsl{#REQUIRED}>
9556
9557 <!\textsl{ELEMENT} \textbf{label-set} (label-item+)>
9558
9559 <!\textsl{ELEMENT} \textbf{preferences} (prefer+)>
9560
9561 <!\textsl{ELEMENT} \textbf{prefer} \textsl{EMPTY}>
9562 <!\textsl{ATTLIST} prefer tags \textsl{CDATA} \textsl{#REQUIRED}>
9563   \end{alltt}
9564 \end{small}
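
To show how the elements of this DTD fit together, here is a minimal
sketch of a tagger specification file. The tagger name, the category
labels (\texttt{DET}, \texttt{NOUN}, \texttt{ADJ}, \texttt{VERB}), the
tag patterns and the value of \texttt{closed} are assumptions made for
the example and do not come from any language pair.

\begin{small}
\begin{verbatim}
<tagger name="example">
  <tagset>
    <!-- coarse categories defined in terms of fine-grained tags -->
    <def-label name="DET" closed="true">
      <tags-item tags="det.*"/>
    </def-label>
    <def-label name="NOUN">
      <tags-item tags="n.*"/>
    </def-label>
    <def-label name="ADJ">
      <tags-item tags="adj.*"/>
    </def-label>
    <def-label name="VERB">
      <tags-item tags="vblex.*"/>
    </def-label>
  </tagset>
  <forbid>
    <!-- a determiner may not be immediately followed by a verb -->
    <label-sequence>
      <label-item label="DET"/>
      <label-item label="VERB"/>
    </label-sequence>
  </forbid>
  <enforce-rules>
    <!-- after a determiner only a noun or an adjective is allowed -->
    <enforce-after label="DET">
      <label-set>
        <label-item label="NOUN"/>
        <label-item label="ADJ"/>
      </label-set>
    </enforce-after>
  </enforce-rules>
  <preferences>
    <prefer tags="n.m.sg"/>
  </preferences>
</tagger>
\end{verbatim}
\end{small}

The \texttt{label} attributes are declared as \texttt{CDATA}, so the
DTD itself does not verify that a label used in
\texttt{<label-item>} has been defined in the \texttt{<tagset>}.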
9565
9566
9567
9568 \section[DTD of the chunker module]{DTD of the structural transfer
9569 module (chunker)}
9570 \label{ss:dtdtransfer}
9571
This DTD defines the format of the structural transfer rules in the
\texttt{chunker} module.  It is provided with the
\texttt{apertium} package (version 2.0), which can be downloaded from
\url{http://www.sourceforge.net}.
9576
9577 Its elements are described in Section \ref{formatotransfer}.
9578
9579
9580 \begin{small}
9581 \begin{alltt}
9582 <!\textsl{ENTITY} \% condition "(and|or|not|equal|begins-with|
9583                        ends-with|contains-substring|in)">
9584 <!\textsl{ENTITY} \% container "(var|clip)">
9585 <!\textsl{ENTITY} \% sentence "(let|out|choose|modify-case|
9586                       call-macro|append)">
9587 <!\textsl{ENTITY} \% value "(b|clip|lit|lit-tag|var|get-case-from|
9588                    case-of|concat)">
9589 <!\textsl{ENTITY} \% stringvalue "(clip|lit|var|get-case-from|
9590                          case-of)">
9591
9592 <!\textsl{ELEMENT} \textbf{transfer} (section-def-cats,
9593                     section-def-attrs,
9594                     section-def-vars,
9595                     section-def-lists?,
9596                     section-def-macros?,
9597                     section-rules)>
9598
9599 <!\textsl{ATTLIST} transfer default (lu|chunk) \textsl{#IMPLIED}>
9600
9601 <!\textsl{ELEMENT} \textbf{section-def-cats} (def-cat+)>
9602
9603 <!\textsl{ELEMENT} \textbf{def-cat} (cat-item+)>
9604 <!\textsl{ATTLIST} def-cat n ID \textsl{#REQUIRED}>
9605
9606 <!\textsl{ELEMENT} \textbf{cat-item} \textsl{EMPTY}>
9607 <!\textsl{ATTLIST} cat-item lemma CDATA \textsl{#IMPLIED}
9608                    tags CDATA \textsl{#REQUIRED} >
9609
9610 <!\textsl{ELEMENT} \textbf{section-def-attrs} (def-attr+)>
9611
9612 <!\textsl{ELEMENT} \textbf{def-attr} (attr-item+)>
9613 <!\textsl{ATTLIST} def-attr n ID \textsl{#REQUIRED}>
9614
9615 <!\textsl{ELEMENT} \textbf{attr-item} \textsl{EMPTY}>
9616 <!\textsl{ATTLIST} attr-item tags CDATA \textsl{#IMPLIED}>
9617
9618 <!\textsl{ELEMENT} \textbf{section-def-vars} (def-var+)>
9619
9620 <!\textsl{ELEMENT} \textbf{def-var} \textsl{EMPTY}>
9621 <!\textsl{ATTLIST} def-var n ID \textsl{#REQUIRED}>
9622
9623 <!\textsl{ELEMENT} \textbf{section-def-lists} (def-list)+>
9624
9625 <!\textsl{ELEMENT} \textbf{def-list} (list-item+)>
9626 <!\textsl{ATTLIST} def-list n ID \textsl{#REQUIRED}>
9627
9628 <!\textsl{ELEMENT} \textbf{list-item} \textsl{EMPTY}>
9629 <!\textsl{ATTLIST} list-item v CDATA \textsl{#REQUIRED}>
9630
9631 <!\textsl{ELEMENT} \textbf{section-def-macros} (def-macro)+>
9632
9633 <!\textsl{ELEMENT} \textbf{def-macro} (\%sentence;)+>
9634 <!\textsl{ATTLIST} def-macro n ID \textsl{#REQUIRED}>
9635 <!\textsl{ATTLIST} def-macro npar CDATA \textsl{#REQUIRED}>
9636
9637 <!\textsl{ELEMENT} \textbf{section-rules} (rule+)>
9638
9639 <!\textsl{ELEMENT} \textbf{rule} (pattern, action)>
9640 <!\textsl{ATTLIST} rule comment CDATA \textsl{#IMPLIED}>
9641
9642 <!\textsl{ELEMENT} \textbf{pattern} (pattern-item+)>
9643
9644 <!\textsl{ELEMENT} \textbf{pattern-item} \textsl{EMPTY}>
9645 <!\textsl{ATTLIST} pattern-item n \textsl{IDREF} \textsl{#REQUIRED}>
9646
9647 <!\textsl{ELEMENT} \textbf{action} (\%sentence;)*>
9648
9649 <!\textsl{ELEMENT} \textbf{choose} (when+,otherwise?)>
9650
9651 <!\textsl{ELEMENT} \textbf{when} (test,(\%sentence;)*)>
9652
9653 <!\textsl{ELEMENT} \textbf{otherwise} (\%sentence;)+>
9654
9655 <!\textsl{ELEMENT} \textbf{test} (\%condition;)+>
9656
9657 <!\textsl{ELEMENT} \textbf{and} ((\%condition;),(\%condition;)+)>
9658
9659 <!\textsl{ELEMENT} \textbf{or} ((\%condition;),(\%condition;)+)>
9660
9661 <!\textsl{ELEMENT} \textbf{not} (\%condition;)>
9662
9663 <!\textsl{ELEMENT} \textbf{equal} (\%value;,\%value;)>
9664 <!\textsl{ATTLIST} equal caseless (no|yes) \textsl{#IMPLIED}>
9665
9666 <!\textsl{ELEMENT} \textbf{begins-with} (\%value;,\%value;)>
9667 <!\textsl{ATTLIST} begins-with caseless (no|yes) \textsl{#IMPLIED}>
9668
9669 <!\textsl{ELEMENT} \textbf{ends-with} (\%value;,\%value;)>
9670 <!\textsl{ATTLIST} ends-with caseless (no|yes) \textsl{#IMPLIED}>
9671
9672 <!\textsl{ELEMENT} \textbf{contains-substring} (\%value;,\%value;)>
9673 <!\textsl{ATTLIST} contains-substring caseless (no|yes) \textsl{#IMPLIED}>
9674
9675 <!\textsl{ELEMENT} \textbf{in} (\%value;, list)>
9676 <!\textsl{ATTLIST} in caseless (no|yes) \textsl{#IMPLIED}>
9677
9678 <!\textsl{ELEMENT} \textbf{list} \textsl{EMPTY}>
9679 <!\textsl{ATTLIST} list n \textsl{IDREF} \textsl{#REQUIRED}>
9680
9681 <!\textsl{ELEMENT} \textbf{let} (\%container;, \%value;)>
9682
9683 <!\textsl{ELEMENT} \textbf{append} (\%value;)+>
9684 <!\textsl{ATTLIST} append n \textsl{IDREF} \textsl{#REQUIRED}>
9685
9686 <!\textsl{ELEMENT} \textbf{out} (mlu|lu|b|chunk)+>
9687
9688 <!\textsl{ELEMENT} \textbf{modify-case} (\%container;, \%stringvalue;)>
9689
9690 <!\textsl{ELEMENT} \textbf{call-macro} (with-param)*>
9691 <!\textsl{ATTLIST} call-macro n \textsl{IDREF} \textsl{#REQUIRED}>
9692
9693 <!\textsl{ELEMENT} \textbf{with-param} \textsl{EMPTY}>
9694 <!\textsl{ATTLIST} with-param pos CDATA \textsl{#REQUIRED}>
9695
9696 <!\textsl{ELEMENT} \textbf{clip} \textsl{EMPTY}>
9697 <!\textsl{ATTLIST} clip pos CDATA \textsl{#REQUIRED}
9698                side (sl|tl) \textsl{#REQUIRED}
9699                part CDATA \textsl{#REQUIRED}
9700                queue CDATA \textsl{#IMPLIED}
9701                link-to CDATA \textsl{#IMPLIED}>
9702
9703 <!\textsl{ELEMENT} \textbf{lit} \textsl{EMPTY}>
9704 <!\textsl{ATTLIST} lit v CDATA \textsl{#REQUIRED}>
9705
9706 <!\textsl{ELEMENT} \textbf{lit-tag} \textsl{EMPTY}>
9707 <!\textsl{ATTLIST} lit-tag v CDATA \textsl{#REQUIRED}>
9708
9709 <!\textsl{ELEMENT} \textbf{var} \textsl{EMPTY}>
9710 <!\textsl{ATTLIST} var n \textsl{IDREF} \textsl{#REQUIRED}>
9711
9712 <!\textsl{ELEMENT} \textbf{get-case-from} (clip|lit|var)>
9713 <!\textsl{ATTLIST} get-case-from pos CDATA \textsl{#REQUIRED}>
9714
9715 <!\textsl{ELEMENT} \textbf{case-of} \textsl{EMPTY}>
9716 <!\textsl{ATTLIST} case-of pos CDATA \textsl{#REQUIRED}
9717                   side (sl|tl) \textsl{#REQUIRED}
9718                   part CDATA \textsl{#REQUIRED}>
9719
9720 <!\textsl{ELEMENT} \textbf{concat} (\%value;)+>
9721
9722 <!\textsl{ELEMENT} \textbf{mlu} (lu+)>
9723
9724 <!\textsl{ELEMENT} \textbf{lu} (\%value;)+>
9725
9726 <!\textsl{ELEMENT} \textbf{chunk} (tags,(mlu|lu|b)+)>
9727 <!\textsl{ATTLIST} chunk name CDATA \textsl{#IMPLIED}
9728                 namefrom CDATA \textsl{#IMPLIED}
9729                 case CDATA \textsl{#IMPLIED}>
9730
9731 <!\textsl{ELEMENT} \textbf{tags} (tag+)>
9732 <!\textsl{ELEMENT} \textbf{tag} (\%value;)>
9733
9734 <!\textsl{ELEMENT} \textbf{b} \textsl{EMPTY}>
9735 <!\textsl{ATTLIST} b pos CDATA \textsl{#IMPLIED}>
9736
9737 \end{alltt}
9738 \end{small}
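
The following is a minimal sketch of a rule file that follows the DTD
above: a single rule detects a determiner followed by a noun and
outputs both words inside one chunk. The category, attribute and
variable names, the chunk name \texttt{det\_nom}, the chunk tag
\texttt{SN} and the use of \texttt{whole} as the value of
\texttt{part} are assumptions made for the example.

\begin{small}
\begin{verbatim}
<transfer>
  <section-def-cats>
    <def-cat n="det"><cat-item tags="det.*"/></def-cat>
    <def-cat n="nom"><cat-item tags="n.*"/></def-cat>
  </section-def-cats>
  <section-def-attrs>
    <def-attr n="gen">
      <attr-item tags="m"/>
      <attr-item tags="f"/>
    </def-attr>
  </section-def-attrs>
  <section-def-vars>
    <def-var n="tmp"/>
  </section-def-vars>
  <section-rules>
    <rule comment="determiner + noun">
      <pattern>
        <pattern-item n="det"/>
        <pattern-item n="nom"/>
      </pattern>
      <action>
        <out>
          <!-- a chunk needs <tags> first, then lexical units/blanks -->
          <chunk name="det_nom">
            <tags><tag><lit-tag v="SN"/></tag></tags>
            <lu><clip pos="1" side="tl" part="whole"/></lu>
            <b pos="1"/>
            <lu><clip pos="2" side="tl" part="whole"/></lu>
          </chunk>
        </out>
      </action>
    </rule>
  </section-rules>
</transfer>
\end{verbatim}
\end{small}

Since the \texttt{n} attributes of \texttt{<def-cat>},
\texttt{<def-attr>} and \texttt{<def-var>} are all of type
\texttt{ID}, their values have to be unique across the whole file, not
only within each section.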
9739
9740
9741
9742 \newpage
9743 \section{DTD of the interchunk module}
9744 \label{ss:dtdinterchunk}
9745
This DTD defines the format of the structural transfer rules in the
\texttt{interchunk} module. It is provided with the
\texttt{apertium} package (version 2.0), which can be downloaded from
\url{http://www.sourceforge.net}.
9750
9751 Its elements are described in Section \ref{formatotransfer}.
9752
9753
9754 \begin{small}
9755 \begin{alltt}
9756
9757 <!\textsl{ENTITY} \% condition "(and|or|not|equal|begins-with|
9758                        ends-with|contains-substring|in)">
9759 <!\textsl{ENTITY} \% container "(var|clip)">
9760 <!\textsl{ENTITY} \% sentence "(let|out|choose|modify-case|
9761                       call-macro|append)">
9762 <!\textsl{ENTITY} \% value "(b|clip|lit|lit-tag|var|get-case-from|
9763                    case-of|concat)">
9764 <!\textsl{ENTITY} \% stringvalue "(clip|lit|var|get-case-from|
9765                          case-of)">
9766
9767 <!\textsl{ELEMENT} \textbf{interchunk} (section-def-cats,
9768                       section-def-attrs,
9769                       section-def-vars,
9770                       section-def-lists?,
9771                       section-def-macros?,
9772                       section-rules)>
9773
9774 <!\textsl{ELEMENT} \textbf{section-def-cats} (def-cat+)>
9775
9776 <!\textsl{ELEMENT} \textbf{def-cat} (cat-item+)>
9777 <!\textsl{ATTLIST} def-cat n ID \textsl{#REQUIRED}>
9778
9779 <!\textsl{ELEMENT} \textbf{cat-item} \textsl{EMPTY}>
9780 <!\textsl{ATTLIST} cat-item lemma CDATA \textsl{#IMPLIED}
9781                     tags CDATA \textsl{#REQUIRED} >
9782
9783 <!\textsl{ELEMENT} \textbf{section-def-attrs} (def-attr+)>
9784
9785 <!\textsl{ELEMENT} \textbf{def-attr} (attr-item+)>
9786 <!\textsl{ATTLIST} def-attr n ID \textsl{#REQUIRED}>
9787
9788 <!\textsl{ELEMENT} \textbf{attr-item} \textsl{EMPTY}>
9789 <!\textsl{ATTLIST} attr-item tags CDATA \textsl{#IMPLIED}>
9790
9791 <!\textsl{ELEMENT} \textbf{section-def-vars} (def-var+)>
9792
9793 <!\textsl{ELEMENT} \textbf{def-var} \textsl{EMPTY}>
9794 <!\textsl{ATTLIST} def-var n ID \textsl{#REQUIRED}>
9795
9796 <!\textsl{ELEMENT} \textbf{section-def-lists} (def-list)+>
9797
9798 <!\textsl{ELEMENT} \textbf{def-list} (list-item+)>
9799 <!\textsl{ATTLIST} def-list n ID \textsl{#REQUIRED}>
9800
9801 <!\textsl{ELEMENT} \textbf{list-item} \textsl{EMPTY}>
9802 <!\textsl{ATTLIST} list-item v CDATA \textsl{#REQUIRED}>
9803
9804 <!\textsl{ELEMENT} \textbf{section-def-macros} (def-macro)+>
9805
9806 <!\textsl{ELEMENT} \textbf{def-macro} (\%sentence;)+>
9807 <!\textsl{ATTLIST} def-macro n ID \textsl{#REQUIRED}>
9808 <!\textsl{ATTLIST} def-macro npar CDATA \textsl{#REQUIRED}>
9809
9810 <!\textsl{ELEMENT} \textbf{section-rules} (rule+)>
9811
9812 <!\textsl{ELEMENT} \textbf{rule} (pattern, action)>
9813 <!\textsl{ATTLIST} rule comment CDATA \textsl{#IMPLIED}>
9814
9815 <!\textsl{ELEMENT} \textbf{pattern} (pattern-item+)>
9816
9817 <!\textsl{ELEMENT} \textbf{pattern-item} \textsl{EMPTY}>
9818 <!\textsl{ATTLIST} pattern-item n \textsl{IDREF} \textsl{#REQUIRED}>
9819
9820 <!\textsl{ELEMENT} \textbf{action} (\%sentence;)*>
9821
9822 <!\textsl{ELEMENT} \textbf{choose} (when+,otherwise?)>
9823
9824 <!\textsl{ELEMENT} \textbf{when} (test,(\%sentence;)*)>
9825
9826 <!\textsl{ELEMENT} \textbf{otherwise} (\%sentence;)+>
9827
9828 <!\textsl{ELEMENT} \textbf{test} (\%condition;)+>
9829
9830 <!\textsl{ELEMENT} \textbf{and} ((\%condition;),(\%condition;)+)>
9831
9832 <!\textsl{ELEMENT} \textbf{or} ((\%condition;),(\%condition;)+)>
9833
9834 <!\textsl{ELEMENT} \textbf{not} (\%condition;)>
9835
9836 <!\textsl{ELEMENT} \textbf{equal} (\%value;,\%value;)>
9837 <!\textsl{ATTLIST} equal caseless (no|yes) \textsl{#IMPLIED}>
9838
9839 <!\textsl{ELEMENT} \textbf{begins-with} (\%value;,\%value;)>
9840 <!\textsl{ATTLIST} begins-with caseless (no|yes) \textsl{#IMPLIED}>
9841
9842 <!\textsl{ELEMENT} \textbf{ends-with} (\%value;,\%value;)>
9843 <!\textsl{ATTLIST} ends-with caseless (no|yes) \textsl{#IMPLIED}>
9844
9845 <!\textsl{ELEMENT} \textbf{contains-substring} (\%value;,\%value;)>
9846 <!\textsl{ATTLIST} contains-substring caseless (no|yes) \textsl{#IMPLIED}>
9847
9848 <!\textsl{ELEMENT} \textbf{in} (\%value;, list)>
9849 <!\textsl{ATTLIST} in caseless (no|yes) \textsl{#IMPLIED}>
9850
9851 <!\textsl{ELEMENT} \textbf{list} \textsl{EMPTY}>
9852 <!\textsl{ATTLIST} list n \textsl{IDREF} \textsl{#REQUIRED}>
9853
9854 <!\textsl{ELEMENT} \textbf{let} (\%container;, \%value;)>
9855
9856 <!\textsl{ELEMENT} \textbf{append} (\%value;)+>
9857 <!\textsl{ATTLIST} append n \textsl{IDREF} \textsl{#REQUIRED}>
9858
9859 <!\textsl{ELEMENT} \textbf{out} (b|chunk)+>
9860
9861 <!\textsl{ELEMENT} \textbf{modify-case} (\%container;, \%stringvalue;)>
9862
9863 <!\textsl{ELEMENT} \textbf{call-macro} (with-param)*>
9864 <!\textsl{ATTLIST} call-macro n \textsl{IDREF} \textsl{#REQUIRED}>
9865
9866 <!\textsl{ELEMENT} \textbf{with-param} \textsl{EMPTY}>
9867 <!\textsl{ATTLIST} with-param pos CDATA \textsl{#REQUIRED}>
9868
9869 <!\textsl{ELEMENT} \textbf{clip} \textsl{EMPTY}>
9870 <!\textsl{ATTLIST} clip pos CDATA \textsl{#REQUIRED}
9871                part CDATA \textsl{#REQUIRED}>
9872
9873 <!\textsl{ELEMENT} \textbf{lit} \textsl{EMPTY}>
9874 <!\textsl{ATTLIST} lit v CDATA \textsl{#REQUIRED}>
9875
9876 <!\textsl{ELEMENT} \textbf{lit-tag} \textsl{EMPTY}>
9877 <!\textsl{ATTLIST} lit-tag v CDATA \textsl{#REQUIRED}>
9878
9879 <!\textsl{ELEMENT} \textbf{var} \textsl{EMPTY}>
9880 <!\textsl{ATTLIST} var n \textsl{IDREF} \textsl{#REQUIRED}>
9881
9882 <!\textsl{ELEMENT} \textbf{get-case-from} (clip|lit|var)>
9883 <!\textsl{ATTLIST} get-case-from pos CDATA \textsl{#REQUIRED}>
9884
9885 <!\textsl{ELEMENT} \textbf{case-of} \textsl{EMPTY}>
9886 <!\textsl{ATTLIST} case-of pos CDATA \textsl{#REQUIRED}
9887                   part CDATA \textsl{#REQUIRED}>
9888
9889 <!\textsl{ELEMENT} \textbf{concat} (\%value;)+>
9890
9891 <!\textsl{ELEMENT} \textbf{chunk} (\%value;)+>
9892
9893 <!\textsl{ELEMENT} \textbf{pseudolemma} (\%value;)>
9894
9895 <!\textsl{ELEMENT} \textbf{b} \textsl{EMPTY}>
9896 <!\textsl{ATTLIST} b pos CDATA \textsl{#IMPLIED}>
9897
9898 \end{alltt}
9899 \end{small}
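
As a rough sketch of a file that follows this DTD, the example below
contains one rule that matches a single chunk, stores its number in a
variable and outputs the chunk unchanged. The names of the category,
attribute and variable, as well as the values \texttt{lem},
\texttt{tags} and \texttt{chcontent} used for \texttt{part}, are
assumptions made for the example. Note that here \texttt{<clip>} has
no \texttt{side} attribute and that \texttt{<chunk>} directly contains
values.

\begin{small}
\begin{verbatim}
<interchunk>
  <section-def-cats>
    <def-cat n="SN"><cat-item tags="SN.*"/></def-cat>
  </section-def-cats>
  <section-def-attrs>
    <def-attr n="nbr">
      <attr-item tags="sg"/>
      <attr-item tags="pl"/>
    </def-attr>
  </section-def-attrs>
  <section-def-vars>
    <def-var n="number"/>
  </section-def-vars>
  <section-rules>
    <rule comment="single noun phrase chunk">
      <pattern>
        <pattern-item n="SN"/>
      </pattern>
      <action>
        <!-- remember the number of the chunk in a variable -->
        <let><var n="number"/><clip pos="1" part="nbr"/></let>
        <out>
          <chunk>
            <clip pos="1" part="lem"/>
            <clip pos="1" part="tags"/>
            <clip pos="1" part="chcontent"/>
          </chunk>
        </out>
      </action>
    </rule>
  </section-rules>
</interchunk>
\end{verbatim}
\end{small}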
9900
9901 \newpage
9902
9903 \section{DTD of the postchunk module}
9904 \label{ss:dtdpostchunk}
9905
This DTD defines the format of the structural transfer rules in the
\texttt{postchunk} module. It is provided with the
\texttt{apertium} package (version 2.0), which can be downloaded from
\url{http://www.sourceforge.net}.
9910
9911 Its elements are described in Section \ref{formatotransfer}.
9912
9913
9914
9915 \begin{small}
9916 \begin{alltt}
9917 <!\textsl{ENTITY} \% condition "(and|or|not|equal|begins-with|
9918                        ends-with|contains-substring|in)">
9919 <!\textsl{ENTITY} \% container "(var|clip)">
9920 <!\textsl{ENTITY} \% sentence "(let|out|choose|modify-case|
9921                       call-macro|append)">
9922 <!\textsl{ENTITY} \% value "(b|clip|lit|lit-tag|var|get-case-from|
9923                    case-of|concat)">
9924 <!\textsl{ENTITY} \% stringvalue "(clip|lit|var|get-case-from|
9925                          case-of)">
9926
9927 <!\textsl{ELEMENT} \textbf{postchunk} (section-def-cats,
9928                       section-def-attrs,
9929                       section-def-vars,
9930                       section-def-lists?,
9931                       section-def-macros?,
9932                       section-rules)>
9933
9934 <!\textsl{ELEMENT} \textbf{section-def-cats} (def-cat+)>
9935
9936 <!\textsl{ELEMENT} \textbf{def-cat} (cat-item+)>
9937 <!\textsl{ATTLIST} def-cat n ID \textsl{#REQUIRED}>
9938
9939 <!\textsl{ELEMENT} \textbf{cat-item} \textsl{EMPTY}>
9940 <!\textsl{ATTLIST} cat-item name CDATA \textsl{#REQUIRED}>
9941
9942 <!\textsl{ELEMENT} \textbf{section-def-attrs} (def-attr+)>
9943
9944 <!\textsl{ELEMENT} \textbf{def-attr} (attr-item+)>
9945 <!\textsl{ATTLIST} def-attr n ID \textsl{#REQUIRED}>
9946
9947 <!\textsl{ELEMENT} \textbf{attr-item} \textsl{EMPTY}>
9948 <!\textsl{ATTLIST} attr-item tags CDATA \textsl{#IMPLIED}>
9949
9950 <!\textsl{ELEMENT} \textbf{section-def-vars} (def-var+)>
9951
9952 <!\textsl{ELEMENT} \textbf{def-var} \textsl{EMPTY}>
9953 <!\textsl{ATTLIST} def-var n ID \textsl{#REQUIRED}>
9954
9955 <!\textsl{ELEMENT} \textbf{section-def-lists} (def-list)+>
9956
9957 <!\textsl{ELEMENT} \textbf{def-list} (list-item+)>
9958 <!\textsl{ATTLIST} def-list n ID \textsl{#REQUIRED}>
9959
9960 <!\textsl{ELEMENT} \textbf{list-item} \textsl{EMPTY}>
9961 <!\textsl{ATTLIST} list-item v CDATA \textsl{#REQUIRED}>
9962
9963 <!\textsl{ELEMENT} \textbf{section-def-macros} (def-macro)+>
9964
9965 <!\textsl{ELEMENT} \textbf{def-macro} (\%sentence;)+>
9966 <!\textsl{ATTLIST} def-macro n ID \textsl{#REQUIRED}>
9967 <!\textsl{ATTLIST} def-macro npar CDATA \textsl{#REQUIRED}>
9968
9969 <!\textsl{ELEMENT} \textbf{section-rules} (rule+)>
9970
9971 <!\textsl{ELEMENT} \textbf{rule} (pattern, action)>
9972 <!\textsl{ATTLIST} rule comment CDATA \textsl{#IMPLIED}>
9973
9974 <!\textsl{ELEMENT} \textbf{pattern} (pattern-item+)>
9975
9976 <!\textsl{ELEMENT} \textbf{pattern-item} \textsl{EMPTY}>
9977 <!\textsl{ATTLIST} pattern-item n \textsl{IDREF} \textsl{#REQUIRED}>
9978
9979 <!\textsl{ELEMENT} \textbf{action} (\%sentence;)*>
9980
9981 <!\textsl{ELEMENT} \textbf{choose} (when+,otherwise?)>
9982
9983 <!\textsl{ELEMENT} \textbf{when} (test,(\%sentence;)*)>
9984
9985 <!\textsl{ELEMENT} \textbf{otherwise} (\%sentence;)+>
9986
9987 <!\textsl{ELEMENT} \textbf{test} (\%condition;)+>
9988
9989 <!\textsl{ELEMENT} \textbf{and} ((\%condition;),(\%condition;)+)>
9990
9991 <!\textsl{ELEMENT} \textbf{or} ((\%condition;),(\%condition;)+)>
9992
9993 <!\textsl{ELEMENT} \textbf{not} (\%condition;)>
9994
9995 <!\textsl{ELEMENT} \textbf{equal} (\%value;,\%value;)>
9996 <!\textsl{ATTLIST} equal caseless (no|yes) \textsl{#IMPLIED}>
9997
9998 <!\textsl{ELEMENT} \textbf{begins-with} (\%value;,\%value;)>
9999 <!\textsl{ATTLIST} begins-with caseless (no|yes) \textsl{#IMPLIED}>
10000
10001 <!\textsl{ELEMENT} \textbf{ends-with} (\%value;,\%value;)>
10002 <!\textsl{ATTLIST} ends-with caseless (no|yes) \textsl{#IMPLIED}>
10003
10004 <!\textsl{ELEMENT} \textbf{contains-substring} (\%value;,\%value;)>
10005 <!\textsl{ATTLIST} contains-substring caseless (no|yes) \textsl{#IMPLIED}>
10006
10007 <!\textsl{ELEMENT} \textbf{in} (\%value;, list)>
10008 <!\textsl{ATTLIST} in caseless (no|yes) \textsl{#IMPLIED}>
10009
10010 <!\textsl{ELEMENT} \textbf{list} \textsl{EMPTY}>
10011 <!\textsl{ATTLIST} list n \textsl{IDREF} \textsl{#REQUIRED}>
10012
10013 <!\textsl{ELEMENT} \textbf{let} (\%container;, \%value;)>
10014
10015 <!\textsl{ELEMENT} \textbf{append} (\%value;)+>
10016 <!\textsl{ATTLIST} append n \textsl{IDREF} \textsl{#REQUIRED}>
10017
10018 <!\textsl{ELEMENT} \textbf{out} (b|lu|mlu)+>
10019
10020 <!\textsl{ELEMENT} \textbf{modify-case} (\%container;, \%stringvalue;)>
10021
10022 <!\textsl{ELEMENT} \textbf{call-macro} (with-param)*>
10023 <!\textsl{ATTLIST} call-macro n \textsl{IDREF} \textsl{#REQUIRED}>
10024
10025 <!\textsl{ELEMENT} \textbf{with-param} \textsl{EMPTY}>
10026 <!\textsl{ATTLIST} with-param pos CDATA \textsl{#REQUIRED}>
10027
10028 <!\textsl{ELEMENT} \textbf{clip} \textsl{EMPTY}>
10029 <!\textsl{ATTLIST} clip pos CDATA \textsl{#REQUIRED}
10030                part CDATA \textsl{#REQUIRED}>
10031
10032 <!\textsl{ELEMENT} \textbf{lit} \textsl{EMPTY}>
10033 <!\textsl{ATTLIST} lit v CDATA \textsl{#REQUIRED}>
10034
10035 <!\textsl{ELEMENT} \textbf{lit-tag} \textsl{EMPTY}>
10036 <!\textsl{ATTLIST} lit-tag v CDATA \textsl{#REQUIRED}>
10037
10038 <!\textsl{ELEMENT} \textbf{var} \textsl{EMPTY}>
10039 <!\textsl{ATTLIST} var n \textsl{IDREF} \textsl{#REQUIRED}>
10040
10041 <!\textsl{ELEMENT} \textbf{get-case-from} (clip|lit|var)>
10042 <!\textsl{ATTLIST} get-case-from pos CDATA \textsl{#REQUIRED}>
10043
10044 <!\textsl{ELEMENT} \textbf{case-of} \textsl{EMPTY}>
10045 <!\textsl{ATTLIST} case-of pos CDATA \textsl{#REQUIRED}
10046                   part CDATA \textsl{#REQUIRED}>
10047
10048 <!\textsl{ELEMENT} \textbf{concat} (\%value;)+>
10049
10050 <!\textsl{ELEMENT} \textbf{mlu} (lu+)>
10051
10052 <!\textsl{ELEMENT} \textbf{lu} (\%value;)+>
10053
10054 <!\textsl{ELEMENT} \textbf{b} \textsl{EMPTY}>
10055 <!\textsl{ATTLIST} b pos CDATA \textsl{#IMPLIED}>
10056
10057 \end{alltt}
10058 \end{small}
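
As an illustration of how these declarations combine, the following
fragment sketches a single rule that is valid with respect to the
elements listed above. It is not taken from any released language
pair: the category names \texttt{det} and \texttt{nom} and the
attribute name \texttt{gen} are hypothetical and would have to be
declared with \texttt{def-cat} and \texttt{def-attr}. The fragment
also assumes that the entities \texttt{\%sentence;},
\texttt{\%condition;}, \texttt{\%value;} and \texttt{\%container;}
admit the elements used here, and that \texttt{whole} is accepted as
a value of \texttt{part}.

\begin{small}
\begin{alltt}
<rule comment="hypothetical example: determiner followed by noun">
  <pattern>
    <pattern-item n="det"/>
    <pattern-item n="nom"/>
  </pattern>
  <action>
    <choose>
      <when>
        
        <let>
          <clip pos="1" part="gen"/>
          <clip pos="2" part="gen"/>
        </let>
      </when>
    </choose>
    <out>
      <lu><clip pos="1" part="whole"/></lu>
      <b pos="1"/>
      <lu><clip pos="2" part="whole"/></lu>
    </out>
  </action>
</rule>
\end{alltt}
\end{small}

The rule copies the gender of the second matched word onto the first
one when the first is tagged \texttt{mf}, and then outputs both
lexical units separated by the original blank.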
10059
10060 \newpage
10061
10062
10063 \section[DTD for the format rules]{DTD for the format specification
10064 rules}
10065 \label{ss:dtd_formato}
10066
10067 This section gives the DTD for the format specification rules. The
10068 definition can be downloaded from
10069 \url{http://cvs.sourceforge.net/viewcvs.py/apertium/apertium/apertium/format.dtd}. \nota{needs
10070 updating}
10071
10072
10073 Its elements are described in Section \ref{ss:reglasformato}.
10074
10075 \begin{small}
10076 \begin{alltt}
10077 <!\textsl{ELEMENT} \textbf{format} (options,rules)>
10078 <!\textsl{ATTLIST} format name \textsl{CDATA} \textsl{#REQUIRED}>
10079
10080 <!\textsl{ELEMENT} \textbf{options} (largeblocks, input, output,
10081                    escape-chars, space-chars, case-sensitive)>
10082
10083 <!\textsl{ELEMENT} \textbf{largeblocks} \textsl{EMPTY}>
10084 <!\textsl{ATTLIST} largeblocks size \textsl{CDATA} \textsl{#REQUIRED}>
10085
10086 <!\textsl{ELEMENT} \textbf{input} \textsl{EMPTY}>
10087 <!\textsl{ATTLIST} input zip-path \textsl{CDATA} \textsl{#IMPLIED}
10088                 encoding \textsl{CDATA} \textsl{#REQUIRED}>
10089
10090 <!\textsl{ELEMENT} \textbf{output} \textsl{EMPTY}>
10091 <!\textsl{ATTLIST} output zip-path \textsl{CDATA} \textsl{#IMPLIED}
10092                  encoding \textsl{CDATA} \textsl{#REQUIRED}>
10093
10094 <!\textsl{ELEMENT} \textbf{escape-chars} \textsl{EMPTY}>
10095 <!\textsl{ATTLIST} escape-chars regexp \textsl{CDATA} \textsl{#REQUIRED}>
10096
10097 <!\textsl{ELEMENT} \textbf{space-chars} \textsl{EMPTY}>
10098 <!\textsl{ATTLIST} space-chars regexp \textsl{CDATA} \textsl{#REQUIRED}>
10099
10100 <!\textsl{ELEMENT} \textbf{case-sensitive} \textsl{EMPTY}>
10101 <!\textsl{ATTLIST} case-sensitive value (yes|no) \textsl{#REQUIRED}>
10102
10103 <!\textsl{ELEMENT} \textbf{rules} (format-rule|replacement-rule)+>
10104
10105 <!\textsl{ELEMENT} \textbf{format-rule} (begin-end|(begin,end))>
10106 <!\textsl{ATTLIST} format-rule eos (yes|no) \textsl{#IMPLIED}
10107                       priority \textsl{CDATA} \textsl{#REQUIRED}>
10108
10109 <!\textsl{ELEMENT} \textbf{begin-end} \textsl{EMPTY}>
10110 <!\textsl{ATTLIST} begin-end regexp \textsl{CDATA} \textsl{#REQUIRED}>
10111
10112 <!\textsl{ELEMENT} \textbf{begin} \textsl{EMPTY}>
10113 <!\textsl{ATTLIST} begin regexp \textsl{CDATA} \textsl{#REQUIRED}>
10114
10115 <!\textsl{ELEMENT} \textbf{end} \textsl{EMPTY}>
10116 <!\textsl{ATTLIST} end regexp \textsl{CDATA} \textsl{#REQUIRED}>
10117
10118 <!\textsl{ELEMENT} \textbf{replacement-rule} (replace+)>
10119 <!\textsl{ATTLIST} replacement-rule regexp \textsl{CDATA} \textsl{#REQUIRED}>
10120
10121 <!\textsl{ELEMENT} \textbf{replace} \textsl{EMPTY}>
10122 <!\textsl{ATTLIST} replace source \textsl{CDATA} \textsl{#REQUIRED}
10123                   target \textsl{CDATA} \textsl{#REQUIRED}
10124                   prefer (yes|no) \textsl{#IMPLIED}>
10125
10126 \end{alltt}
10127 \end{small}
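
A minimal format specification that satisfies this DTD might look
like the following sketch. The format name, the block size, the
encodings and all regular expressions are invented for illustration
and do not reproduce any of the format files distributed with the
system; only the element and attribute names come from the DTD above.

\begin{small}
\begin{alltt}
<format name="example">
  <options>
    <largeblocks size="8192"/>
    <input encoding="UTF-8"/>
    <output encoding="UTF-8"/>
    <escape-chars regexp="[@/]"/>
    <space-chars regexp="[ ]"/>
    <case-sensitive value="no"/>
  </options>
  <rules>
    <format-rule eos="no" priority="1">
      <begin regexp="&lt;!--"/>
      <end regexp="--&gt;"/>
    </format-rule>
    <format-rule eos="yes" priority="2">
      <begin-end regexp="&lt;br/&gt;"/>
    </format-rule>
    <replacement-rule regexp="&amp;(amp|lt);">
      <replace source="&amp;amp;" target="&amp;"/>
      <replace source="&amp;lt;" target="&lt;"/>
    </replacement-rule>
  </rules>
</format>
\end{alltt}
\end{small}

Note that the six children of \texttt{options} must appear in the
order fixed by the DTD, and that every \texttt{format-rule} contains
either a single \texttt{begin-end} or a \texttt{begin} followed by an
\texttt{end}.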
10128
10129 \newpage
10130 \section{DTD for the form paradigms}
10131 \label{ss:dtdparadigmes}
10132
10133 This section gives the DTD for the paradigm files used in the forms. This
10134 definition is included in the package
10135 \texttt{apertium-lexical-webform}.
10136
10137 \begin{small}
10138 \begin{alltt}
10139
10140
10141 <!\textsl{ELEMENT} \textbf{form} (entry)+>
10142
10143 <!\textsl{ATTLIST} \textbf{form}
10144         lang CDATA \textsl{#REQUIRED}
10145         langpair CDATA \textsl{#REQUIRED}>
10146
10147 <!\textsl{ELEMENT} \textbf{entry} (endings, paradigms)+>
10148
10149 <!\textsl{ATTLIST} \textbf{entry}
10150         PoS CDATA \textsl{#REQUIRED}
10151         nbr CDATA \textsl{#IMPLIED}
10152         gen CDATA \textsl{#IMPLIED}>
10153
10154 <!\textsl{ELEMENT} \textbf{endings} (stem, ending+)>
10155
10156 <!\textsl{ELEMENT} \textbf{stem} (\textsl{#PCDATA})>
10157
10158 <!\textsl{ELEMENT} \textbf{ending} (\textsl{#PCDATA})>
10159
10160 <!\textsl{ELEMENT} \textbf{paradigms} (par+)>
10161
10162 <!\textsl{ATTLIST} \textbf{paradigms} howmany CDATA \textsl{#REQUIRED}>
10163
10164 <!\textsl{ELEMENT} \textbf{par} \textsl{EMPTY}>
10165
10166 <!\textsl{ATTLIST} \textbf{par} n CDATA \textsl{#REQUIRED}>
10167
10168
10169 \end{alltt}
10170 \end{small}
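
The following sketch shows the shape of a paradigm file that is valid
with respect to this DTD. All attribute values and text contents
(language codes, stem, endings and the paradigm name
\texttt{example-n}) are hypothetical; they are only meant to show how
the elements nest, not how the real files in
\texttt{apertium-lexical-webform} are filled in.

\begin{small}
\begin{alltt}
<form lang="es" langpair="es-ca">
  <entry PoS="n" gen="m">
    <endings>
      <stem>perr</stem>
      <ending>o</ending>
      <ending>os</ending>
    </endings>
    <paradigms howmany="1">
      <par n="example-n"/>
    </paradigms>
  </entry>
</form>
\end{alltt}
\end{small}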
10171
10172
10173
10174 \chapter[Grammatical symbols]{Grammatical symbols used in the modules}
10175 \label{se:simbolosmorf}
10176
10177
10178
10179 \section[Dictionary symbols]{Grammatical symbols used in dictionaries}
10180
10181 \subsection{List of symbols}
10182
10183
10184 \begin{tabular}{ll}
10185
10186 \textbf{aa} & adjective-adjective (function of relative pronoun) \\
10187  \textbf{acr} & acronym \\ \textbf{al} & others (for proper nouns) \\
10188  \textbf{an} & adjective-noun (function of relative pronoun) \\
10189  \textbf{ant} & anthroponym \\ \textbf{cni} & conditional \\
10190  \textbf{cnjadv} & adverbial conjunction\\ \textbf{cnjcoo} &
10191  co-ordinating conjunction\\ \textbf{cnjsub} & subordinating
10192  conjunction\\ \textbf{def} & definite \\ \textbf{dem} & demonstrative
10193  \\ \textbf{det} & determiner\\ \textbf{detnt} & neuter determiner \\
10194  \textbf{enc} & enclitic\\ \textbf{f} & feminine\\ \textbf{fti} &
10195  future indicative\\ \textbf{fts} & future subjunctive\\ \textbf{ger}
10196  & gerund\\ \textbf{ifi} & perfect preterite\\ \textbf{ij} &
10197  interjection\\ \textbf{imp} & imperative\\ \textbf{ind} &
10198  indefinite\\ \textbf{inf} & infinitive\\
10199
10200 \end{tabular} \newpage
10201
10202 \begin{tabular}{ll} \textbf{itg} & interrogative\\
10203 \textbf{loc} & locative\\
10204 \textbf{lpar} & ([\\ \textbf{lquest} & ¿\\ \textbf{m} & masculine\\
10205 \textbf{mf} & masculine-feminine\\ \textbf{n} & noun\\ \textbf{nn} &
10206 noun-noun (function of relative pronoun)\\ \textbf{np} & proper noun\\
10207 \textbf{nt} & neuter\\ \textbf{num} & numeral - number\\ \textbf{p1} &
10208 first person\\ \textbf{p2} & second person\\ \textbf{p3} & third
10209 person\\ \textbf{pii} & imperfect preterite indicative \\ \textbf{pis}
10210 & imperfect preterite subjunctive \\ \textbf{pl} & plural \\
10211 \textbf{pos} & possessive\\ \textbf{pp} & participle\\ \textbf{pr} &
10212 preposition\\ \textbf{preadv} & preadverb\\ \textbf{predet} &
10213 predeterminer\\ \textbf{pri} & present indicative\\ \textbf{prn} &
10214 pronoun\\ \textbf{pro} & proclitic\\ \textbf{prs} & present
10215 subjunctive\\ \textbf{ref} & reflexive\\ \textbf{rel} & relative\\
10216 \textbf{rpar} & )]\\ \textbf{sent} & . ? ; : ! \\ \textbf{sg} &
10217 singular\\ \textbf{sp} & singular-plural\\ \textbf{sup} &
10218 superlative\\ \textbf{tn} & tonic\\ \textbf{vaux} & auxiliary verb\\
10219 \textbf{vbhaver}& verb \emph{to have}\\ \textbf{vblex} & lexical
10220 verb\\ \textbf{vbmod} & modal verb\\ \textbf{vbser} & verb \emph{to
10221 be}
10222
10223 \end{tabular}
10224
10225 \newpage
10226 \subsection{Specification of lexical forms}
10227
10228 The following table gives the order in which grammatical symbols are
10229 placed in the morphological dictionaries of this system (from left to
10230 right). The examples in brackets are from Spanish. \\ \\
10231
10232 \begin{footnotesize}
10233 \begin{tabular}{|l|llllll|}
10234
10235 \hline \textbf{Common adjectives} & \textbf{PoS} & \textbf{Gender} &
10236 \textbf{Number} &&& \\\cline{2-7}
10237 (difícil, rojo) & adj & m & sg &&& \\ & & f & pl &&& \\ & & mf & sp
10238                   &&& \\ \hline \textbf{Interrogative, possessive,} &
10239                   \textbf{PoS} & \textbf{Type} & \textbf{Gender}
10240                   &\textbf{Number}&& \\\cline{2-7}
10241 \textbf{indefinite and superlative} & adj & itg & m & sg &&\\
10242  \textbf{adjectives} & & pos & f & pl &&\\ (qué, tus, otra, buenísimo)
10243   & & ind & mf & sp &&\\ & & sup & & &&\\\hline
10244
10245
10246 \textbf{Adverbs} & \textbf{PoS} &&&&&\\\cline{2-7} (siempre, mañana)&
10247 adv &&&&&\\\hline
10248
10249 \textbf{Preadverbs} & \textbf{PoS} &&&&&\\\cline{2-7} (muy, tan)&
10250 preadv &&&&&\\\hline
10251
10252 \textbf{Interrogative adverbs} & \textbf{PoS}
10253 &\textbf{Type}&&&&\\\cline{2-7} (dónde) & adv & itg &&&&\\\hline
10254
10255 \textbf{Conjunctions} & \textbf{PoS} &&&&&\\\cline{2-7}
10256 (que, así como) & cnjadv &&&&&\\ & cnjcoo &&&&&\\ & cnjsub
10257 &&&&&\\\hline
10258
10259
10260 \textbf{Determiners} & \textbf{PoS} & \textbf{Type} & \textbf{Gender}
10261 &\textbf{Number}&& \\\cline{2-7} (el, uno, este, mi) & det & def & m &
10262 sg &&\\ & & ind & f & pl &&\\ & & dem & mf & sp &&\\ & & pos & &
10263 &&\\\hline
10264
10265 \textbf{Neuter determiners} & \textbf{PoS} &&&&&\\\cline{2-7} (lo)&
10266 detnt &&&&&\\\hline
10267
10268 \textbf{Predeterminers} & \textbf{PoS} & \textbf{Gender} &
10269 \textbf{Number} &&& \\\cline{2-7} (todos) & predet & m & sg &&&\\ & &
10270 f & pl &&&\\ & & nt & sp &&&\\\hline \textbf{Interjections} &
10271 \textbf{PoS} &&&&&\\\cline{2-7} (hola) & ij &&&&&\\\hline
10272
10273
10274
10275
10276 \textbf{Common nouns}& \textbf{PoS} & \textbf{Gender} &
10277 \textbf{Number} &&& \\\cline{2-7} (casa, perro) & n & m & sg &&&\\ & n
10278 & f & pl &&&\\ & n & mf & sp &&&\\\hline
10279
10280 \textbf{Proper nouns}& \textbf{PoS} &\textbf{Type}&&&&\\\cline{2-7}
10281 (Pedro, Londres) & np & ant &&&&\\ & & loc &&&&\\ & & al &&&&\\\hline
10282
10283 \end{tabular} \newpage
10284 \begin{tabular}{|l|llllll|} \hline
10285
10286 \textbf{Acronyms} & \textbf{PoS} & \textbf{Type} & \textbf{Gender} &
10287 \textbf{Number} && \\\cline{2-7} (IRPF, INEM) & n & acr & m & sg &&\\
10288 & & & f & pl &&\\ & & & mf & sp &&\\\hline
10289
10290
10291 \textbf{Numerals} & \textbf{PoS} & \textbf{Gender} & \textbf{Number}
10292 &&& \\\cline{2-7} (tres) & num & m & sg &&& \\ & & f & pl &&& \\ & &
10293 mf & sp &&& \\\hline
10294
10295 \textbf{Prepositions} & \textbf{PoS} &&&&&\\\cline{2-7} (de, por) & pr
10296 &&&&&\\\hline
10297
10298 \textbf{Interrogative pronouns} & \textbf{PoS} & \textbf{Type} &
10299 \textbf{Gender} &\textbf{Number}&& \\\cline{2-7} (quién, qué) & prn &
10300 itg & m & sg &&\\ & & & f & pl &&\\\hline
10301
10302
10303 \textbf{Enclitic, proclitic and} & \textbf{PoS} & \textbf{Type} &
10304 \textbf{Person}& \textbf{Gender} &\textbf{Number}& \\\cline{2-7}
10305 \textbf{tonic personal} & prn & enc & p1 & m & sg &\\
10306    \textbf{pronouns} & & pro & p2 & f & pl &\\ (yo, vosotros,
10307    ayudar\textbf{te}, & & tn & p3 & mf & sp & \\ \textbf{te} ayudo) &
10308    & & & nt && \\ & & & & & & \\\cline{2-7}
10309 \textbf{Procl. reflexive pron.} (se): & prn & pro & ref & p3 & mf &
10310 sp\\\cline{2-7} \textbf{Tonic reflex. pron.} (sí): & prn & tn & ref &
10311 p3 & mf & sp\\\hline
10312
10313
10314
10315 \textbf{Tonic possessive pron.} & \textbf{PoS} & \textbf{Type} &
10316 \textbf{Subtype}& \textbf{Gender} &\textbf{Number}& \\\cline{2-7}
10317 (mío, suyo) & prn & tn & pos & m & sg &\\ & & & & f & pl &\\\hline
10318
10319
10320 \textbf{Other tonic pronouns} & \textbf{PoS} & \textbf{Type} &
10321 \textbf{Gender} &\textbf{Number}&& \\\cline{2-7} (aquella, nadie,
10322 otro) & prn & tn & m & sg &&\\ & & & f & pl &&\\ & & & mf & sp && \\ &
10323 & & nt &&& \\\hline
10324
10325
10326 \textbf{Pronominal and adjectival} & \textbf{PoS} & \textbf{Type} &
10327 \textbf{Gender} & \textbf{Number} && \\\cline{2-7} \textbf{relatives}
10328 & rel & nn & m & sg &&\\ (que, cuyo) & & an & f & pl &&\\ & & aa & mf &
10329 sp &&\\\hline
10330
10331 \textbf{Adverbial relatives} & \textbf{PoS} & \textbf{Type} & & &&
10332 \\\cline{2-7} (como, donde) & rel & adv & & &&\\\hline
10333
10334
10335 \textbf{Verbs} & \textbf{Type} & \textbf{Tense} & \textbf{Person}
10336 &\textbf{Number}&& \\ \textbf{(personal forms)} & & \textbf{and mood}
10337 & & && \\\cline{2-7} (subo, vamos) & vblex & cni & p1 & sg &&\\ &
10338 vbser & fti & p2 & pl &&\\ & vbhaver & fts & p3 & &&\\ & vbmod & ifi &
10339 & &&\\ & & imp & & &&\\ & & pii & & &&\\ & & pis & & &&\\ & & pri & &
10340 &&\\ & & prs & & &&\\\hline
10341
10342
10343 \textbf{Verbs} & \textbf{Type} & \textbf{Form} & & && \\\cline{2-7}
10344 \textbf{(infinitive and gerund)} & vblex & inf & & &&\\ (cantar,
10345 buscando) & vbser & ger & & &&\\ & vbhaver & & & &&\\ & vbmod & & &
10346 &&\\\hline
10347
10348
10349
10350 \textbf{Verbs} & \textbf{Type} & \textbf{Form} &\textbf{Gender}
10351 &\textbf{Number} && \\\cline{2-7} \textbf{(participle)} & vblex & pp &
10352 m & sg &&\\ (dormido, cansadas) & vbser & & f & pl &&\\ & vbhaver & &
10353 & &&\\ & vbmod & & & &&\\\hline
10354
10355
10356
10357 \end{tabular}
10358 \end{footnotesize}
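
To make the ordering concrete, the following lines pair some of the
Spanish surface forms used as examples in the table with one possible
lexical form each. They are written in the usual notation in which
the lemma is followed by the grammatical symbols, each enclosed in
angle brackets; the particular analyses are given only as an
illustration of the symbol order and are not copied from a released
dictionary.

\begin{small}
\begin{alltt}
rojo       rojo<adj><m><sg>
casa       casa<n><f><sg>
Pedro      Pedro<np><ant>
el         el<det><def><m><sg>
subo       subir<vblex><pri><p1><sg>
dormido    dormir<vblex><pp><m><sg>
\end{alltt}
\end{small}

In each case the part of speech (or verb type) comes first, followed
by the remaining symbols in the left-to-right order of the
corresponding row of the table.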
10359
10360
10361 \newpage
10362 \section{Categories used in the part-of-speech tagger}
10363 \subsection{Spanish tagger}
10364
10365 These are the categories or coarse tags used by the Spanish
10366 part-of-speech tagger.
10367
10368
10369 \begin{footnotesize}
10370 \begin{longtable}{l|l|c|l} \hline \bf{Tag} & \bf{Description} &
10371 \bf{Closed} & \bf{Examples} \\ \hline \hline
10372 \endhead \multicolumn{4}{c}{\bf{Simple tags}} \\ \hline \hline PARAPR
10373 & Lexicalization of \emph{para} as a preposition & Yes & \\ \hline
10374 PARAVBPRI & Lexicalization of \emph{para} as a lexical verb & & \\ &
10375 in present indicative & Yes& \\ \hline PARAVBIMP & Lexicalization of
10376 \emph{para} as a lexical verb & & \\ & in imperative & Yes& \\ \hline
10377 QUECNJ & Lexicalization of \emph{que} as a conjunction & Yes& \\
10378 \hline QUEREL & Lexicalization of \emph{que} as a relative pronoun &
10379 Yes& \\ \hline COMOPR\footnote{The morphological analyser considers
10380 that \emph{como} can be a preposition since it can be replaced with
10381 \emph{en calidad de} in some contexts (e.g., \emph{'Os hablo como
10382 director de la película'}).} & Lexicalization of \emph{como} as a
10383        preposition& Yes& \\
10384 % TODO: explain this preposition reading, since it is not exactly standard
10385 \hline COMOREL & Lexicalization of \emph{como} as a
10386 relative pronoun & Yes& \\ \hline COMOVB & Lexicalization of
10387 \emph{como} as a lexical verb & & \\ & in present indicative & Yes& \\
10388 \hline MASADV & Lexicalization of \emph{más}/\emph{menos} as an adverb
10389 & Yes& \\ \hline MASADJ & Lexicalization of \emph{más}/\emph{menos} as
10390 an adjective & Yes& \\ \hline MASNP & Lexicalization of \emph{Más} as
10391 a proper noun & Yes& \\ \hline ALGOADV & Lexicalization of \emph{algo}
10392 as an adverb & Yes& \\ \hline ACRONIMOM & Acronym & No& BCH\\ \hline
10393 ACRONIMOF & Acronym & No& ONU\\ \hline ACRONIMOMF & Acronym & No&
10394 ATS\\ \hline INTNOM & Interrogative pronoun & Yes& quién, cuál\\
10395 \hline ADJINT & Interrogative adjective & Yes& cuánto, qué\\ \hline
10396 INTADV & Interrogative adverb & Yes& cuándo, dónde\\ \hline PREADV &
10397 Adverb that can precede another & &\\& adverb or an adjective & Yes&
10398 muy, bien, mal\\ \hline ADV & Adverb & No& nunca, ahí\\ \hline CNJSUBS
10399 & Subordinating conjunction & Yes& que\\
10400 % TODO: are there no subordinating conjunctions other than que?
10401 \hline CNJCOORD &
10402 Co-ordinating conjunction & Yes& y, pero\\ \hline CNJADV & Adverbial
10403 conjunction & No& si\\ \hline DETNT & Neuter determiner & Yes& lo\\
10404 \hline DETM & Determiner & Yes& el, un\\ \hline DETF & Determiner &
10405 Yes& la, una\\ \hline DETMF & Determiner & Yes& cada\\ \hline INTERJ &
10406 Interjection & No& ojalá, hola\\ \hline NOM & Noun & No& casa, coche\\
10407 \hline ANTROPONIM & Proper noun for person & No& Fernando\\ \hline
10408 TOPONIM & Proper noun for place & No& Alicante\\ \hline NPALTRES &
10409 Other proper nouns & No& Linux, Seat\\ \hline NUM & Numeral & Yes&
10410 tres, cuatro\\ \hline PREDETNT & Neuter predeterminer & Yes& todo\\
10411 \hline PREDET & Predeterminer & Yes& toda\\ \hline PREP & Preposition
10412 & Yes& ante, desde\\ \hline PRNTNNT & Neuter tonic pronoun & Yes&
10413 algo, esto\\ \hline PRNTN & Tonic pronoun & Yes& ambos, nadie\\ \hline
10414 PRNENCREF & Reflexive enclitic pronoun & Yes& se \\ \hline PRNPROREF &
10415 Reflexive proclitic pronoun & Yes& se \\ \hline PRNENC & Enclitic
10416 pronoun & Yes& me, nos\\ \hline PRNPRO & Proclitic pronoun & Yes& le,
10417 te\\ \hline VLEXINF & Lexical verb in infinitive & No& cantar, reír\\
10418 \hline VLEXGER & Lexical verb in gerund & No& hablando\\ \hline
10419 VLEXPARTPI & Lexical verb in participle & No& dicho, cantado\\ \hline
10420 VLEXPFCI & Lexical verb in present, future or & & \\ & conditional
10421 indicative & No& digo, diré, diría\\ \hline VLEXIPI & Lexical verb in
10422 imperfect preterite or & & \\ & perfect preterite indicative & No&
10423 cantaba, dijo\\ \hline VLEXSUBJ & Lexical verb in subjunctive & No&
10424 hablase, dijéramos\\ \hline VLEXIMP & Lexical verb in imperative & No&
10425 canta, comed\\ \hline VSERINF & Verb \emph{to be} in infinitive & Yes&
10426 ser\\ \hline VSERGER & Verb \emph{to be} in gerund & Yes& siendo\\
10427 \hline VSERPARTPI & Verb \emph{to be} in participle & Yes& sido\\
10428 \hline VSERPFCI & Verb \emph{to be} in present, future or & & \\ &
10429 conditional indicative & Yes& soy, seré, sería\\ \hline VSERIPI & Verb
10430 \emph{to be} in imperfect preterite or & & \\ & perfect preterite
10431 indicative & Yes& era, fui\\ \hline VSERSUBJ & Verb \emph{to be} in
10432 subjunctive & Yes& fueras\\ \hline VSERIMP & Verb \emph{to be} in
10433 imperative & Yes& sé\\ \hline VHABERINF & Verb \emph{to have} in
10434 infinitive & Yes& haber\\ \hline VHABERGER & Verb \emph{to have} in
10435 gerund & Yes& habiendo\\ \hline VHABERPARTPI & Verb \emph{to have} in
10436 participle & Yes& habido\\ \hline VHABERPFCI & Verb \emph{to have} in
10437 present, future or & & \\ & conditional indicative & Yes& hay, habrán,
10438 habría\\ \hline VHABERIPI & Verb \emph{to have} in imperfect preterite
10439 or & & \\ & perfect preterite indicative & Yes& había, hubo\\ \hline
10440 VHABERSUBJ & Verb \emph{to have} in subjunctive & Yes& hubieran\\
10441 \hline VMODALINF & Modal verb in infinitive & Yes& deber, poder\\
10442 \hline VMODALGER & Modal verb in gerund & Yes& debiendo\\ \hline
10443 VMODALPARTPI & Modal verb in participle & Yes& podido\\ \hline
10444 VMODALPFCI & Modal verb in present, future or & & \\ & conditional
10445 indicative & Yes& puede, deberá, podría\\ \hline VMODALIPI & Modal
10446 verb in imperfect preterite or & & \\ & perfect preterite indicative &
10447 Yes& podía, debió\\ \hline VMODALSUBJ & Modal verb in subjunctive &
10448 Yes& pudiese, debiéramos\\ \hline VMODALIMP & Modal verb in imperative
10449 & Yes& poded, debed\\ \hline ADJM & Adjective & No& gracioso\\ \hline
10450 ADJF & Adjective & No& graciosa\\ \hline ADJMF & Adjective & No&
10451 inteligente\\ \hline ADJPOS & Possessive adjective & Yes& mío\\ \hline
10452 REL & Relative pronoun & Yes& quien, cuya\\ \hline RELADV & Adverbial
10453 relative & Yes& cuando, donde\\ \hline \hline
10454 \multicolumn{4}{c}{\bf{Compound tags}} \\ \hline \hline PREPDET &
10455 Contraction of preposition and determiner & Yes& del, al\\ \hline
10456 PRCNJ & Multiword made of preposition and & & \\ &conjunction & Yes& a
10457 que\\ \hline PRREL & Multiword made of preposition and & & \\
10458 &relative & Yes& en que\\ \hline INFLEXPRNENC & Lexical verb in
10459 infinitive with enclitics & No& dármelo, cantarlo\\ \hline
10460 GERLEXPRNENC & Lexical verb in gerund with enclitics & No&
10461 cantándosela\\ \hline IMPLEXPRNENC & Lexical verb in imperative with
10462 enclitics & No& dímelo\\ \hline INFSERPRNENC & Verb \emph{to be} in
10463 infinitive with enclitics & Yes& serlo\\ \hline GERSERPRNENC & Verb
10464 \emph{to be} in gerund with enclitics & Yes& siéndolo\\ \hline
10465 IMPSERPRNENC & Verb \emph{to be} in imperative with enclitics & Yes&
10466 sedlo\\ \hline INFHABPRNENC & Verb \emph{to have} in infinitive with
10467 enclitics & Yes& habérsela\\ \hline GERHABPRNENC & Verb \emph{to have}
10468 in gerund with enclitics & Yes& habiéndole\\ \hline INFMODPRNENC &
10469 Modal verb in infinitive with enclitics & Yes& poderla, deberlo\\
10470 \hline GERMODPRNENC & Modal verb in gerund with enclitics& Yes&
10471 debiéndosela\\ \hline IMPMODPRNENC & Modal verb in imperative with
10472 enclitics& Yes& debédmela\\ \hline \hline \multicolumn{4}{c}{\bf{Other
10473 tags}} \\ \hline \hline LQUEST & Opening question mark & & ¿ \\ \hline
10474 LPAR & Opening parenthesis or square bracket & & (, [ \\ \hline RPAR &
10475 Closing parenthesis or square bracket & & ), ] \\ \hline CM & Comma &
10476 & , \\ \hline SENT & Sentence end character & & ., :, ;, ?, !\\ \hline
10477 \hline \multicolumn{4}{l}{}\\ %p{0.50\textwidth}
10478
10479 \end{longtable}
10480 \end{footnotesize}
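
As a sketch of how such categories might be declared, the fragment
below defines one lexicalized tag, one ordinary closed tag and one
compound tag. It assumes the element and attribute names of the
tagger definition file format (\texttt{def-label}, \texttt{tags-item},
\texttt{def-mult}, \texttt{sequence}, \texttt{label-item}); the tag
patterns in the \texttt{tags} attributes are illustrative guesses and
are not copied from the Spanish tagger definition.

\begin{small}
\begin{alltt}
<def-label name="PARAPR" closed="true">
  <tags-item lemma="para" tags="pr"/>
</def-label>

<def-label name="DETM" closed="true">
  <tags-item tags="det.*.m.*"/>
</def-label>

<def-mult name="PREPDET" closed="true">
  <sequence>
    <label-item label="PREP"/>
    <label-item label="DETM"/>
  </sequence>
</def-mult>
\end{alltt}
\end{small}

Under these assumptions, the \emph{Closed} column of the table
corresponds to the \texttt{closed} attribute, and a compound tag is
described as a sequence of simple tags.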
10481
10482 \subsection{Catalan tagger}
10483
10484 Due to the similarity of the Catalan tagger categories and the Spanish
10485 ones, we list here only the tags that are new or different in the
10486 Catalan tagger.
10487
10488 \begin{footnotesize}
10489 \begin{longtable}{l|l|c|l} \hline \bf{Tag} & \bf{Description} &
10490 \bf{Closed} & \bf{Examples} \\ \hline \hline
10491 \endhead \multicolumn{4}{c}{\bf{Simple tags}} \\ \hline \hline MOLTADV
10492 & Lexicalization of \emph{molt}/\emph{gaire} as an adverb & Yes & \\
10493 \hline MOLTPREADV & Lexicalization of \emph{molt}/\emph{gaire} as a
10494 preadverb & Yes& \\ \hline VOLERMOD & Lexicalization of \emph{voler} as a
10495 modal verb & Yes& \\ \hline VOLERLEX & Lexicalization of \emph{voler}
10496 as a lexical verb & Yes& \\ \hline VA & Lexicalization of \emph{va} as
10497 a form of the verb \emph{anar} & Yes& \\ \hline \multicolumn{4}{l}{}\\
10498 %p{0.50\textwidth}
10499 \end{longtable}
10500 \end{footnotesize}
10501
10502 \subsection{Galician tagger}
10503
10504
10505 Due to the similarity of the Galician tagger categories and the
10506 Spanish ones, we list here only the tags that are new or different in
10507 the Galician tagger.
10508
10509
10510 \begin{footnotesize}
10511 \begin{longtable}{l|l|c|l} \hline \bf{Tag} & \bf{Description} &
10512 \bf{Closed} & \bf{Examples} \\ \hline \hline
10513 \endhead \multicolumn{4}{c}{\bf{Simple tags}} \\ \hline \hline VBIRNPS
10514 & Lexicalization of \emph{to go} in infinitive & & \\ & and gerund &
10515 Yes & \\ \hline VBIRPARTPI & Lexicalization of \emph{to go} in
10516 participle & Yes& \\ \hline VBIRPS & Lexicalization of \emph{to go} in
10517 the personal forms & & \\ & of indicative & & \\ & and subjunctive &
10518 Yes& \\ \hline VBIRIMP & Lexicalization of \emph{to go} in imperative
10519 & Yes& \\ \hline VHABERNPS & Lexicalization of \emph{to have} in
10520 infinitive & & \\ & and gerund & Yes & \\ \hline VHABERPARTPI &
10521 Lexicalization of \emph{to have} in participle & Yes& \\ \hline
10522 VHABERPS & Lexicalization of \emph{to have} in the personal forms & &
10523 \\ & of indicative & & \\ & and subjunctive & Yes& \\ \hline VHABERIMP
10524 & Lexicalization of \emph{to have} in imperative & Yes& \\ \hline
10525 APREP & Lexicalization of \emph{a} as a preposition & Yes& \\ \hline
10526 VLEXNPS & Lexical verb: infinitive and gerund & No& achegar,
10527 achegándomos\\ \hline VLEXPS & Lexical verb: personal forms & & \\ &
10528 in indicative & No& achegue, achegaré\\ \hline VSERNPS & Verb \emph{to
10529 be}: infinitive and gerund & Yes& ser, seres\\ \hline VSERPS & Verb
10530 \emph{to be}: personal forms & & \\ & in indicative& Yes& fosen, es\\
10531 \hline \hline \multicolumn{4}{c}{\bf{Compound tags}} \\ \hline \hline
10532 PREPDETM &
10533 Contraction of preposition and & & \\ & masculine determiner & Yes&
10534 do, ao\\ \hline PREPDETF & Contraction of preposition and & & \\ &
10535 feminine determiner & Yes& da, ás\\ \hline PREPDETN & Contraction of
10536 preposition and & & \\ & neuter determiner & Yes& do\\ \hline
10537 PREPDETDET & Contraction of preposition and & & \\ & two determiners &
10538 Yes& destoutro\\ \hline PREPPRTNNT & Contraction of preposition and &
10539 & \\ &neuter tonic pronoun & Yes& daquilo\\ \hline PREPPRNTN &
10540 Contraction of preposition and & & \\ &tonic pronoun & Yes&
10541 daqueloutra\\ \hline PREPTNTN & Contraction of preposition and & & \\
10542 & two tonic pronouns & Yes& nestoutra\\ \hline PREPNUM & Contraction
10543 of preposition and & & \\ & numeral & Yes& dunha\\ \hline PREDETDET &
10544 Contraction of predeterminer and & & \\ & determiner & Yes& tódalas\\
10545 \hline INTADVDET & Contraction of adverbial interrogative and & & \\
10546 & determiner & Yes& u-la\\ \hline DETDETM & Contraction of two masculine
10547 determiners & Yes& ámbolos\\ \hline DETDETF & Contraction of two
10548 feminine determiners& Yes& ámbalas\\ \hline PRNPRN & Contraction of
10549 two tonic pronouns & Yes& esoutra\\ \hline PRNPRN & Contraction of two
10550 proclitic pronouns & Yes& chas\\ \hline CNJCDET & Contraction of
10551 co-ordinating conjunction and & & \\ & determiner & Yes& maila\\
10552 \hline CNJSUB & Contraction of subordinating conjunction and & &
10553 \\ & determiner & Yes& cás\\ \hline \hline \multicolumn{4}{l}{}\\
10554 %p{0.50\textwidth}
10555 \end{longtable}
10556 \end{footnotesize}
10557
10558
10559 \newpage
10560 \chapter{Abbreviations used in the text}
10561 \label{se:apendiceabrev}
10562 \begin{description}
10563 \item[ANSI] American National Standards Institute; when used
10564 informally in the expression \emph{ANSI text}, it refers to text
10565 encoded in any of the one-byte-per-character encodings defined in
10566 the ISO-8859 standard \cite{Unicode}.
10567 \item[ca] ISO 639 two-letter code\footnote{See
10568   \texttt{\url{http://www.w3.org/WAI/ER/IG/ert/iso639.htm}}} for
10569   Catalan
10570 \item[DTD] Document type definition in XML
10571 \item[es] ISO 639 two-letter code for Spanish
10572 \item[eu] ISO 639 two-letter code for Basque
10573 \item[LF] Lexical form (see page~\pageref{pg:FSFL})
10574 \item[TLLF] Target language lexical form
10575 \item[SLLF] Source language lexical form
10576 \item[SF] Surface form (see page~\pageref{pg:FSFL})
10577 \item[gl] ISO 639 two-letter code for Galician
10578 \item[pt] ISO 639 two-letter code for Portuguese
10579 \item[HTML] Hypertext markup language
10580 \item[TL] Target language
10581 \item[SL] Source language
10582 \item[RTF] Rich text format
10583 \item[MT] Machine translation
10584 \item[XML] Extensible markup language
10585 \item[POS] Part of speech
10586 \nota{order alphabetically}
10587 \end{description}
10588
10589 \newpage \nota{Add the EAMT 2005 paper and cite it}
10590 \bibliography{documentation} \bibliographystyle{plain}
10591 \addcontentsline{toc}{chapter}{Bibliography}
10592 \end{document}
10593
10594