apertium-newpair-howto/Apertium_New_Language_Pair_HOWTO.html

   1 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
   2     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
   3 <html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
   4 <head>
   5 <meta name="generator" content=
   6 "HTML Tidy for MkLinux (vers 1 September 2005), see www.w3.org" />
   7 <meta name="generator" content="Bluefish 1.0.7" />
   8 <meta http-equiv="Content-Type" content=
   9 "text/html; charset=utf-8" />
  10 <title>Apertium New Language Pair HOWTO</title>
  11
  12 <style type="text/css">
  13 /*<![CDATA[*/
  14 .comment { background-color : yellow ; margin-left : 10em}
  15 /*]]>*/
  16 </style>
  17 </head>
  18 <body>
  19 <h2>Apertium New Language Pair HOWTO</h2>
  20 <p>This HOWTO document will describe how to start a new language
  21 pair for the Apertium machine translation system from scratch.</p>
<p>It does not assume any knowledge of linguistics or machine
translation beyond being able to distinguish nouns from
verbs (and prepositions, etc.).</p>
  25 <div class="comment">In one of my papers I have: "As a result of
  26 this, and of the intuitive approach to machine translation used in
  27 Apertium, the amount of linguistic knowledge necessary about the
  28 source and target language to build data for Apertium is kept to a
  29 minimum, and it may be easily learned on top of basic high-school
  30 grammar skills such as: morphological analysis of words:
  31 parts-of-speech or lexical categories (noun, verb, preposition,
  32 etc.) and basic morphology (number, gender, case, person, etc.;
  33 agreement (such as gender and number agreement between nouns and
  34 their modifiers: adjectives, determiners, etc.); main local
  35 structural differences between the source and target language:
  36 position of adjectives with respect to nouns (e.g, adjective after
  37 noun in Spanish, before noun in English), prepositional regime,
  38 etc." You may want to integrate this in the text somehow.</div>
  39 <h3>Authors</h3>
  40 <ul>
  41 <li>Francis Tyers</li>
  42 </ul>
  43 <h3>Licence</h3>
  44 <p>Copyright (c) 2007 Francis Tyers<br />
  45 Permission is granted to copy, distribute and/or modify this
  46 document under the terms of the GNU Free Documentation License,
  47 Version 1.2 or any later version published by the Free Software
  48 Foundation; with no Invariant Sections, no Front-Cover Texts, and
  49 no Back-Cover Texts. A copy of the license is included in the
  50 section entitled "<a href=
  51 "http://www.gnu.org/licenses/fdl.html">GNU Free Documentation
  52 License</a>".</p>
<p>Any extracts of code are released under the GNU GPL, Version
2.0.</p>
  55 <h3>Introduction</h3>
<p>Apertium is, as you've probably realised by now, a machine
translation system. Well, not quite: it's a machine translation
platform. It provides an engine and toolbox which allow you to
build your own machine translation systems. The <em>only</em> thing
you need to do is write the data. On a basic level, the data
consists of three dictionaries and a few rules (to deal with word
re-ordering and other grammatical stuff).</p>
<p>For a more detailed introduction to how it all works, there
are some excellent papers on the project's website, <a href=
"http://apertium.sourceforge.net/">apertium.sourceforge.net</a>.</p>
  66 <h3>You will need</h3>
  67 <ul>
  68 <li><code>lttoolbox</code></li>
  69 <li><code>libxml</code> utils (<code>xmllint</code> etc.)</li>
  70 <li><code>apertium</code></li>
  71 <li>a text editor (or a specialized XML editor if you prefer
  72 to)</li>
  73 </ul>
<p>This document will not describe how to install these packages.
For more information on that, please see the documentation section
of the Apertium website.</p>
  77 <h3>What does a language pair consist of?</h3>
<p>The Apertium machine translation system is of the
shallow-transfer type, which basically means it works on
dictionaries and shallow transfer rules. Shallow transfer
is distinguished from "deep transfer" in that it doesn't do
full syntactic parsing; the rules are typically operations on
groups of lexical units, rather than operations on parse trees.
On a basic level, there are three main dictionaries:</p>
  85 <!--
  86 <div class="comment">Perhaps it would be a good idea to try to
  87 explain what "deep transfer" would mean (full syntactic parsing with rules specifying
  88 operations on parse [sub]trees, etc.)
  89 so that the readers get an idea of what "shallow" would mean.</div>
  90
  91   :: I added something above, but it is difficult (for me) to
  92      explain without using terminology that the readers would
  93      be unfamiliar with. If it can be refined great, if not
  94      I'm not sure... ~fran
  95 -->
  96
<ul>
<li><b>The morphological dictionary for language <em>xx</em></b>:
  98 this contains the rules of how words in language <em>xx</em> are
  99 inflected. In our example this will be called:
 100 <code>apertium-sh-en.sh.dix</code></li>
 101 <li><b>The morphological dictionary for language <em>yy</em></b>:
 102 this contains the rules of how words in language <em>yy</em> are
 103 inflected. In our example this will be called:
 104 <code>apertium-sh-en.en.dix</code></li>
 105 <li><b>Bilingual dictionary</b>: contains correspondences between
 106 words and symbols in the two languages. In our example this will be
 107 called: <code>apertium-sh-en.sh-en.dix</code></li>
 108 </ul>
<p>In a translation pair, both languages can be either source or
target for translation; these are relative terms.</p>
 111 <!--
 112 <div class="comment">Not that <em>xx</em> and <em>yy</em> are being used, why cannot
 113 we use these instead of sh and en (for the initial examples discussion)? Also,
 114 I would advocate for using <em>source</em> and <em>target</em> as <em>relative</em>
 115 terms; this was the initial motivation for using <em>xx</em> and <em>yy</em>.</div>
 116
 117   :: About the filenames: I wanted to give a concrete example. Regarding
 118      the source/target, you're right, I've removed the 'typically we will
 119      use' part. ~fran
 120 -->
 121
<p>There are also two files for transfer rules. These are the rules
which govern how words are re-ordered in sentences, e.g. <cite>chat
noir</cite> -&gt; <cite>cat black</cite> -&gt; <cite>black
cat</cite>. They also govern agreement of gender, number etc. The
rules can also be used to insert or delete lexical items, as will
be described later. These files are:</p>
 128 <ul>
 129 <li><b>language <em>xx</em> to language <em>yy</em> transfer
 130 rules</b>: this file contains rules for how language <em>xx</em>
 131 will be changed into language <em>yy</em>. In our example this will
 132 be: <code>apertium-sh-en.trules-sh-en.xml</code></li>
<li><b>language <em>yy</em> to language <em>xx</em> transfer
 134 rules</b>: this file contains rules for how language <em>yy</em>
 135 will be changed into language <em>xx</em>. In our example this will
 136 be: <code>apertium-sh-en.trules-en-sh.xml</code></li>
 137 </ul>
 138 <!--
 139 <div class="comment">Perhaps one should note that this
 140 is Apertium "level 1".</div>
 141
 142    :: I need to note the version, but I think I'll re-do the
 143       examples etc. from an Apertium 2 point of view. I remember
 144       using HOWTOs before and being annoyed that they were out of
 145       date! ~fran
 146 -->
 147
 148 <p>Many of the language pairs currently available have other files,
 149 but we won't cover them here. These files are the only ones
 150 required to generate a functional system.</p>
 151 <h3>Language pair</h3>
<p>As the file names may have suggested, this HOWTO will use
the example of translating Serbo-Croatian to English to explain how
to create a basic system. This is not an ideal pair, as the system
works better for more closely related languages, and furthermore it
does not currently support the full Serbo-Croatian alphabet, but
that shouldn't present a problem for the simple examples we'll have
here.</p>
 159 <!--
 160 <div class="comment">We can remove the sentence "As may have been alluded..." if we use <em>xx</em>
 161 and <em>yy</em> for the examples up to here.</div>
 162
 163     :: I reckon its alright, and good to give concrete examples,
 164        although this may get removed when i do a re-write ~fran
 165 -->
 166
 167 <h3>A brief note on terms</h3>
<p>There are a number of terms that will need to be understood before
we continue.</p>
 170 <p>The first is <em>lemma</em>. A lemma is the citation form of a
 171 word. It is the word stripped of any grammatical information. For
 172 example, the lemma of the word <em>cats</em> is <em>cat</em>. In
 173 English nouns this will typically be the singular form of the word
 174 in question. For verbs, the lemma is the infinitive stripped of
 175 <em>to</em>, e.g. the lemma of <em>was</em> would be
 176 <em>be</em>.</p>
 177 <p>The second is <em>symbol</em>. In the context of the Apertium
 178 system, <em>symbol</em> refers to a grammatical label. The word <cite>cats</cite> is
 179 a plural noun, therefore it will have the <em>noun</em> symbol and the
 180 <em>plural</em> symbol. In the input and output of Apertium modules these are
 181 typically given between angle brackets, as follows:</p>
 182 <ul>
 183 <li><code>&lt;n&gt;</code> for noun.</li>
<li><code>&lt;pl&gt;</code> for plural.</li>
 185 </ul>
 186 <!--
 187 <div class="comment">Make it clear that symbols appear in angle brackets in
 188 the input and the output of modules, but not in dictionaries?</div>
 189
 190     :: Good point, done. Feel free to rephrase. ~fran
 191 -->
 192
<p>Other examples of symbols are <code>&lt;sg&gt;</code> singular, <code>&lt;p1&gt;</code>
first person, <code>&lt;pri&gt;</code> present indicative, etc. When written in
 195 angle brackets, the symbols may also be referred to as <em>tags</em>. It is worth
 196 noting that in many of the currently available language pairs the symbol definitions
 197 are acronyms or contractions of words in Catalan. For example, <em>vbhaver</em>
 198 &mdash; from <em>vb</em> (verb) and <em>haver</em> ("to have" in Catalan).
 199 Symbols are defined in <code>&lt;sdef&gt;</code> tags and used in <code>&lt;s&gt;</code>
 200 tags. </p>
 201 <!--
 202 <div class="comment">Should we warn the reader that symbol names in many of the available language
 203 pair packages are in Catalan?</div>
 204
 205     :: Good idea, done. Feel free to rephrase. ~fran
 206 -->
 207
<p>The third term is <em>paradigm</em>. In the context of the
Apertium system, paradigm refers to an example of how a particular
group of words inflects. In the morphological dictionary, lemmas
(see above) are linked to paradigms, which allows us to describe how
a given lemma inflects without having to write out all of the
endings.</p>
 214 <p>An example of the utility of this is, if we wanted to store the
 215 two adjectives <em>happy</em> and <em>lazy</em>, instead of storing
 216 two lots of the same thing:</p>
 217 <ul>
 218 <li><em>happy</em>, <em>happ</em> (<em>y</em>, <em>ier</em>,
 219 <em>iest</em>)</li>
 220 <li><em>lazy</em>, <em>laz</em> (<em>y</em>, <em>ier</em>,
 221 <em>iest</em>)</li>
 222 </ul>
 223 <p>We can simply store one, and then say "<em>lazy</em>, inflects
 224 like <em>happy</em>", or indeed "<em>shy</em> inflects like
 225 <em>happy</em>", "<em>naughty</em> inflects like <em>happy</em>",
 226 "<em>friendly</em> inflects like <em>happy</em>" etc. In this
 227 example, happy would be the paradigm, the model for how the others
 228 inflect. The precise description of how this is defined will be
 229 explained shortly. Paradigms are defined in
 230 <code>&lt;pardef&gt;</code> tags, and used in
 231 <code>&lt;par&gt;</code> tags.</p>
 232 <h3>Getting started</h3>
 233 <h4>Monolingual dictionaries</h4>
<p>Let's start by making our first source language dictionary. The
 235 dictionary is an XML file. Fire up your text editor and type the
 236 following:</p>
 237 <p><tt>&lt;?xml version="1.0" encoding="ISO-8859-1"?&gt;<br />
 238 &lt;dictionary&gt;<br />
 239 <br />
 240 &lt;/dictionary&gt;</tt></p>
 241 <p>Save the file as <code>apertium-sh-en.sh.dix</code> with an
 242 ISO-8859-1 encoding. A short note on encoding: currently (as of
 243 April 2007), Apertium only supports the ISO-8859-1 single byte
 244 encoding. There is work ongoing to port it to Unicode (indeed an
 245 experimental version of <code>lttoolbox</code> with UTF-8 support
 246 is available from the SVN repository on the Apertium project
 247 site).</p>
<p>Note: It is important to have your locale set up correctly when
writing/reading files. You can find out your current locale setting
by running <code>echo $LANG</code> from a shell.</p>
<p>So, the file so far defines that we want to start a dictionary.
In order for it to be useful, we need to add some more entries. The
first is an alphabet, which defines the set of letters that may be
used in the dictionary. For Serbo-Croatian it would normally look
something like the following, containing all the letters of the
Serbo-Croatian alphabet:</p>
 257 <p>
 258 <tt>&lt;alphabet&gt;ABCČĆDDžĐEFGHIJKLLjMNNjOPRSŠTUVZŽabcčćddžđefghijklljmnnjoprsštuvzž&lt;/alphabet&gt;</tt></p>
 259 <p>However in our example, it will look like this:</p>
 260 <p>
 261 <tt>&lt;alphabet&gt;ABCDEFGHIJKLMNOPRSTUVZabcdefghijklmnoprstuvz&lt;/alphabet&gt;</tt></p>
 262 <p>The reason for this is that, as mentioned above,
 263 <code>lttoolbox</code> requires ISO-8859-1 encoding, and
 264 <em>Č</em>, <em>Ć</em>, <em>Dž</em>, <em>Đ</em>, <em>Lj</em>,
<em>Nj</em>, <em>Š</em>, and <em>Ž</em> (along with their
 266 minuscule forms) are not found in this encoding. Some languages
 267 have got round this by choosing other characters from ISO-8859-1 to
 268 represent the missing letters, and then transliterating. For
 269 example, using the character 'ç' (c with cedilla) to represent 'ć'
 270 (c with acute accent), or using 'ð' (eth) to represent 'đ' (d with
 271 stroke). We will not be using this method, although an example of
 272 its use may be found in the Romanian-Spanish translation pair.</p>
 273 <!--
 274 <div class="comment">I don't think <em>lj</em> and <em>nj</em> pose any problem, since they are not
 275 special letters but digraphs. Are you really going to use these transliterations?</div>
 276
 277   :: lj and nj can be written as both digraphs (or, more 'properly' as
 278      paired characters, like æ - note that we won't use them  or the
 279      transliterations. ~fran
 280 -->
 281
 282 <p>Place the alphabet below the <tt>&lt;dictionary&gt;</tt>
 283 tag.</p>
<p>Next we need to define some symbols. Let's start off with the
 285 simple stuff, noun (<code>n</code>) in singular (<code>sg</code>)
 286 and plural (<code>pl</code>).</p>
 287 <p><tt>&lt;sdefs&gt;<br />
 288 &nbsp;&nbsp; &lt;sdef n="n"/&gt;<br />
 289 &nbsp;&nbsp; &lt;sdef n="sg"/&gt;<br />
 290 &nbsp;&nbsp; &lt;sdef n="pl"/&gt;<br />
 291 &lt;/sdefs&gt;<br /></tt></p>
<p>The symbol names do not have to be so short; in fact they could
be written out in full, but as you'll be typing them a lot, it
makes sense to abbreviate.</p>
<p>Unfortunately, it isn't quite so simple: nouns in Serbo-Croatian
inflect for more than just number; they also inflect for gender and
case. However, we'll assume for the purposes of this example that
 298 the noun is masculine and in the nominative case (a full example
 299 may be found at the end of this document).</p>
 300 <p>Next thing is to define a section for the paradigms,</p>
 301 <p><tt>&lt;pardefs&gt;<br />
 302 <br />
 303 &lt;/pardefs&gt;<br /></tt></p>
 304 <p>and a dictionary section:</p>
 305 <p><tt>&lt;section id="main" type="standard"&gt;<br />
 306 <br />
 307 &lt;/section&gt;<br /></tt></p>
 308 <p>
There are two types of sections. The first is a <code>standard</code>
section, which contains words, enclitics etc. The second type is an
<code>inconditional</code> section, which typically contains
punctuation etc. We don't have an inconditional section here, although
one will be demonstrated later.
 314 </p>
 315 <!--
 316 <div class="comment">Perhaps we should say that each <code>&lt;section&gt;</code> is a dictionary section withcertain properties; for instance, a <code>standard</code> section contains words that will
 317 only be segmented if the next character is out of the alphabet, whereas a <code>unconditional</code> section
 318 contains words that will be segmented regardless of the following character, etc. What is
 319 called "a section for mapping lemmas to paradigms" is simply a "dictionary section".</div>
 320
 321   :: we should include an example of 'splitting up' / i'm not
 322      quite sure how this works. ~fran
 323
 324 -->
 325 <p>So, our file should now look something like:</p>
 326 <p><tt>&lt;?xml version="1.0" encoding="ISO-8859-1"?&gt;<br />
 327 &lt;dictionary&gt;<br />
 328 &nbsp;&nbsp; &lt;sdefs&gt;<br />
 329 &nbsp;&nbsp;&nbsp;&nbsp; &lt;sdef n="n"/&gt;<br />
 330 &nbsp;&nbsp;&nbsp;&nbsp; &lt;sdef n="sg"/&gt;<br />
 331 &nbsp;&nbsp;&nbsp;&nbsp; &lt;sdef n="pl"/&gt;<br />
 332 &nbsp;&nbsp; &lt;/sdefs&gt;<br />
 333 &nbsp;&nbsp; &lt;pardefs&gt;<br />
 334 <br />
 335 &nbsp;&nbsp; &lt;/pardefs&gt;<br />
 336 &nbsp;&nbsp; &lt;section id="main" type="standard"&gt;<br />
 337 <br />
 338 &nbsp;&nbsp; &lt;/section&gt;<br />
 339 &lt;/dictionary&gt;</tt></p>
 340 <p>Now we've got the skeleton in place, we can start by adding a
 341 noun. The noun in question will be 'gramofon' (which means
 342 'gramophone' or 'record player').</p>
 343 <p>The first thing we need to do, as we have no prior paradigms, is
 344 to define a paradigm.</p>
 345 <p>Remember we're assuming masculine gender and nominative case.
 346 The singular form of the noun is 'gramofon', and the plural is
 347 'gramofoni'. So:</p>
 348 <p><tt>&lt;pardef n="gramofon__n"&gt;<br />
 349 &nbsp;&nbsp; &lt;e&gt;<br />
 350 &nbsp;&nbsp;&nbsp;&nbsp; &lt;p&gt;<br />
 351 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;l/&gt;<br />
 352 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;r&gt;&lt;s n="n"/&gt;&lt;s
 353 n="sg"/&gt;&lt;/r&gt;<br />
 354 &nbsp;&nbsp;&nbsp;&nbsp; &lt;/p&gt;<br />
 355 &nbsp;&nbsp; &lt;/e&gt;<br />
 356 &nbsp;&nbsp; &lt;e&gt;<br />
 357 &nbsp;&nbsp;&nbsp;&nbsp; &lt;p&gt;<br />
 358 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;l&gt;i&lt;/l&gt;<br />
 359 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;r&gt;&lt;s n="n"/&gt;&lt;s
 360 n="pl"/&gt;&lt;/r&gt;<br />
 361 &nbsp;&nbsp;&nbsp;&nbsp; &lt;/p&gt;<br />
 362 &nbsp;&nbsp; &lt;/e&gt;<br />
 363 &lt;/pardef&gt;</tt></p>
 364 <p>Note: the '&lt;l/&gt;' (equivalent to &lt;l&gt;&lt;/l&gt;)
 365 denotes that there is no extra material to be added to the stem for
 366 the singular.</p>
 367 <p>This may seem like a rather verbose way of describing it, but
 368 there are reasons for it and it quickly becomes second nature.
 369 You're probably wondering what the &lt;e&gt;, &lt;p&gt;, &lt;l&gt;
 370 and &lt;r&gt; stand for. Well,</p>
 371 <ul>
 372 <li><code>e</code>, is for <em>entry</em>.</li>
 373 <li><code>p</code>, is for <em>pair</em>.</li>
 374 <li><code>l</code>, is for <em>left</em>.</li>
 375 <li><code>r</code>, is for <em>right</em>.</li>
 376 </ul>
 377 <p>Why <em>left</em> and <em>right</em>? Well, the morphological
 378 dictionaries will later be compiled into finite state machines.
 379 Compiling them left to right produces analyses from words, and from
 380 right to left produces words from analyses. For example:</p>
 381 <ul>
 382 <li><code>gramofoni</code> (left to right)
 383 <code>gramofon&lt;n&gt;&lt;pl&gt;</code> (analysis)</li>
 384 <li><code>gramofon&lt;n&gt;&lt;pl&gt;</code> (right to left)
 385 <code>gramofoni</code> (generation)</li>
 386 </ul>
 387 <p>Now we've defined a paradigm, we need to link it to its lemma,
 388 <em>gramofon</em>. We put this in the <code>section</code> that
 389 we've defined.</p>
 390 <p>The entry to put in will look like:</p>
 391 <p><tt>&lt;e lm="gramofon"&gt;&lt;i&gt;gramofon&lt;/i&gt;&lt;par
 392 n="gramofon__n"/&gt;&lt;/e&gt;</tt></p>
 393 <p>A quick run down on the abbreviations:</p>
 394 <ul>
 395 <li><code>lm</code>, is for <em>lemma</em>.</li>
 396 <li><code>i</code>, is for <em>identity</em> (the left and the
 397 right are the same).</li>
 398 <li><code>par</code>, is for <em>paradigm</em>.</li>
 399 </ul>
<p>This entry states the lemma of the word, <code>gramofon</code>,
the root, <code>gramofon</code>, and the paradigm with which it
inflects, <code>gramofon__n</code>. The difference between the lemma
and the root is that the lemma is the citation form of the word,
while the root is the substring of the lemma to which the endings
defined in the paradigm are added. This will become clearer later
when we show an entry where the two are different.</p>
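<p>As a preview of such an entry, here is a minimal sketch that reuses the
adjectives <em>happy</em> and <em>lazy</em> from the note on terms. The symbol
names <code>adj</code>, <code>comp</code> and <code>sup</code> and the paradigm
name <code>happ/y__adj</code> are assumptions made purely for illustration
(they would also need their own <code>&lt;sdef&gt;</code> entries); they are
not part of the Serbo-Croatian example.</p>

```xml
<!-- Hypothetical paradigm: adj, comp and sup are assumed symbol names. -->
<pardef n="happ/y__adj">
  <!-- happ + y    gives happy<adj> -->
  <e><p><l>y</l><r>y<s n="adj"/></r></p></e>
  <!-- happ + ier  gives happy<adj><comp> -->
  <e><p><l>ier</l><r>y<s n="adj"/><s n="comp"/></r></p></e>
  <!-- happ + iest gives happy<adj><sup> -->
  <e><p><l>iest</l><r>y<s n="adj"/><s n="sup"/></r></p></e>
</pardef>

<!-- The lemma (lm) is the full word; the root inside <i> is shorter. -->
<e lm="happy"><i>happ</i><par n="happ/y__adj"/></e>
<e lm="lazy"><i>laz</i><par n="happ/y__adj"/></e>
```

<p>Here the lemma is <em>happy</em> but the root is <em>happ</em>. The
<code>y</code> on the right-hand side restores the lemma in the analysis, so
under these assumptions <em>lazier</em> should be analysed as
<code>lazy&lt;adj&gt;&lt;comp&gt;</code>.</p>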
 407 <p>We're now ready to test the dictionary. Save it, and then return
 408 to the shell. We first need to compile it (with
 409 <code>lt-comp</code>), then we can test it (with
 410 <code>lt-proc</code>).</p>
 411 <pre>
 412 $ lt-comp lr apertium-sh-en.sh.dix sh-en.automorf.bin
 413 </pre>
<p>This should produce the output:</p>
 415 <pre>
 416 main@standard 12 12
 417 </pre>
 418 <p>As we are compiling it left to right, we're producing an
<em>analyser</em>. Let's make a <em>generator</em> too.</p>
 420 <pre>
 421 $ lt-comp rl apertium-sh-en.sh.dix sh-en.autogen.bin
 422 </pre>
 423 <p>At this stage, the command should produce the same output.</p>
 424 <!--
 425 <div class="comment">Not necessarily true for the examples below in which you
 426 equate <em>gramofoni</em> (nominative) and <em>gramophone</em> (accusative); when generating, you should decide which one...
 427 </div>
 428
 429   :: at the moment we're not dealing with generating serbo-croatian, and
 430      the 'output' refers to the 'main@standard 12 12' part. I've clarified this
 431      somewhat ~fran
 432 -->
 433 <p>We can now test these. Run <code>lt-proc</code> on the
 434 analyser.</p>
 435 <pre>
 436 $ lt-proc sh-en.automorf.bin
 437 </pre>
 438 <p>Now try it out, type in <code>gramofoni</code>
 439 (<em>gramophones</em>), and see the output:</p>
 440 <pre>
 441 ^gramofoni/gramofon&lt;n&gt;&lt;pl&gt;$
 442 </pre>
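<p>For reference, the fragments introduced so far can be assembled into a
single file. The following is a sketch of what
<code>apertium-sh-en.sh.dix</code> might look like at this point; it uses only
the pieces shown above, placed in the order in which they were introduced.</p>

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<dictionary>
  <!-- Reduced alphabet used in this example (ISO-8859-1 letters only) -->
  <alphabet>ABCDEFGHIJKLMNOPRSTUVZabcdefghijklmnoprstuvz</alphabet>

  <!-- Grammatical symbols: noun, singular, plural -->
  <sdefs>
    <sdef n="n"/>
    <sdef n="sg"/>
    <sdef n="pl"/>
  </sdefs>

  <!-- Paradigm: how gramofon inflects (masculine, nominative only) -->
  <pardefs>
    <pardef n="gramofon__n">
      <e><p><l/><r><s n="n"/><s n="sg"/></r></p></e>
      <e><p><l>i</l><r><s n="n"/><s n="pl"/></r></p></e>
    </pardef>
  </pardefs>

  <!-- Dictionary section: links the lemma to its paradigm -->
  <section id="main" type="standard">
    <e lm="gramofon"><i>gramofon</i><par n="gramofon__n"/></e>
  </section>
</dictionary>
```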
<p>Now, for the English dictionary, do the same thing, but
substitute the English word <em>gramophone</em> for <em>gramofon</em>, and
change the plural inflection. What if you want to use the more correct
term 'record player'? We'll explain how to do that later.</p>
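<p>A sketch of the resulting <code>apertium-sh-en.en.dix</code> is given here
as an illustration. The paradigm name <code>gramophone__n</code> and the full
English alphabet are assumptions (any consistent paradigm name would do); the
only substantive change from the Serbo-Croatian file is the plural ending,
<code>s</code> instead of <code>i</code>.</p>

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<dictionary>
  <!-- Assumed alphabet: the 26 letters of English -->
  <alphabet>ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz</alphabet>

  <sdefs>
    <sdef n="n"/>
    <sdef n="sg"/>
    <sdef n="pl"/>
  </sdefs>

  <pardefs>
    <!-- Assumed paradigm name, following the gramofon__n pattern -->
    <pardef n="gramophone__n">
      <!-- singular: nothing is added to the root -->
      <e><p><l/><r><s n="n"/><s n="sg"/></r></p></e>
      <!-- plural: add "s" -->
      <e><p><l>s</l><r><s n="n"/><s n="pl"/></r></p></e>
    </pardef>
  </pardefs>

  <section id="main" type="standard">
    <e lm="gramophone"><i>gramophone</i><par n="gramophone__n"/></e>
  </section>
</dictionary>
```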
 447 <!--
 448 <div class="comment">Check</div>
 449
 450   :: I added that they should change the plural inflection (i -> s),
 451      what this what the check referred to ? ~fran
 452 -->
 453 <p>You should now have two files in the directory:</p>
 454 <ul>
 455 <li><code>apertium-sh-en.sh.dix</code> which contains a (very)
 456 basic Serbo-Croatian morphological dictionary, and</li>
 457 <li><code>apertium-sh-en.en.dix</code> which contains a (very)
 458 basic English morphological dictionary.</li>
 459 </ul>
 460 <h4>Bilingual dictionary</h4>
<p>So we now have two morphological dictionaries. The next thing to
make is the bilingual dictionary. This describes mappings between
words. All dictionaries use the same format (which is specified in
the DTD, <code>dix.dtd</code>).</p>
 465 <p>Create a new file, <code>apertium-sh-en.sh-en.dix</code> and add
 466 the basic skeleton:</p>
 467 <p><tt>&lt;?xml version="1.0" encoding="ISO-8859-1"?&gt;<br />
 468 &lt;dictionary&gt;<br />
 469 &nbsp;&nbsp; &lt;alphabet/&gt;<br />
 470 &nbsp;&nbsp; &lt;sdefs&gt;<br />
 471 &nbsp;&nbsp;&nbsp;&nbsp; &lt;sdef n="n"/&gt;<br />
 472 &nbsp;&nbsp;&nbsp;&nbsp; &lt;sdef n="sg"/&gt;<br />
 473 &nbsp;&nbsp;&nbsp;&nbsp; &lt;sdef n="pl"/&gt;<br />
 474 &nbsp;&nbsp; &lt;/sdefs&gt;<br />
 475 <br />
 476 &nbsp;&nbsp; &lt;section id="main" type="standard"&gt;<br />
 477 <br />
 478 &nbsp;&nbsp; &lt;/section&gt;<br />
 479 &lt;/dictionary&gt;</tt></p>
 480 <p>Now we need to add an entry to <em>translate</em> between the
 481 two words. Something like:</p>
 482 <p><tt>&lt;e&gt;&lt;p&gt;&lt;l&gt;gramofon&lt;s
 483 n="n"/&gt;&lt;/l&gt;&lt;r&gt;gramophone&lt;s
 484 n="n"/&gt;&lt;/r&gt;&lt;/p&gt;&lt;/e&gt;</tt></p>
 485 <!--
 486 <div class="comment">Some of this could be done with
 487 &lt;i&gt;</div>
 488
 489   :: How do you mean? ~fran
 490 -->
<p>Because there are a lot of these entries, they're typically
written on one line to make the file easier to read. Again
with the '<code>l</code>' and '<code>r</code>', right? Well, we
compile it left to right to produce the Serbo-Croatian → English
dictionary, and right to left to produce the English →
Serbo-Croatian dictionary.</p>
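<p>Put together, <code>apertium-sh-en.sh-en.dix</code> might look like the
following sketch, which simply places the entry above inside the
<code>main</code> section of the skeleton.</p>

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<dictionary>
  <!-- The bilingual dictionary in this example has an empty alphabet -->
  <alphabet/>

  <sdefs>
    <sdef n="n"/>
    <sdef n="sg"/>
    <sdef n="pl"/>
  </sdefs>

  <section id="main" type="standard">
    <!-- left: Serbo-Croatian lemma + symbol; right: English lemma + symbol -->
    <e><p><l>gramofon<s n="n"/></l><r>gramophone<s n="n"/></r></p></e>
  </section>
</dictionary>
```

<p>Note that only the noun symbol appears in the entry; the number symbols
are not mentioned here and are dealt with by the transfer rules.</p>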
 497 <p>So, once this is done, run the following commands:</p>
 498 <pre>
 499 $ lt-comp lr apertium-sh-en.sh.dix sh-en.automorf.bin
 500 $ lt-comp rl apertium-sh-en.sh.dix sh-en.autogen.bin
 501
 502 $ lt-comp lr apertium-sh-en.en.dix en-sh.automorf.bin
 503 $ lt-comp rl apertium-sh-en.en.dix en-sh.autogen.bin
 504
 505 $ lt-comp lr apertium-sh-en.sh-en.dix sh-en.autobil.bin
 506 $ lt-comp rl apertium-sh-en.sh-en.dix en-sh.autobil.bin
 507 </pre>
<p>These generate the morphological analysers (<code>automorf</code>),
the morphological generators (<code>autogen</code>) and the word
lookups (<code>autobil</code>); the <em>bil</em> is for "bilingual".</p>
 511 <!--
 512 <div class="comment">These three names come from interNOSTRUM but I
 513 am not sure I like them much... unfortunately, they are used all
 514 over...</div>
 515
 516    :: Aye, they're a bit cryptic, but a lot of the stuff is,
 517       when you know what they mean, or where they come from
 518       it makes more sense. ~fran
 519 -->
 520 <h3>Transfer rules</h3>
 521 <p>So, now we have two morphological dictionaries, and a bilingual
 522 dictionary. All that we need now is a transfer rule for nouns.
 523 Transfer rule files have their own DTD (<code>transfer.dtd</code>)
 524 which can be found in the Apertium package. If you need to
 525 implement a rule it is often a good idea to look in the rule files
 526 of other language pairs first. Many rules can be recycled/reused
 527 between languages. For example the one described below would be
 528 useful for any null-subject language.</p>
 529 <p>Start out like all the others with a basic skeleton:</p>
 530 <p><tt>&lt;?xml version="1.0" encoding="ISO-8859-1"?&gt;<br />
 531 &lt;transfer&gt;<br />
 532 <br />
 533 &lt;/transfer&gt;<br /></tt></p>
 534 <p>At the moment, because we're ignoring case, we just need to make
 535 a rule that takes the grammatical symbols input and outputs them
 536 again.</p>
 537 <p>We first need to define categories and attributes. Categories
 538 and attributes both allow us to group grammatical symbols.
 539 Categories allow us to group symbols for the purposes of matching
 540 (for example '<code>n.*</code>' is all nouns). Attributes allow us
to group a set of symbols that can be chosen from. For example,
'<code>sg</code>' and '<code>pl</code>' may be grouped as an
attribute '<code>number</code>'.</p>
<p>Let's add the necessary sections:</p>
 545 <p><tt>&lt;section-def-cats&gt;<br />
 546 <br />
 547 &lt;/section-def-cats&gt;<br />
 548 &lt;section-def-attrs&gt;<br />
 549 <br />
 550 &lt;/section-def-attrs&gt;</tt></p>
<p>As we're only inflecting nouns in singular and plural, we
need to add a category for nouns, with an attribute of number.
Something like the following will suffice:</p>
 554 <p>Into <code>section-def-cats</code> add:</p>
 555 <p><tt>&lt;def-cat n="nom"&gt;<br />
 556 &nbsp;&nbsp; &lt;cat-item tags="n.*"/&gt;<br />
 557 &lt;/def-cat&gt;</tt></p>
<p>This catches all nouns (lemmas followed by &lt;n&gt; then
anything) and refers to them as "<code>nom</code>" (we'll see how
that's used later).</p>
 561 <!--
 562 <div class="comment">Catalan names again</div>
 563
 564     :: I think its fine for now, we note that the
 565        names are fairly arbitrary. ~fran
 566 -->
<p>Into the section <code>section-def-attrs</code>, add:</p>
 568 <p><tt>&lt;def-attr n="nbr"&gt;<br />
 569 &nbsp;&nbsp; &lt;attr-item tags="sg"/&gt;<br />
 570 &nbsp;&nbsp; &lt;attr-item tags="pl"/&gt;<br />
 571 &lt;/def-attr&gt;</tt></p>
 572 <p>and then</p>
 573 <p><tt>&lt;def-attr n="a_nom"&gt;<br />
 574 &nbsp;&nbsp; &lt;attr-item tags="n"/&gt;<br />
 575 &lt;/def-attr&gt;</tt></p>
 576 <p>The first defines the attribute <code>nbr</code> (number), which
 577 can be either singular (<code>sg</code>) or plural
 578 (<code>pl</code>).</p>
 579 <p>The second defines the attribute <code>a_nom</code> (attribute
 580 <em>noun</em>).</p>
 581 <p>Next we need to add a section for global variables:</p>
 582 <p><tt>&lt;section-def-vars&gt;<br />
 583 <br />
 584 &lt;/section-def-vars&gt;</tt></p>
 585 <p>These variables are used to store or transfer attributes
 586 between rules. We need only one for now,</p>
 587 <pre>
 588 &lt;def-var n="number"/&gt;
 589 </pre>
 590 <!--
 591 <div class="comment">Perhaps we should be more specific and say
 592 that these are <em>state</em> variables which may be used to
 593 propagate information computed in a certain application of a rule,
 594 to later applications of the same or other rules.</div>
 595
 596     :: Hmm, I think this is getting a bit complicated. Although
 597        it would be good to give a concrete example of where
 598        they are used. But I haven't used these yet. I'll dig
 599        around in the manual. ~fran
 600 -->
 601 <p>Finally, we need to add a rule, to take in the noun and then
 602 output it in the correct form. We'll need a rules section...</p>
 603 <p><tt>&lt;section-rules&gt;<br />
 604 <br />
 605 &lt;/section-rules&gt;</tt></p>
 606 <p>Changing the pace from the previous examples, I'll just paste
 607 this rule, then go through it, rather than the other way round.</p>
 608 <p><tt>&lt;rule&gt;<br />
 609 &nbsp;&nbsp; &lt;pattern&gt;<br />
 610 &nbsp;&nbsp;&nbsp;&nbsp; &lt;pattern-item n="nom"/&gt;<br />
 611 &nbsp;&nbsp; &lt;/pattern&gt;<br />
 612 &nbsp;&nbsp; &lt;action&gt;<br />
 613 &nbsp;&nbsp;&nbsp;&nbsp; &lt;out&gt;<br />
 614 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;lu&gt;<br />
 615 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;clip pos="1"
 616 side="tl" part="lem"/&gt;<br />
 617 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;clip pos="1"
 618 side="tl" part="a_nom"/&gt;<br />
 619 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;clip pos="1"
 620 side="tl" part="nbr"/&gt;<br />
 621 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;/lu&gt;<br />
 622 &nbsp;&nbsp;&nbsp;&nbsp; &lt;/out&gt;<br />
 623 &nbsp;&nbsp; &lt;/action&gt;<br />
 624 &lt;/rule&gt;</tt></p>
 625 <div class="comment">Many people are used to pattern–action
 626 languages such as AWK, perl or lex. How about drawing an analogy
 627 here?</div>
<p>The first tag is obvious: it defines a rule. The second tag,
<code>pattern</code>, basically says: "apply this rule if this
pattern is found". In this example the pattern consists of a single
noun (defined by the category item <code>nom</code>). Note that
patterns are matched longest-match first. So if you have three
rules, where the first catches "&lt;prn&gt;&lt;vblex&gt;&lt;n&gt;", the
second catches "&lt;prn&gt;&lt;vblex&gt;" and the third catches
"&lt;n&gt;", then on input that all three could match, the rule
executed will be the first.</p>
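<p>To make the longest-match behaviour concrete, the three rules might be
sketched as follows. The category names <code>prn</code> and
<code>verb</code> and their <code>cat-item</code> definitions are assumptions
for illustration, and the actions are left out, so this fragment is not meant
to be compiled as it stands. Each <code>pattern-item</code> matches one whole
lexical unit.</p>

```xml
<!-- Sketch only: "prn" and "verb" are assumed categories;
     "nom" is the noun category used in this HOWTO. -->
<section-def-cats>
  <def-cat n="prn"><cat-item tags="prn.*"/></def-cat>
  <def-cat n="verb"><cat-item tags="vblex.*"/></def-cat>
  <def-cat n="nom"><cat-item tags="n.*"/></def-cat>
</section-def-cats>

<section-rules>
  <!-- Rule 1: pronoun + verb + noun (three lexical units) -->
  <rule>
    <pattern>
      <pattern-item n="prn"/>
      <pattern-item n="verb"/>
      <pattern-item n="nom"/>
    </pattern>
    <action><!-- output omitted in this sketch --></action>
  </rule>

  <!-- Rule 2: pronoun + verb (two lexical units) -->
  <rule>
    <pattern>
      <pattern-item n="prn"/>
      <pattern-item n="verb"/>
    </pattern>
    <action><!-- output omitted in this sketch --></action>
  </rule>

  <!-- Rule 3: a single noun -->
  <rule>
    <pattern>
      <pattern-item n="nom"/>
    </pattern>
    <action><!-- output omitted in this sketch --></action>
  </rule>
</section-rules>
```

<p>Given a pronoun, a verb and a noun in sequence, rule 1 covers the longest
stretch of input and so is the one applied; rules 2 and 3 only come into play
when the longer pattern does not match.</p>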
 637 <!--
 638 <div class="comment">I would rather use an example in which rules
 639 match whole lexical units, because that is the way the transfer
 640 works. One cannot write patterns matching parts of lexical units,
 641 as far as I know</div>
 642
 643     :: Not sure I understand, could you re-write this part? ~fran
 644 -->
<p>For each pattern, there is an associated action, which produces
an associated output, <code>out</code>. The output is a lexical
unit (<code>lu</code>).</p>
 648 <p>The <code>clip</code> tag allows a user to select and manipulate
 649 attributes and parts of the source language
 650 (<code>side="sl"</code>), or target language
 651 (<code>side="tl"</code>) lexical item.</p>
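<p>As a small illustration of the <code>side</code> attribute (this
variant is an assumption for demonstration and is not used in the
rest of this HOWTO): if the first <code>clip</code> read from the
source side instead, the lemma would be expected to pass through
untranslated, while the tags would still come from the target
side.</p>
<pre>
&lt;lu&gt;
  &lt;!-- lemma taken from the source-language side of position 1 --&gt;
  &lt;clip pos="1" side="sl" part="lem"/&gt;
  &lt;!-- tags taken from the target-language side, as in the rule above --&gt;
  &lt;clip pos="1" side="tl" part="a_nom"/&gt;
  &lt;clip pos="1" side="tl" part="nbr"/&gt;
&lt;/lu&gt;
</pre>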
 652 <p>Let's compile it and test it. Transfer rules are compiled
 653 with:</p>
 654 <pre>
 655 $ apertium-preprocess-transfer apertium-sh-en.trules-sh-en.xml trules-sh-en.bin
 656 </pre>
 657 <p>Which will generate a trules-sh-en.bin file.</p>
 658 <p>Now we're ready to test our machine translation system. There is
 659 one crucial part missing, the part-of-speech (PoS) tagger, but that
 660 will be explained shortly. In the meantime we can test it as
 661 is:</p>
<p>First, let's analyse a word, <em>gramofoni</em>:</p>
 663 <pre>
 664 $ echo "gramofoni" | lt-proc sh-en.automorf.bin
 665 ^gramofon/gramofon&lt;n&gt;&lt;pl&gt;$
 666 </pre>
<p>Now, normally the part-of-speech tagger would step in here and
choose the right analysis whenever the analyser returns more than
one, but we don't have a tagger yet, so we can use this little Perl
script that will just output the first analysis retrieved.</p>
 671 <!--
 672 <div class="comment">I am not sure the reader would understand what
 673 <em>version</em> means here unless we have an example in which the
 674 morphological analyser derives more than one lexical form for a
 675 given source form... Perhaps we should explain that, and give an
 676 example...</div>
 677
 678     :: Good idea. I'll try and work an example in. ~fran
 679
 680 -->
 681 <pre>
 682 $ echo "gramofoni" | lt-proc sh-en.automorf.bin | \
  perl -ne 's,^([^/]*/)(.*)$,\^\2,; s,^(.*\$\s\^)[^/]+/(.*)$,\1\2,; print'
 684 ^gramofon&lt;n&gt;&lt;pl&gt;$
 685 </pre>
 686 <!--
 687 <div class="comment">We are writing the pipelines ourselves, but
 688 Apertium uses <code>modes.xml</code> file to control this; perhaps
 689 a later version of this howto can explain how to write
 690 <code>modes.xml</code> files... if you look at
 691 <code>apertium-en-ca</code>, you will see that the
 692 <code>modes.xml</code> file contains many modes, to use parts of
 693 the pipeline for diagnostics and testsing.</div>
 694
 695     :: I'll add something on modes.xml at the end. ~fran
 696
 697 -->
 698 <p>Now let's process that with the transfer rule:</p>
 699 <pre>
 700 $ echo "gramofoni" | lt-proc sh-en.automorf.bin | \
 701   perl -ne 's,^([^/]*/)(.*)$,\^\2,; s,^(.*\$\s\^)[^/]+/(.*)$,\1\2,; print' | \
 702   apertium-transfer apertium-sh-en.trules-sh-en.xml trules-sh-en.bin sh-en.autobil.bin
 703 </pre>
 704 <p>It will output:</p>
 705 <pre>
 706 ^gramophone&lt;n&gt;&lt;pl&gt;$^@
 707 </pre>
 708 <!--
 709 <div class="comment">What's the "^@"?</div>
 710
 711     :: I think that its because we don't have any
 712        punctuation specified. I'll check it and
 713        put a note in. ~fran
 714
 715 -->
 716 <ul>
 717 <li>'gramophone' is the target language (<code>side="tl"</code>)
 718 lemma (<code>lem</code>) at position 1 (<code>pos="1"</code>).</li>
 719 <li>'&lt;n&gt;' is the target language <code>a_nom</code> at
 720 position 1.</li>
 721 <li>'&lt;pl&gt;' is the target language attribute of
 722 <em>number</em> (<code>nbr</code>) at position 1.</li>
 723 </ul>
 724 <p>Try commenting out one of these clip statements, recompiling and
 725 seeing what happens.</p>
<p>Now that we have the output from the transfer, the only thing
that remains is to generate the target-language inflected forms.
For this, we use <code>lt-proc</code>, but in generation mode
(<code>-g</code>) rather than analysis mode.</p>
 730 <pre>
 731 $ echo "gramofoni" | lt-proc sh-en.automorf.bin | \
 732   perl -ne 's,^([^/]*/)(.*)$,\^\2,; s,^(.*\$\s\^)[^/]+/(.*)$,\1\2,; print' | \
 733   apertium-transfer apertium-sh-en.trules-sh-en.xml trules-sh-en.bin sh-en.autobil.bin | \
 734   lt-proc -g sh-en.autogen.bin
 735
 736 gramophones\@
 737 </pre>
 738 <!--
 739 <div class="comment">What is the "\@"?</div>
 740
 741     :: I think it is as a result of not having any
 742        punctuation defined. ~fran
 743 -->
<p>And c'est ça. You now have a machine translation system that
translates a Serbo-Croatian noun into an English noun. Obviously
this isn't very useful, but we'll get onto the more complex stuff
soon. Oh, and don't worry about the '@' symbol, I'll explain that
soon too.</p>
<p>Think of a few other words that inflect the same way as gramofon.
How about adding those? We don't need to add any paradigms, just
the entries in the main section of the monolingual and bilingual
dictionaries.</p>
 753 <h3>Bring on the verbs</h3>
<p>OK, so we have a system that translates nouns, but that's pretty
useless on its own: we want to translate verbs too, and even whole
sentences! How about we start with the verb <cite>to see</cite>? In
Serbo-Croatian this is <cite>videti</cite>. Serbo-Croatian is a
null-subject language, which means that it doesn't typically use
personal pronouns before the conjugated form of the verb. English
is not. So, for example, <cite>I see</cite> in English would be
translated as <cite>vidim</cite> in Serbo-Croatian.</p>
 762 <pre>
 763 *   Vidim
 764 *   see&lt;p1&gt;&lt;sg&gt;
 765 * I see
 766 </pre>
 767 <p>Note: &lt;p1&gt; denotes <em>first person</em></p>
<p>This will be important when we come to write the transfer rule
for verbs. Other examples of null-subject languages include
Spanish, Romanian and Polish. This also has the effect that while we
only need to add the verb in the Serbo-Croatian morphological
dictionary, we need to add both the verb and the personal pronouns
in the English morphological dictionary. We'll go through both of
these.</p>
 775 <p>The other forms of the verb <cite>videti</cite> are:
 776 <cite>vidiš</cite>, <cite>vidi</cite>, <cite>vidimo</cite>,
 777 <cite>vidite</cite>, and <cite>vide</cite>; which correspond to:
 778 <cite>you see</cite> (singular), <cite>he sees</cite>, <cite>we
 779 see</cite>, <cite>you see</cite> (plural), and <cite>they
 780 see</cite>.</p>
 781 <p>There are two forms of <cite>you see</cite>, one is plural and
 782 formal singular (<cite>vidite</cite>) and the other is singular and
 783 informal (<cite>vidiš</cite>).</p>
<p>We're going to try to translate the sentence <cite>Vidim
gramofoni</cite> into <cite>I see gramophones</cite>. In the
interests of space, we'll just add enough information to do the
translation and will leave filling out the paradigms (adding the
other conjugations of the verb) as an exercise for the reader.</p>
<p>The astute reader will have realised by this point that we can't
just translate <cite>vidim gramofoni</cite>, because it is not a
grammatically correct sentence in Serbo-Croatian. The correct
sentence would be <cite>vidim gramofone</cite>, as the noun takes
the accusative case. We'll have to add that form too. There is no
need to add the case information for now, though; we just add it as
another option for plural. So, in the noun paradigm, copy the
<code>&lt;e&gt;</code> block for '<code>i</code>' and change the
'<code>i</code>' to '<code>e</code>' in the copy.</p>
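<p>A minimal sketch of the copied block, assuming the existing
plural block in the noun paradigm pairs <code>&lt;l&gt;i&lt;/l&gt;</code>
with the <code>n</code> and <code>pl</code> symbols (which is what
the analysis <code>gramofon&lt;n&gt;&lt;pl&gt;</code> seen earlier
suggests):</p>
<pre>
&lt;!-- accusative plural ending, tagged as plain plural for now --&gt;
&lt;e&gt;
  &lt;p&gt;
    &lt;l&gt;e&lt;/l&gt;
    &lt;r&gt;&lt;s n="n"/&gt;&lt;s n="pl"/&gt;&lt;/r&gt;
  &lt;/p&gt;
&lt;/e&gt;
</pre>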
 798 <!--
 799 <div class="comment">Mikel's comments so far end here.</div>
 800 -->
<p>The first thing we need to do is add some more symbols. We need to
first add a symbol for 'verb', which we'll call "vblex" (this means
lexical verb, as opposed to modal verbs and other types). Verbs
have 'person' and 'tense' along with number, so let's add a couple
of those as well. We need to translate "I see", so for person we
should add "p1", or 'first person', and for tense "pri", or
'present indicative'.</p>
 808 <pre>
 809 &lt;sdef n="vblex"/&gt;
 810 &lt;sdef n="p1"/&gt;
 811 &lt;sdef n="pri"/&gt;
 812 </pre>
<p>After we've done this, as with the nouns, we add a
paradigm for the verb conjugation. The first line will be:</p>
 815 <p><tt>&lt;pardef n="vid/eti__vblex"&gt;</tt></p>
<p>The '/' marks where the stem ends: the endings (the parts between
the &lt;l&gt; &lt;/l&gt; tags) are attached to the part before
it.</p>
 818 <p>Then the inflection for first person singular:</p>
 819 <p><tt>&lt;e&gt;<br />
 820 &nbsp;&nbsp; &lt;p&gt;<br />
 821 &nbsp;&nbsp;&nbsp;&nbsp; &lt;l&gt;im&lt;/l&gt;<br />
 822 &nbsp;&nbsp;&nbsp;&nbsp; &lt;r&gt;eti&lt;s n="vblex"/&gt;&lt;s
 823 n="pri"/&gt;&lt;s n="p1"/&gt;&lt;s n="sg"/&gt;&lt;/r&gt;<br />
 824 &nbsp;&nbsp; &lt;/p&gt;<br />
 825 &lt;/e&gt;</tt></p>
<p>The 'im' denotes the ending (as in 'vidim'). It is necessary to
add 'eti' to the &lt;r&gt; section, because the entry in the main
section supplies only the stem 'vid', so the 'eti' has to be put back
to give the lemma 'videti'. The rest is fairly straightforward: 'vblex' is
lexical verb, 'pri' is present indicative tense, 'p1' is first
person and 'sg' is singular. We can also add the plural, which will
be the same, except with 'imo' instead of 'im' and 'pl' instead of
'sg'.</p>
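<p>Spelled out, the first person plural entry would look something
like this; it goes in the same <code>pardef</code>, and simply
follows the substitutions just described:</p>
<pre>
&lt;!-- first person plural: vid + imo --&gt;
&lt;e&gt;
  &lt;p&gt;
    &lt;l&gt;imo&lt;/l&gt;
    &lt;r&gt;eti&lt;s n="vblex"/&gt;&lt;s n="pri"/&gt;&lt;s n="p1"/&gt;&lt;s n="pl"/&gt;&lt;/r&gt;
  &lt;/p&gt;
&lt;/e&gt;
</pre>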
<p>After this we need to add a lemma-to-paradigm mapping to the main
section:</p>
 835 <p><tt>&lt;e lm="videti"&gt;&lt;i&gt;vid&lt;/i&gt;&lt;par
 836 n="vid/eti__vblex"/&gt;&lt;/e&gt;</tt></p>
 837 <p>Note: the content of &lt;i&gt; &lt;/i&gt; is the root, not the
 838 lemma.</p>
<p>That's the work on the Serbo-Croatian dictionary done for now.
Let's compile it, then test it.</p>
 841 <pre>
 842 $ lt-comp lr apertium-sh-en.sh.dix sh-en.automorf.bin
 843 main@standard 23 25
 844 $ echo "vidim" | lt-proc sh-en.automorf.bin
 845 ^vidim/videti&lt;vblex&gt;&lt;pri&gt;&lt;p1&gt;&lt;sg&gt;$
 846 $ echo "vidimo" | lt-proc sh-en.automorf.bin
 847 ^vidimo/videti&lt;vblex&gt;&lt;pri&gt;&lt;p1&gt;&lt;pl&gt;$
 848 </pre>
 849 <p>Ok, so now we do the same for the English dictionary (remember
 850 to add the same symbol definitions here as you added to the
 851 Serbo-Croatian one).</p>
 852 <p>The paradigm is:</p>
 853 <p><tt>&lt;pardef n="s/ee__vblex"&gt;</tt></p>
<p>because the past tense is 'saw', so only the 's' is shared by
every form. Now, we could add separate entries for first and second
person, but they are the same form. In fact, all present-tense forms
of the verb 'to see' (except third person singular) are 'see'. So
instead we make one entry for 'see' and give it only the 'pri'
symbol.</p>
 859 <p><tt>&lt;e&gt;<br />
 860 &nbsp;&nbsp; &lt;p&gt;<br />
 861 &nbsp;&nbsp;&nbsp;&nbsp; &lt;l&gt;ee&lt;/l&gt;<br />
 862 &nbsp;&nbsp;&nbsp;&nbsp; &lt;r&gt;ee&lt;s n="vblex"/&gt;&lt;s
 863 n="pri"/&gt;&lt;/r&gt;<br />
 864 &nbsp;&nbsp; &lt;/p&gt;<br />
 865 &lt;/e&gt;</tt></p>
 866 <p>and as always, an entry in the main section:</p>
 867 <p><tt>&lt;e lm="see"&gt;&lt;i&gt;s&lt;/i&gt;&lt;par
 868 n="s/ee__vblex"/&gt;&lt;/e&gt;</tt></p>
<p>Then let's save, recompile and test:</p>
 870 <pre>
 871 $ lt-comp lr apertium-sh-en.en.dix en-sh.automorf.bin
 872 main@standard 18 19
 873
 874 $ echo "see" | lt-proc en-sh.automorf.bin
 875 ^see/see&lt;vblex&gt;&lt;pri&gt;$
 876 </pre>
 877 <p>Now for the obligatory entry in the bilingual dictionary:</p>
 878 <pre>
 879 &lt;e&gt;&lt;p&gt;&lt;l&gt;videti&lt;s n="vblex"/&gt;&lt;/l&gt;&lt;r&gt;see&lt;s n="vblex"/&gt;&lt;/r&gt;&lt;/p&gt;&lt;/e&gt;
 880 </pre>
 881 <p>(again, don't forget to add the sdefs from earlier)</p>
 882 <p>And recompile:</p>
 883 <pre>
 884 $ lt-comp lr apertium-sh-en.sh-en.dix sh-en.autobil.bin
 885 main@standard 18 18
 886 $ lt-comp rl apertium-sh-en.sh-en.dix en-sh.autobil.bin
 887 main@standard 18 18
 888 </pre>
 889 <p>Now to test:</p>
 890 <pre>
 891 $ echo "vidim" | lt-proc sh-en.automorf.bin | \
 892   perl -ne 's,^([^/]*/)(.*)$,\^\2,; s,^(.*\$\s\^)[^/]+/(.*)$,\1\2,; print' | \
 893   apertium-transfer apertium-sh-en.trules-sh-en.xml trules-sh-en.bin sh-en.autobil.bin
 894
 895 ^see&lt;vblex&gt;&lt;pri&gt;&lt;p1&gt;&lt;sg&gt;$^@
 896 </pre>
 897 <p>We get the analysis passed through correctly, but when we try
 898 and generate a surface form from this, we get a '#', like
 899 below:</p>
 900 <pre>
 901 $ echo "vidim" | lt-proc sh-en.automorf.bin | \
 902   perl -ne 's,^([^/]*/)(.*)$,\^\2,; s,^(.*\$\s\^)[^/]+/(.*)$,\1\2,; print' | \
 903   apertium-transfer apertium-sh-en.trules-sh-en.xml trules-sh-en.bin sh-en.autobil.bin | \
 904   lt-proc -g sh-en.autogen.bin
 905 #see\@
 906 </pre>
<p>This '#' means that the generator cannot generate the correct
surface form, because its dictionary does not contain the lexical
form it was given. Why is this?</p>
<p>Basically, the analyses don't match: the 'see' in the dictionary
is see&lt;vblex&gt;&lt;pri&gt;, but the 'see' delivered by the
transfer is see&lt;vblex&gt;&lt;pri&gt;&lt;p1&gt;&lt;sg&gt;. The
Serbo-Croatian side has more information than the English side
requires. You can test this by adding the missing symbols to the
English dictionary, and then recompiling, and testing again.</p>
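<p>For that quick test, a sketch of the extra entry in the English
<code>s/ee__vblex</code> paradigm might look like this. It assumes
<code>p1</code> and <code>sg</code> are already declared as symbols
in the English dictionary, and it is only meant as an experiment, not
as the final solution:</p>
<pre>
&lt;!-- experiment only: 'see' carrying the person and number tags
     that the transfer currently passes through --&gt;
&lt;e&gt;
  &lt;p&gt;
    &lt;l&gt;ee&lt;/l&gt;
    &lt;r&gt;ee&lt;s n="vblex"/&gt;&lt;s n="pri"/&gt;&lt;s n="p1"/&gt;&lt;s n="sg"/&gt;&lt;/r&gt;
  &lt;/p&gt;
&lt;/e&gt;
</pre>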
<p>However, a more general way of taking care of this is to
write a rule. So, we open up the rules file
(<code>apertium-sh-en.trules-sh-en.xml</code>, in case you forgot).</p>
 918 <p>We need to add a new category for 'verb'.</p>
 919 <p><tt>&lt;def-cat n="vrb"&gt;<br />
 920 &nbsp;&nbsp; &lt;cat-item tags="vblex.*"/&gt;<br />
 921 &lt;/def-cat&gt;</tt></p>
<p>We also need to add attributes for tense and for person. We'll
keep it really simple for now. You can add p2 and p3, but I won't,
in order to save space.</p>
 925 <p><tt>&lt;def-attr n="temps"&gt;<br />
 926 &nbsp;&nbsp; &lt;attr-item tags="pri"/&gt;<br />
 927 &lt;/def-attr&gt;<br />
 928 <br />
 929 &lt;def-attr n="pers"&gt;<br />
 930 &nbsp;&nbsp; &lt;attr-item tags="p1"/&gt;<br />
 931 &lt;/def-attr&gt;</tt></p>
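<p>If you do want the other persons, a sketch of the extended
attribute might look like this. It assumes you have also declared
<code>p2</code> and <code>p3</code> as symbols in the dictionaries,
which this HOWTO does not do:</p>
<pre>
&lt;def-attr n="pers"&gt;
  &lt;attr-item tags="p1"/&gt;
  &lt;attr-item tags="p2"/&gt;
  &lt;attr-item tags="p3"/&gt;
&lt;/def-attr&gt;
</pre>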
 932 <p>We should also add an attribute for verbs.</p>
 933 <p><tt>&lt;def-attr n="a_verb"&gt;<br />
 934 &nbsp;&nbsp; &lt;attr-item tags="vblex"/&gt;<br />
 935 &lt;/def-attr&gt;</tt></p>
 936 <p>Now onto the rule:</p>
 937 <p><tt>&lt;rule&gt;<br />
 938 &nbsp;&nbsp; &lt;pattern&gt;<br />
 939 &nbsp;&nbsp;&nbsp;&nbsp; &lt;pattern-item n="vrb"/&gt;<br />
 940 &nbsp;&nbsp; &lt;/pattern&gt;<br />
 941 &nbsp;&nbsp; &lt;action&gt;<br />
 942 &nbsp;&nbsp;&nbsp;&nbsp; &lt;out&gt;<br />
 943 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;lu&gt;<br />
 944 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;clip pos="1"
 945 side="tl" part="lem"/&gt;<br />
 946 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;clip pos="1"
 947 side="tl" part="a_verb"/&gt;<br />
 948 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;clip pos="1"
 949 side="tl" part="temps"/&gt;<br />
 950 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;/lu&gt;<br />
 951 &nbsp;&nbsp;&nbsp;&nbsp; &lt;/out&gt;<br />
 952 &nbsp;&nbsp; &lt;/action&gt;<br />
 953 &lt;/rule&gt;<br /></tt></p>
<p>Remember when you tried commenting out the 'clip' tags in the
previous rule example and the corresponding parts disappeared from
the transfer output? Well, that's pretty much what we're doing here.
We take in a verb with a full analysis, but only output a partial
analysis (lemma + verb tag + tense tag).</p>
 959 <p>So now, if we recompile that, we get:</p>
 960 <pre>
 961 $ echo "vidim" | lt-proc sh-en.automorf.bin | \
 962   perl -ne 's,^([^/]*/)(.*)$,\^\2,; s,^(.*\$\s\^)[^/]+/(.*)$,\1\2,; print' | \
 963   apertium-transfer apertium-sh-en.trules-sh-en.xml trules-sh-en.bin sh-en.autobil.bin
 964 ^see&lt;vblex&gt;&lt;pri&gt;$^@
 965 </pre>
 966 <p>and:</p>
 967 <pre>
 968 $ echo "vidim" | lt-proc sh-en.automorf.bin  | \
 969   perl -ne 's,^([^/]*/)(.*)$,\^\2,; s,^(.*\$\s\^)[^/]+/(.*)$,\1\2,; print' | \
 970   apertium-transfer apertium-sh-en.trules-sh-en.xml trules-sh-en.bin sh-en.autobil.bin | \
 971   lt-proc -g sh-en.autogen.bin
 972 see\@
 973 </pre>
 974 <p>Try it with 'vidimo' (we see) to see if you get the correct
 975 output.</p>
 976 <p>Now try it with "vidim gramofone":</p>
 977 <pre>
$ echo "vidim gramofone" | lt-proc sh-en.automorf.bin | \
 979   perl -ne 's,^([^/]*/)(.*)$,\^\2,; s,^(.*\$\s\^)[^/]+/(.*)$,\1\2,; print' | \
 980   apertium-transfer apertium-sh-en.trules-sh-en.xml trules-sh-en.bin sh-en.autobil.bin | \
 981   lt-proc -g sh-en.autogen.bin
 982 see gramophones\@
 983 </pre>
 984 <h3>But what about personal pronouns?</h3>
<p>Well, that's great, but we're still missing the personal pronoun
that is necessary in English. In order to add it in, we first need
to edit the English morphological dictionary.</p>
 988 <p>As before, the first thing to do is add the necessary
 989 symbols:</p>
 990 <p><tt>&lt;sdef n="prn"/&gt;<br />
 991 &lt;sdef n="subj"/&gt;</tt></p>
 992 <p>Of the two symbols, prn is pronoun, and subj is subject (as in
 993 the subject of a sentence).</p>
 994 <p>Because there is no root, or 'lemma' for personal subject
 995 pronouns, we just add the pardef as follows:</p>
 996 <p><tt>&lt;pardef n="prsubj__prn"&gt;<br />
 997 &nbsp;&nbsp; &lt;e&gt;<br />
 998 &nbsp;&nbsp;&nbsp;&nbsp; &lt;p&gt;<br />
 999 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;l&gt;I&lt;/l&gt;<br />
1000 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;r&gt;prpers&lt;s
1001 n="prn"/&gt;&lt;s n="subj"/&gt;&lt;s n="p1"/&gt;&lt;s
1002 n="sg"/&gt;&lt;/r&gt;<br />
1003 &nbsp;&nbsp;&nbsp;&nbsp; &lt;/p&gt;<br />
1004 &nbsp;&nbsp; &lt;/e&gt;<br />
1005 &lt;/pardef&gt;</tt></p>
<p>Here 'prsubj' stands for 'personal subject'. The rest of them (You,
We etc.) are left as an exercise for the reader.</p>
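<p>As a starting point for that exercise, a sketch of the entry for
'we' could look like this. It goes inside the same
<code>pardef</code> and only uses symbols already defined above; the
lower-case spelling in <code>&lt;l&gt;</code> is an assumption:</p>
<pre>
&lt;!-- first person plural subject pronoun --&gt;
&lt;e&gt;
  &lt;p&gt;
    &lt;l&gt;we&lt;/l&gt;
    &lt;r&gt;prpers&lt;s n="prn"/&gt;&lt;s n="subj"/&gt;&lt;s n="p1"/&gt;&lt;s n="pl"/&gt;&lt;/r&gt;
  &lt;/p&gt;
&lt;/e&gt;
</pre>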
1008 <p>We can add an entry to the main section as follows:</p>
1009 <pre>
1010 &lt;e lm="personal subject pronouns"&gt;&lt;i/&gt;&lt;par n="prsubj__prn"/&gt;&lt;/e&gt;
1011 </pre>
1012 <p>So, save, recompile and test, and we should get something
1013 like:</p>
1014 <pre>
1015 $ echo "I" | lt-proc en-sh.automorf.bin
1016 ^I/PRPERS&lt;prn&gt;&lt;subj&gt;&lt;p1&gt;&lt;sg&gt;$
1017 </pre>
<p>(Note: it's in capitals because 'I' is in capitals).</p>
1019 <p>Now we need to amend the 'verb' rule to output the subject
1020 personal pronoun along with the correct verb form.</p>
1021 <p>First, add a category (this must be getting pretty pedestrian by
1022 now):</p>
1023 <p><tt>&lt;def-cat n="prpers"&gt;<br />
1024 &nbsp;&nbsp; &lt;cat-item lemma="prpers" tags="prn.*"/&gt;<br />
1025 &lt;/def-cat&gt;</tt></p>
<p>Now add the types of pronoun as attributes. We might as well add
the 'obj' type while we're at it, although we won't need to use it for
now:</p>
1029 <p><tt>&lt;def-attr n="tipus_prn"&gt;<br />
1030 &nbsp;&nbsp; &lt;attr-item tags="prn.subj"/&gt;<br />
1031 &nbsp;&nbsp; &lt;attr-item tags="prn.obj"/&gt;<br />
1032 &lt;/def-attr&gt;</tt></p>
1033 <p>And now to input the rule:</p>
1034 <p><tt>&lt;rule&gt;<br />
1035 &nbsp;&nbsp; &lt;pattern&gt;<br />
1036 &nbsp;&nbsp;&nbsp;&nbsp; &lt;pattern-item n="vrb"/&gt;<br />
1037 &nbsp;&nbsp; &lt;/pattern&gt;<br />
1038 &nbsp;&nbsp; &lt;action&gt;<br />
1039 &nbsp;&nbsp;&nbsp;&nbsp; &lt;out&gt;<br />
1040 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;lu&gt;<br />
1041 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;lit
1042 v="prpers"/&gt;<br />
1043 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;lit-tag
1044 v="prn"/&gt;<br />
1045 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;lit-tag
1046 v="subj"/&gt;<br />
1047 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;clip pos="1"
1048 side="tl" part="pers"/&gt;<br />
1049 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;clip pos="1"
1050 side="tl" part="nbr"/&gt;<br />
1051 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;/lu&gt;<br />
1052 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;b/&gt;<br />
1053 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;lu&gt;<br />
1054 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;clip pos="1"
1055 side="tl" part="lem"/&gt;<br />
1056 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;clip pos="1"
1057 side="tl" part="a_verb"/&gt;<br />
1058 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;clip pos="1"
1059 side="tl" part="temps"/&gt;<br />
1060 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;/lu&gt;<br />
1061 &nbsp;&nbsp;&nbsp;&nbsp; &lt;/out&gt;<br />
1062 &nbsp;&nbsp; &lt;/action&gt;<br />
1063 &lt;/rule&gt;</tt></p>
1064 <p>This is pretty much the same rule as before, only we made a
1065 couple of small changes.</p>
1066 <p>We needed to output:</p>
1067 <p><tt>^prpers&lt;prn&gt;&lt;subj&gt;&lt;p1&gt;&lt;sg&gt;$
1068 ^see&lt;vblex&gt;&lt;pri&gt;$</tt></p>
1069 <p>so that the generator could choose the right pronoun and the
1070 right form of the verb.</p>
1071 <p>So, a quick rundown:</p>
1072 <ul>
<li>&lt;lit&gt; prints a literal string, in this case
"prpers".</li>
<li>&lt;lit-tag&gt; prints a literal tag. Because we can't get these
tags from the verb, we add them ourselves: "prn" for pronoun, and
"subj" for subject.</li>
<li>&lt;b/&gt; prints a blank (a space).</li>
1079 </ul>
<p>Note that we retrieved the information for person and number
directly from the verb.</p>
1082 <p>So, now if we recompile and test that again:</p>
1083 <pre>
1084 $ echo "vidim gramofone" | lt-proc sh-en.automorf.bin  | \
1085   perl -ne 's,^([^/]*/)(.*)$,\^\2,; s,^(.*\$\s\^)[^/]+/(.*)$,\1\2,; print' | \
1086   apertium-transfer apertium-sh-en.trules-sh-en.xml trules-sh-en.bin sh-en.autobil.bin | \
1087   lt-proc -g sh-en.autogen.bin
1088 I see gramophones
1089 </pre>
1090 <p>Which, while it isn't exactly prize-winning prose (much like
1091 this HOWTO), is a fairly accurate translation.</p>
1092 <h3>So tell me about the record player</h3>
1093 <p>While gramophone is an English word, it isn't the best
1094 translation. Gramophone is typically used for the very old kind,
1095 you know with the needle instead of the stylus, and no
1096 amplification. A better translation would be 'record player'.
1097 Although this is more than one word, we can treat it as if it is
1098 one word by using multiword (<i>multipalabra</i>)
1099 constructions.</p>
<p>We don't need to touch the Serbo-Croatian dictionary, just the
English one and the bilingual one this time, so open them up.</p>
1102 <p>The plural of 'record player' is 'record players', so it takes
1103 the same paradigm as gramophone (gramophone__n) — in that we just
1104 add 's'. All we need to do is add a new element to the main
1105 section.</p>
1106 <p><tt>&lt;e lm="record
1107 player"&gt;&lt;i&gt;record&lt;b/&gt;player&lt;/i&gt;&lt;par
1108 n="gramophone__n"/&gt;&lt;/e&gt;</tt></p>
1109 <p>The only thing different about this is the use of the &lt;b/&gt;
1110 tag, although this isn't entirely new as we saw it in use in the
1111 rules file.</p>
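<p>For the bilingual dictionary, a sketch of the matching entry
might look like the following. It assumes the existing entry for
<cite>gramofon</cite> has the same shape as the verb entry shown
earlier, and that this entry replaces it rather than being added
alongside it:</p>
<pre>
&lt;!-- gramofon now maps to the multiword 'record player' --&gt;
&lt;e&gt;&lt;p&gt;&lt;l&gt;gramofon&lt;s n="n"/&gt;&lt;/l&gt;&lt;r&gt;record&lt;b/&gt;player&lt;s n="n"/&gt;&lt;/r&gt;&lt;/p&gt;&lt;/e&gt;
</pre>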
1112 <p>So, recompile and test in the orthodox fashion:</p>
1113 <pre>
1114 $ echo "vidim gramofone" | lt-proc sh-en.automorf.bin | \
1115   perl -ne 's,^([^/]*/)(.*)$,\^\2,; s,^(.*\$\s\^)[^/]+/(.*)$,\1\2,; print' | \
1116   apertium-transfer apertium-sh-en.trules-sh-en.xml trules-sh-en.bin sh-en.autobil.bin  | \
1117   lt-proc -g sh-en.autogen.bin
1118 I see record players
1119 </pre>
<p>Perfect. A big benefit of using multiwords is that you can
translate idiomatic expressions as single units, without having to
translate them word by word.</p>
1123 </body>
1124 </html>