1 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
2 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
3 <html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
4 <head>
5 <meta name="generator" content=
6 "HTML Tidy for MkLinux (vers 1 September 2005), see www.w3.org" />
7 <meta name="generator" content="Bluefish 1.0.7" />
8 <meta http-equiv="Content-Type" content=
9 "text/html; charset=utf-8" />
10 <title>Apertium New Language Pair HOWTO</title>
12 <style type="text/css">
13 /*<![CDATA[*/
14 .comment { background-color : yellow ; margin-left : 10em}
15 /*]]>*/
16 </style>
17 </head>
18 <body>
19 <h2>Apertium New Language Pair HOWTO</h2>
20 <p>This HOWTO document will describe how to start a new language
21 pair for the Apertium machine translation system from scratch.</p>
22 <p>It does not assume any knowledge of linguistics, or machine
23 translation above the level of being able to distinguish nouns from
24 verbs (and prepositions etc.)</p>
25 <div class="comment">In one of my papers I have: "As a result of
26 this, and of the intuitive approach to machine translation used in
27 Apertium, the amount of linguistic knowledge necessary about the
28 source and target language to build data for Apertium is kept to a
29 minimum, and it may be easily learned on top of basic high-school
30 grammar skills such as: morphological analysis of words:
31 parts-of-speech or lexical categories (noun, verb, preposition,
32 etc.) and basic morphology (number, gender, case, person, etc.;
33 agreement (such as gender and number agreement between nouns and
34 their modifiers: adjectives, determiners, etc.); main local
35 structural differences between the source and target language:
36 position of adjectives with respect to nouns (e.g, adjective after
37 noun in Spanish, before noun in English), prepositional regime,
38 etc." You may want to integrate this in the text somehow.</div>
39 <h3>Authors</h3>
40 <ul>
41 <li>Francis Tyers</li>
42 </ul>
43 <h3>Licence</h3>
44 <p>Copyright (c) 2007 Francis Tyers<br />
45 Permission is granted to copy, distribute and/or modify this
46 document under the terms of the GNU Free Documentation License,
47 Version 1.2 or any later version published by the Free Software
48 Foundation; with no Invariant Sections, no Front-Cover Texts, and
49 no Back-Cover Texts. A copy of the license is included in the
50 section entitled "<a href=
51 "http://www.gnu.org/licenses/fdl.html">GNU Free Documentation
52 License</a>".</p>
53 <p>Any extracts of code are released under the GNU GPL, Version
54 2.0</p>
55 <h3>Introduction</h3>
<p>Apertium is, as you've probably realised by now, a machine
translation system. Well, not quite: it's a machine translation
platform. It provides an engine and toolbox which allow you to
build your own machine translation systems. The <em>only</em> thing
you need to do is write the data. The data consists, on a basic
level, of three dictionaries and a few rules (to deal with word
re-ordering and other grammatical stuff).</p>
63 <p>For a more detailed introduction into how it all works, there
64 are some excellent papers on the project's website <a href=
65 "http://apertium.sourceforge.net/">apertium.sourceforge.net</a>.</p>
66 <h3>You will need</h3>
67 <ul>
68 <li><code>lttoolbox</code></li>
69 <li><code>libxml</code> utils (<code>xmllint</code> etc.)</li>
70 <li><code>apertium</code></li>
71 <li>a text editor (or a specialized XML editor if you prefer
72 to)</li>
73 </ul>
<p>This document will not describe how to install these packages.
For more information on that, please see the documentation section
of the Apertium website.</p>
77 <h3>What does a language pair consist of?</h3>
<p>The Apertium machine translation system is of the
shallow-transfer type. This basically means it works on
dictionaries and shallow transfer rules. Shallow transfer
is distinguished from "deep transfer" in that it doesn't do
full syntactic parsing: the rules are typically operations on
groups of lexical units, rather than operations on parse trees.
On a basic level, there are three main dictionaries:</p>
85 <!--
86 <div class="comment">Perhaps it would be a good idea to try to
87 explain what "deep transfer" would mean (full syntactic parsing with rules specifying
88 operations on parse [sub]trees, etc.)
89 so that the readers get an idea of what "shallow" would mean.</div>
91 :: I added something above, but it is difficult (for me) to
92 explain without using terminology that the readers would
93 be unfamiliar with. If it can be refined great, if not
94 I'm not sure... ~fran
95 -->
<ul>
97 <li><b>The morphological dictionary for language <em>xx</em></b>:
98 this contains the rules of how words in language <em>xx</em> are
99 inflected. In our example this will be called:
100 <code>apertium-sh-en.sh.dix</code></li>
101 <li><b>The morphological dictionary for language <em>yy</em></b>:
102 this contains the rules of how words in language <em>yy</em> are
103 inflected. In our example this will be called:
104 <code>apertium-sh-en.en.dix</code></li>
105 <li><b>Bilingual dictionary</b>: contains correspondences between
106 words and symbols in the two languages. In our example this will be
107 called: <code>apertium-sh-en.sh-en.dix</code></li>
108 </ul>
<p>In a translation pair, both languages can be either source or
target for translation; these are relative terms.
111 <!--
112 <div class="comment">Not that <em>xx</em> and <em>yy</em> are being used, why cannot
113 we use these instead of sh and en (for the initial examples discussion)? Also,
114 I would advocate for using <em>source</em> and <em>target</em> as <em>relative</em>
115 terms; this was the initial motivation for using <em>xx</em> and <em>yy</em>.</div>
117 :: About the filenames: I wanted to give a concrete example. Regarding
118 the source/target, you're right, I've removed the 'typically we will
119 use' part. ~fran
-->
</p>
<p>There are also two files for transfer rules. These are the rules
which govern how words are re-ordered in sentences, e.g. <cite>chat
noir</cite> -&gt; <cite>cat black</cite> -&gt; <cite>black
cat</cite>. They also govern agreement of gender, number etc. The
rules can also be used to insert or delete lexical items, as will
be described later. These files are:</p>
128 <ul>
129 <li><b>language <em>xx</em> to language <em>yy</em> transfer
130 rules</b>: this file contains rules for how language <em>xx</em>
131 will be changed into language <em>yy</em>. In our example this will
132 be: <code>apertium-sh-en.trules-sh-en.xml</code></li>
<li><b>language <em>yy</em> to language <em>xx</em> transfer
134 rules</b>: this file contains rules for how language <em>yy</em>
135 will be changed into language <em>xx</em>. In our example this will
136 be: <code>apertium-sh-en.trules-en-sh.xml</code></li>
137 </ul>
138 <!--
139 <div class="comment">Perhaps one should note that this
140 is Apertium "level 1".</div>
142 :: I need to note the version, but I think I'll re-do the
143 examples etc. from an Apertium 2 point of view. I remember
144 using HOWTOs before and being annoyed that they were out of
145 date! ~fran
-->
148 <p>Many of the language pairs currently available have other files,
149 but we won't cover them here. These files are the only ones
150 required to generate a functional system.</p>
151 <h3>Language pair</h3>
<p>As the file names may have suggested, this HOWTO will use
the example of translating Serbo-Croatian to English to explain how
to create a basic system. This is not an ideal pair, as the system
works better for more closely related languages, and furthermore it
does not currently support the full Serbo-Croatian alphabet, but
that shouldn't present a problem for the simple examples we'll have
here.</p>
159 <!--
160 <div class="comment">We can remove the sentence "As may have been alluded..." if we use <em>xx</em>
161 and <em>yy</em> for the examples up to here.</div>
163 :: I reckon its alright, and good to give concrete examples,
164 although this may get removed when i do a re-write ~fran
-->
167 <h3>A brief note on terms</h3>
<p>There are a number of terms that will need to be understood before
we continue.</p>
170 <p>The first is <em>lemma</em>. A lemma is the citation form of a
171 word. It is the word stripped of any grammatical information. For
172 example, the lemma of the word <em>cats</em> is <em>cat</em>. In
173 English nouns this will typically be the singular form of the word
174 in question. For verbs, the lemma is the infinitive stripped of
175 <em>to</em>, e.g. the lemma of <em>was</em> would be
176 <em>be</em>.</p>
177 <p>The second is <em>symbol</em>. In the context of the Apertium
178 system, <em>symbol</em> refers to a grammatical label. The word <cite>cats</cite> is
179 a plural noun, therefore it will have the <em>noun</em> symbol and the
180 <em>plural</em> symbol. In the input and output of Apertium modules these are
181 typically given between angle brackets, as follows:</p>
182 <ul>
183 <li><code>&lt;n&gt;</code> for noun.</li>
<li><code>&lt;pl&gt;</code> for plural.</li>
185 </ul>
186 <!--
187 <div class="comment">Make it clear that symbols appear in angle brackets in
188 the input and the output of modules, but not in dictionaries?</div>
190 :: Good point, done. Feel free to rephrase. ~fran
-->
<p>Other examples of symbols are <code>&lt;sg&gt;</code> (singular), <code>&lt;p1&gt;</code>
(first person), <code>&lt;pri&gt;</code> (present indicative), etc. When written in
angle brackets, the symbols may also be referred to as <em>tags</em>. It is worth
noting that in many of the currently available language pairs the symbol definitions
are acronyms or contractions of words in Catalan. For example, <em>vbhaver</em>
comes from <em>vb</em> (verb) and <em>haver</em> ("to have" in Catalan).
Symbols are defined in <code>&lt;sdef&gt;</code> tags and used in <code>&lt;s&gt;</code>
tags.</p>
201 <!--
202 <div class="comment">Should we warn the reader that symbol names in many of the available language
203 pair packages are in Catalan?</div>
205 :: Good idea, done. Feel free to rephrase. ~fran
-->
<p>The third word is <em>paradigm</em>. In the context of the
Apertium system, paradigm refers to an example of how a particular
group of words inflects. In the morphological dictionary, lemmas
(see above) are linked to paradigms, which allows us to describe how
a given lemma inflects without having to write out all of the
endings.</p>
214 <p>An example of the utility of this is, if we wanted to store the
215 two adjectives <em>happy</em> and <em>lazy</em>, instead of storing
216 two lots of the same thing:</p>
217 <ul>
218 <li><em>happy</em>, <em>happ</em> (<em>y</em>, <em>ier</em>,
219 <em>iest</em>)</li>
220 <li><em>lazy</em>, <em>laz</em> (<em>y</em>, <em>ier</em>,
221 <em>iest</em>)</li>
222 </ul>
223 <p>We can simply store one, and then say "<em>lazy</em>, inflects
224 like <em>happy</em>", or indeed "<em>shy</em> inflects like
225 <em>happy</em>", "<em>naughty</em> inflects like <em>happy</em>",
226 "<em>friendly</em> inflects like <em>happy</em>" etc. In this
227 example, happy would be the paradigm, the model for how the others
228 inflect. The precise description of how this is defined will be
229 explained shortly. Paradigms are defined in
230 <code>&lt;pardef&gt;</code> tags, and used in
231 <code>&lt;par&gt;</code> tags.</p>
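<p>As a rough preview of the notation explained in the following
sections, the idea could be sketched as below. The paradigm name
<code>happ/y__adj</code> and the symbols <code>adj</code>,
<code>comp</code> (comparative) and <code>sup</code> (superlative)
are assumptions made for this illustration, not definitions taken
from an existing language pair; each symbol would also need its own
<code>&lt;sdef&gt;</code>.</p>
<pre>
&lt;!-- one shared paradigm holding the endings -y, -ier and -iest --&gt;
&lt;pardef n="happ/y__adj"&gt;
  &lt;e&gt;&lt;p&gt;&lt;l&gt;y&lt;/l&gt;&lt;r&gt;y&lt;s n="adj"/&gt;&lt;/r&gt;&lt;/p&gt;&lt;/e&gt;
  &lt;e&gt;&lt;p&gt;&lt;l&gt;ier&lt;/l&gt;&lt;r&gt;y&lt;s n="adj"/&gt;&lt;s n="comp"/&gt;&lt;/r&gt;&lt;/p&gt;&lt;/e&gt;
  &lt;e&gt;&lt;p&gt;&lt;l&gt;iest&lt;/l&gt;&lt;r&gt;y&lt;s n="adj"/&gt;&lt;s n="sup"/&gt;&lt;/r&gt;&lt;/p&gt;&lt;/e&gt;
&lt;/pardef&gt;

&lt;!-- each adjective only names its stem and the paradigm it follows --&gt;
&lt;e lm="happy"&gt;&lt;i&gt;happ&lt;/i&gt;&lt;par n="happ/y__adj"/&gt;&lt;/e&gt;
&lt;e lm="lazy"&gt;&lt;i&gt;laz&lt;/i&gt;&lt;par n="happ/y__adj"/&gt;&lt;/e&gt;
</pre>
<p>With entries like these, the form <em>lazier</em> would be
analysed as <code>lazy&lt;adj&gt;&lt;comp&gt;</code> even though its
endings are written out only once, under <em>happy</em>.</p>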
232 <h3>Getting started</h3>
233 <h4>Monolingual dictionaries</h4>
<p>Let's start by making our first source language dictionary. The
235 dictionary is an XML file. Fire up your text editor and type the
236 following:</p>
237 <p><tt>&lt;?xml version="1.0" encoding="ISO-8859-1"?&gt;<br />
238 &lt;dictionary&gt;<br />
239 <br />
240 &lt;/dictionary&gt;</tt></p>
241 <p>Save the file as <code>apertium-sh-en.sh.dix</code> with an
242 ISO-8859-1 encoding. A short note on encoding: currently (as of
243 April 2007), Apertium only supports the ISO-8859-1 single byte
244 encoding. There is work ongoing to port it to Unicode (indeed an
245 experimental version of <code>lttoolbox</code> with UTF-8 support
246 is available from the SVN repository on the Apertium project
247 site).</p>
<p>Note: It is important to have your locale set up correctly when
writing/reading files. You can find out your current locale setting
by doing <code>echo $LANG</code> from a shell.</p>
<p>So, the file so far defines that we want to start a dictionary.
In order for it to be useful, we need to add some more entries. The
first is an alphabet, which defines the set of letters that may be
used in the dictionary. For Serbo-Croatian it would normally look
something like the following, containing all the letters of the
Serbo-Croatian alphabet:</p>
<p>
<tt>&lt;alphabet&gt;ABCČĆDDžĐEFGHIJKLLjMNNjOPRSŠTUVZŽabcčćddžđefghijklljmnnjoprsštuvzž&lt;/alphabet&gt;</tt></p>
<p>However, in our example it will look like this:</p>
<p>
<tt>&lt;alphabet&gt;ABCDEFGHIJKLMNOPRSTUVZabcdefghijklmnoprstuvz&lt;/alphabet&gt;</tt></p>
<p>The reason for this is that, as mentioned above,
<code>lttoolbox</code> requires ISO-8859-1 encoding, and
<em>Č</em>, <em>Ć</em>, <em>Dž</em>, <em>Đ</em>, <em>Lj</em>,
<em>Nj</em>, <em>Š</em>, and <em>Ž</em> (along with their
minuscule forms) are not found in this encoding. Some languages
have got round this by choosing other characters from ISO-8859-1 to
represent the missing letters, and then transliterating. For
example, using the character 'ç' (c with cedilla) to represent 'ć'
(c with acute accent), or using 'ð' (eth) to represent 'đ' (d with
stroke). We will not be using this method, although an example of
its use may be found in the Romanian-Spanish translation pair.</p>
273 <!--
274 <div class="comment">I don't think <em>lj</em> and <em>nj</em> pose any problem, since they are not
275 special letters but digraphs. Are you really going to use these transliterations?</div>
277 :: lj and nj can be written as both digraphs (or, more 'properly' as
278 paired characters, like æ - note that we won't use them or the
279 transliterations. ~fran
-->
282 <p>Place the alphabet below the <tt>&lt;dictionary&gt;</tt>
283 tag.</p>
<p>Next we need to define some symbols. Let's start off with the
285 simple stuff, noun (<code>n</code>) in singular (<code>sg</code>)
286 and plural (<code>pl</code>).</p>
287 <p><tt>&lt;sdefs&gt;<br />
288 &nbsp;&nbsp; &lt;sdef n="n"/&gt;<br />
289 &nbsp;&nbsp; &lt;sdef n="sg"/&gt;<br />
290 &nbsp;&nbsp; &lt;sdef n="pl"/&gt;<br />
291 &lt;/sdefs&gt;<br /></tt></p>
<p>The symbol names do not have to be so short; in fact they could
just be written out in full, but as you'll be typing them a lot, it
makes sense to abbreviate.</p>
<p>Unfortunately, it isn't quite so simple: nouns in Serbo-Croatian
inflect for more than just number; they also inflect for gender and
case. However, we'll assume for the purposes of this example that
the noun is masculine and in the nominative case (a full example
may be found at the end of this document).</p>
300 <p>Next thing is to define a section for the paradigms,</p>
301 <p><tt>&lt;pardefs&gt;<br />
302 <br />
303 &lt;/pardefs&gt;<br /></tt></p>
304 <p>and a dictionary section:</p>
305 <p><tt>&lt;section id="main" type="standard"&gt;<br />
306 <br />
307 &lt;/section&gt;<br /></tt></p>
<p>
There are two types of sections. The first is a <code>standard</code>
section, which contains words, enclitics etc. The second type is an
<code>inconditional</code> section, which typically contains
punctuation etc. We don't have an inconditional section here, although
it will be demonstrated later.
</p>
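<p>For orientation only, an <code>inconditional</code> section
might look roughly like the sketch below. The section id
<code>final</code> and the symbol <code>sent</code> (for
sentence-final punctuation) are assumptions made for this
illustration, and <code>sent</code> would need its own
<code>&lt;sdef&gt;</code>.</p>
<pre>
&lt;!-- hypothetical punctuation section; the id and the symbol are assumed --&gt;
&lt;section id="final" type="inconditional"&gt;
  &lt;e&gt;&lt;p&gt;&lt;l&gt;.&lt;/l&gt;&lt;r&gt;.&lt;s n="sent"/&gt;&lt;/r&gt;&lt;/p&gt;&lt;/e&gt;
  &lt;e&gt;&lt;p&gt;&lt;l&gt;?&lt;/l&gt;&lt;r&gt;?&lt;s n="sent"/&gt;&lt;/r&gt;&lt;/p&gt;&lt;/e&gt;
&lt;/section&gt;
</pre>
<p>The entries use the same pair format as any other dictionary
entry; only the <code>type</code> of the enclosing section
differs.</p>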
315 <!--
316 <div class="comment">Perhaps we should say that each <code>&lt;section&gt;</code> is a dictionary section withcertain properties; for instance, a <code>standard</code> section contains words that will
317 only be segmented if the next character is out of the alphabet, whereas a <code>unconditional</code> section
318 contains words that will be segmented regardless of the following character, etc. What is
319 called "a section for mapping lemmas to paradigms" is simply a "dictionary section".</div>
321 :: we should include an example of 'splitting up' / i'm not
322 quite sure how this works. ~fran
-->
325 <p>So, our file should now look something like:</p>
<p><tt>&lt;?xml version="1.0" encoding="ISO-8859-1"?&gt;<br />
&lt;dictionary&gt;<br />
&nbsp;&nbsp; &lt;alphabet&gt;ABCDEFGHIJKLMNOPRSTUVZabcdefghijklmnoprstuvz&lt;/alphabet&gt;<br />
&nbsp;&nbsp; &lt;sdefs&gt;<br />
&nbsp;&nbsp;&nbsp;&nbsp; &lt;sdef n="n"/&gt;<br />
&nbsp;&nbsp;&nbsp;&nbsp; &lt;sdef n="sg"/&gt;<br />
&nbsp;&nbsp;&nbsp;&nbsp; &lt;sdef n="pl"/&gt;<br />
&nbsp;&nbsp; &lt;/sdefs&gt;<br />
&nbsp;&nbsp; &lt;pardefs&gt;<br />
<br />
&nbsp;&nbsp; &lt;/pardefs&gt;<br />
&nbsp;&nbsp; &lt;section id="main" type="standard"&gt;<br />
<br />
&nbsp;&nbsp; &lt;/section&gt;<br />
&lt;/dictionary&gt;</tt></p>
340 <p>Now we've got the skeleton in place, we can start by adding a
341 noun. The noun in question will be 'gramofon' (which means
342 'gramophone' or 'record player').</p>
343 <p>The first thing we need to do, as we have no prior paradigms, is
344 to define a paradigm.</p>
345 <p>Remember we're assuming masculine gender and nominative case.
346 The singular form of the noun is 'gramofon', and the plural is
347 'gramofoni'. So:</p>
348 <p><tt>&lt;pardef n="gramofon__n"&gt;<br />
349 &nbsp;&nbsp; &lt;e&gt;<br />
350 &nbsp;&nbsp;&nbsp;&nbsp; &lt;p&gt;<br />
351 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;l/&gt;<br />
352 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;r&gt;&lt;s n="n"/&gt;&lt;s
353 n="sg"/&gt;&lt;/r&gt;<br />
354 &nbsp;&nbsp;&nbsp;&nbsp; &lt;/p&gt;<br />
355 &nbsp;&nbsp; &lt;/e&gt;<br />
356 &nbsp;&nbsp; &lt;e&gt;<br />
357 &nbsp;&nbsp;&nbsp;&nbsp; &lt;p&gt;<br />
358 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;l&gt;i&lt;/l&gt;<br />
359 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;r&gt;&lt;s n="n"/&gt;&lt;s
360 n="pl"/&gt;&lt;/r&gt;<br />
361 &nbsp;&nbsp;&nbsp;&nbsp; &lt;/p&gt;<br />
362 &nbsp;&nbsp; &lt;/e&gt;<br />
363 &lt;/pardef&gt;</tt></p>
364 <p>Note: the '&lt;l/&gt;' (equivalent to &lt;l&gt;&lt;/l&gt;)
365 denotes that there is no extra material to be added to the stem for
366 the singular.</p>
367 <p>This may seem like a rather verbose way of describing it, but
368 there are reasons for it and it quickly becomes second nature.
369 You're probably wondering what the &lt;e&gt;, &lt;p&gt;, &lt;l&gt;
370 and &lt;r&gt; stand for. Well,</p>
371 <ul>
372 <li><code>e</code>, is for <em>entry</em>.</li>
373 <li><code>p</code>, is for <em>pair</em>.</li>
374 <li><code>l</code>, is for <em>left</em>.</li>
375 <li><code>r</code>, is for <em>right</em>.</li>
376 </ul>
377 <p>Why <em>left</em> and <em>right</em>? Well, the morphological
378 dictionaries will later be compiled into finite state machines.
379 Compiling them left to right produces analyses from words, and from
380 right to left produces words from analyses. For example:</p>
381 <ul>
382 <li><code>gramofoni</code> (left to right)
383 <code>gramofon&lt;n&gt;&lt;pl&gt;</code> (analysis)</li>
384 <li><code>gramofon&lt;n&gt;&lt;pl&gt;</code> (right to left)
385 <code>gramofoni</code> (generation)</li>
386 </ul>
387 <p>Now we've defined a paradigm, we need to link it to its lemma,
388 <em>gramofon</em>. We put this in the <code>section</code> that
389 we've defined.</p>
390 <p>The entry to put in will look like:</p>
391 <p><tt>&lt;e lm="gramofon"&gt;&lt;i&gt;gramofon&lt;/i&gt;&lt;par
392 n="gramofon__n"/&gt;&lt;/e&gt;</tt></p>
393 <p>A quick run down on the abbreviations:</p>
394 <ul>
395 <li><code>lm</code>, is for <em>lemma</em>.</li>
396 <li><code>i</code>, is for <em>identity</em> (the left and the
397 right are the same).</li>
398 <li><code>par</code>, is for <em>paradigm</em>.</li>
399 </ul>
<p>This entry states the lemma of the word, <code>gramofon</code>,
the root, <code>gramofon</code>, and the paradigm with which it
inflects, <code>gramofon__n</code>. The difference between the lemma
and the root is that the lemma is the citation form of the word,
while the root is the substring of the lemma to which endings are
added. This will become clearer later when we show an entry where
the two are different.</p>
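<p>As a small illustration of a lemma and a root that differ,
consider the noun <em>knjiga</em> ('book'), whose nominative plural
is <em>knjige</em>. The paradigm name <code>knjig/a__n</code> is an
assumption made for this sketch, and gender (feminine here) and
case are ignored, in keeping with the simplification used so
far.</p>
<pre>
&lt;!-- the endings -a and -e attach to the root "knjig" --&gt;
&lt;pardef n="knjig/a__n"&gt;
  &lt;e&gt;&lt;p&gt;&lt;l&gt;a&lt;/l&gt;&lt;r&gt;a&lt;s n="n"/&gt;&lt;s n="sg"/&gt;&lt;/r&gt;&lt;/p&gt;&lt;/e&gt;
  &lt;e&gt;&lt;p&gt;&lt;l&gt;e&lt;/l&gt;&lt;r&gt;a&lt;s n="n"/&gt;&lt;s n="pl"/&gt;&lt;/r&gt;&lt;/p&gt;&lt;/e&gt;
&lt;/pardef&gt;

&lt;!-- lemma "knjiga", root "knjig" --&gt;
&lt;e lm="knjiga"&gt;&lt;i&gt;knjig&lt;/i&gt;&lt;par n="knjig/a__n"/&gt;&lt;/e&gt;
</pre>
<p>Here the right-hand side of each pair puts the <em>a</em> back,
so that <em>knjige</em> would be analysed as
<code>knjiga&lt;n&gt;&lt;pl&gt;</code>, with the full lemma rather
than the bare root.</p>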
407 <p>We're now ready to test the dictionary. Save it, and then return
408 to the shell. We first need to compile it (with
409 <code>lt-comp</code>), then we can test it (with
410 <code>lt-proc</code>).</p>
411 <pre>
412 $ lt-comp lr apertium-sh-en.sh.dix sh-en.automorf.bin
413 </pre>
414 <p>Should produce the output:</p>
415 <pre>
416 main@standard 12 12
417 </pre>
<p>As we are compiling it left to right, we're producing an
<em>analyser</em>. Let's make a <em>generator</em> too.</p>
420 <pre>
421 $ lt-comp rl apertium-sh-en.sh.dix sh-en.autogen.bin
422 </pre>
423 <p>At this stage, the command should produce the same output.</p>
424 <!--
425 <div class="comment">Not necessarily true for the examples below in which you
426 equate <em>gramofoni</em> (nominative) and <em>gramophone</em> (accusative); when generating, you should decide which one...
427 </div>
429 :: at the moment we're not dealing with generating serbo-croatian, and
430 the 'output' refers to the 'main@standard 12 12' part. I've clarified this
431 somewhat ~fran
-->
433 <p>We can now test these. Run <code>lt-proc</code> on the
434 analyser.</p>
435 <pre>
436 $ lt-proc sh-en.automorf.bin
437 </pre>
438 <p>Now try it out, type in <code>gramofoni</code>
439 (<em>gramophones</em>), and see the output:</p>
440 <pre>
441 ^gramofoni/gramofon&lt;n&gt;&lt;pl&gt;$
442 </pre>
<p>Now, for the English dictionary, do the same thing, but
substitute the English word <em>gramophone</em> for <em>gramofon</em>, and
change the plural inflection. What if you want to use the more correct
term 'record player'? We'll explain how to do that later.</p>
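<p>Following those steps, the English dictionary might end up
looking something like the sketch below. The paradigm name
<code>gramophone__n</code> and the choice of alphabet are
assumptions; any consistent name would do.</p>
<pre>
&lt;?xml version="1.0" encoding="ISO-8859-1"?&gt;
&lt;dictionary&gt;
  &lt;alphabet&gt;ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz&lt;/alphabet&gt;
  &lt;sdefs&gt;
    &lt;sdef n="n"/&gt;
    &lt;sdef n="sg"/&gt;
    &lt;sdef n="pl"/&gt;
  &lt;/sdefs&gt;
  &lt;pardefs&gt;
    &lt;!-- singular adds nothing, plural adds -s --&gt;
    &lt;pardef n="gramophone__n"&gt;
      &lt;e&gt;&lt;p&gt;&lt;l/&gt;&lt;r&gt;&lt;s n="n"/&gt;&lt;s n="sg"/&gt;&lt;/r&gt;&lt;/p&gt;&lt;/e&gt;
      &lt;e&gt;&lt;p&gt;&lt;l&gt;s&lt;/l&gt;&lt;r&gt;&lt;s n="n"/&gt;&lt;s n="pl"/&gt;&lt;/r&gt;&lt;/p&gt;&lt;/e&gt;
    &lt;/pardef&gt;
  &lt;/pardefs&gt;
  &lt;section id="main" type="standard"&gt;
    &lt;e lm="gramophone"&gt;&lt;i&gt;gramophone&lt;/i&gt;&lt;par n="gramophone__n"/&gt;&lt;/e&gt;
  &lt;/section&gt;
&lt;/dictionary&gt;
</pre>
<p>Saved as <code>apertium-sh-en.en.dix</code>, this mirrors the
Serbo-Croatian file; only the alphabet, the lemma and the plural
ending differ.</p>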
447 <!--
448 <div class="comment">Check</div>
450 :: I added that they should change the plural inflection (i -> s),
451 what this what the check referred to ? ~fran
-->
453 <p>You should now have two files in the directory:</p>
454 <ul>
455 <li><code>apertium-sh-en.sh.dix</code> which contains a (very)
456 basic Serbo-Croatian morphological dictionary, and</li>
457 <li><code>apertium-sh-en.en.dix</code> which contains a (very)
458 basic English morphological dictionary.</li>
459 </ul>
460 <h4>Bilingual dictionary</h4>
461 <p>So we now have two morphological dictionaries, next thing to
462 make is the bilingual dictionary. This describes mappings between
463 words. All dictionaries use the same format (which is specified in
464 the DTD, <code>dix.dtd</code>).</p>
465 <p>Create a new file, <code>apertium-sh-en.sh-en.dix</code> and add
466 the basic skeleton:</p>
467 <p><tt>&lt;?xml version="1.0" encoding="ISO-8859-1"?&gt;<br />
468 &lt;dictionary&gt;<br />
469 &nbsp;&nbsp; &lt;alphabet/&gt;<br />
470 &nbsp;&nbsp; &lt;sdefs&gt;<br />
471 &nbsp;&nbsp;&nbsp;&nbsp; &lt;sdef n="n"/&gt;<br />
472 &nbsp;&nbsp;&nbsp;&nbsp; &lt;sdef n="sg"/&gt;<br />
473 &nbsp;&nbsp;&nbsp;&nbsp; &lt;sdef n="pl"/&gt;<br />
474 &nbsp;&nbsp; &lt;/sdefs&gt;<br />
475 <br />
476 &nbsp;&nbsp; &lt;section id="main" type="standard"&gt;<br />
477 <br />
478 &nbsp;&nbsp; &lt;/section&gt;<br />
479 &lt;/dictionary&gt;</tt></p>
480 <p>Now we need to add an entry to <em>translate</em> between the
481 two words. Something like:</p>
482 <p><tt>&lt;e&gt;&lt;p&gt;&lt;l&gt;gramofon&lt;s
483 n="n"/&gt;&lt;/l&gt;&lt;r&gt;gramophone&lt;s
484 n="n"/&gt;&lt;/r&gt;&lt;/p&gt;&lt;/e&gt;</tt></p>
485 <!--
486 <div class="comment">Some of this could be done with
487 &lt;i&gt;</div>
489 :: How do you mean? ~fran
-->
491 <p>Because there are a lot of these entries, they're typically
492 written on one line to facilitate easier reading of the file. Again
493 with the '<code>l</code>' and '<code>r</code>' right? Well, we
494 compile it left to right to produce the Serbo-Croatian → English
495 dictionary, and right to left to produce the English →
496 Serbo-Croatian dictionary.</p>
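<p>As a tentative sketch of the 'record player' case mentioned
earlier, a multi-word translation could be written with the
<code>&lt;b/&gt;</code> element, which stands for a blank in the
dictionary format. The entries below are illustrative assumptions:
the first would take the place of the <em>gramophone</em> entry in
the bilingual dictionary, and the second assumes an English noun
paradigm named <code>gramophone__n</code>.</p>
<pre>
&lt;!-- bilingual dictionary: gramofon maps to "record player" --&gt;
&lt;e&gt;&lt;p&gt;&lt;l&gt;gramofon&lt;s n="n"/&gt;&lt;/l&gt;&lt;r&gt;record&lt;b/&gt;player&lt;s n="n"/&gt;&lt;/r&gt;&lt;/p&gt;&lt;/e&gt;

&lt;!-- English dictionary: the matching multi-word lemma --&gt;
&lt;e lm="record player"&gt;&lt;i&gt;record&lt;b/&gt;player&lt;/i&gt;&lt;par n="gramophone__n"/&gt;&lt;/e&gt;
</pre>
<p>The lemma on the right-hand side of the bilingual entry has to
match a lemma in the English dictionary, otherwise the generator
would have nothing to inflect.</p>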
497 <p>So, once this is done, run the following commands:</p>
498 <pre>
499 $ lt-comp lr apertium-sh-en.sh.dix sh-en.automorf.bin
500 $ lt-comp rl apertium-sh-en.sh.dix sh-en.autogen.bin
502 $ lt-comp lr apertium-sh-en.en.dix en-sh.automorf.bin
503 $ lt-comp rl apertium-sh-en.en.dix en-sh.autogen.bin
505 $ lt-comp lr apertium-sh-en.sh-en.dix sh-en.autobil.bin
506 $ lt-comp rl apertium-sh-en.sh-en.dix en-sh.autobil.bin
507 </pre>
<p>These commands generate the morphological analysers (<code>automorf</code>),
the morphological generators (<code>autogen</code>) and the word
lookups (<code>autobil</code>; the <em>bil</em> is for "bilingual").</p>
511 <!--
512 <div class="comment">These three names come from interNOSTRUM but I
513 am not sure I like them much... unfortunately, they are used all
514 over...</div>
516 :: Aye, they're a bit cryptic, but a lot of the stuff is,
517 when you know what they mean, or where they come from
518 it makes more sense. ~fran
-->
520 <h3>Transfer rules</h3>
521 <p>So, now we have two morphological dictionaries, and a bilingual
522 dictionary. All that we need now is a transfer rule for nouns.
523 Transfer rule files have their own DTD (<code>transfer.dtd</code>)
524 which can be found in the Apertium package. If you need to
525 implement a rule it is often a good idea to look in the rule files
526 of other language pairs first. Many rules can be recycled/reused
527 between languages. For example the one described below would be
528 useful for any null-subject language.</p>
529 <p>Start out like all the others with a basic skeleton:</p>
530 <p><tt>&lt;?xml version="1.0" encoding="ISO-8859-1"?&gt;<br />
531 &lt;transfer&gt;<br />
532 <br />
533 &lt;/transfer&gt;<br /></tt></p>
534 <p>At the moment, because we're ignoring case, we just need to make
535 a rule that takes the grammatical symbols input and outputs them
536 again.</p>
<p>We first need to define categories and attributes. Categories
and attributes both allow us to group grammatical symbols.
Categories allow us to group symbols for the purposes of matching
(for example, '<code>n.*</code>' is all nouns). Attributes allow us
to group a set of symbols that can be chosen from (for example,
'<code>sg</code>' and '<code>pl</code>' may be grouped as an
attribute '<code>number</code>').</p>
<p>Let's add the necessary sections:</p>
545 <p><tt>&lt;section-def-cats&gt;<br />
546 <br />
547 &lt;/section-def-cats&gt;<br />
548 &lt;section-def-attrs&gt;<br />
549 <br />
550 &lt;/section-def-attrs&gt;</tt></p>
<p>As we're only inflecting nouns in singular and plural, we
need to add a category for nouns and an attribute for number.
Something like the following will suffice:</p>
554 <p>Into <code>section-def-cats</code> add:</p>
555 <p><tt>&lt;def-cat n="nom"&gt;<br />
556 &nbsp;&nbsp; &lt;cat-item tags="n.*"/&gt;<br />
557 &lt;/def-cat&gt;</tt></p>
<p>This catches all nouns (lemmas followed by &lt;n&gt; then
anything) and refers to them as "<code>nom</code>" (we'll see how
that's used later).</p>
561 <!--
562 <div class="comment">Catalan names again</div>
564 :: I think its fine for now, we note that the
565 names are fairly arbitrary. ~fran
-->
<p>Into the section <code>section-def-attrs</code>, add:</p>
568 <p><tt>&lt;def-attr n="nbr"&gt;<br />
569 &nbsp;&nbsp; &lt;attr-item tags="sg"/&gt;<br />
570 &nbsp;&nbsp; &lt;attr-item tags="pl"/&gt;<br />
571 &lt;/def-attr&gt;</tt></p>
572 <p>and then</p>
573 <p><tt>&lt;def-attr n="a_nom"&gt;<br />
574 &nbsp;&nbsp; &lt;attr-item tags="n"/&gt;<br />
575 &lt;/def-attr&gt;</tt></p>
576 <p>The first defines the attribute <code>nbr</code> (number), which
577 can be either singular (<code>sg</code>) or plural
578 (<code>pl</code>).</p>
579 <p>The second defines the attribute <code>a_nom</code> (attribute
580 <em>noun</em>).</p>
581 <p>Next we need to add a section for global variables:</p>
582 <p><tt>&lt;section-def-vars&gt;<br />
583 <br />
584 &lt;/section-def-vars&gt;</tt></p>
585 <p>These variables are used to store or transfer attributes
586 between rules. We need only one for now,</p>
587 <pre>
588 &lt;def-var n="number"/&gt;
589 </pre>
590 <!--
591 <div class="comment">Perhaps we should be more specific and say
592 that these are <em>state</em> variables which may be used to
593 propagate information computed in a certain application of a rule,
594 to later applications of the same or other rules.</div>
596 :: Hmm, I think this is getting a bit complicated. Although
597 it would be good to give a concrete example of where
598 they are used. But I haven't used these yet. I'll dig
599 around in the manual. ~fran
-->
601 <p>Finally, we need to add a rule, to take in the noun and then
602 output it in the correct form. We'll need a rules section...</p>
603 <p><tt>&lt;section-rules&gt;<br />
604 <br />
605 &lt;/section-rules&gt;</tt></p>
606 <p>Changing the pace from the previous examples, I'll just paste
607 this rule, then go through it, rather than the other way round.</p>
608 <p><tt>&lt;rule&gt;<br />
609 &nbsp;&nbsp; &lt;pattern&gt;<br />
610 &nbsp;&nbsp;&nbsp;&nbsp; &lt;pattern-item n="nom"/&gt;<br />
611 &nbsp;&nbsp; &lt;/pattern&gt;<br />
612 &nbsp;&nbsp; &lt;action&gt;<br />
613 &nbsp;&nbsp;&nbsp;&nbsp; &lt;out&gt;<br />
614 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;lu&gt;<br />
615 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;clip pos="1"
616 side="tl" part="lem"/&gt;<br />
617 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;clip pos="1"
618 side="tl" part="a_nom"/&gt;<br />
619 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;clip pos="1"
620 side="tl" part="nbr"/&gt;<br />
621 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;/lu&gt;<br />
622 &nbsp;&nbsp;&nbsp;&nbsp; &lt;/out&gt;<br />
623 &nbsp;&nbsp; &lt;/action&gt;<br />
624 &lt;/rule&gt;</tt></p>
625 <div class="comment">Many people are used to pattern–action
626 languages such as AWK, perl or lex. How about drawing an analogy
627 here?</div>
<p>The first tag is obvious: it defines a rule. The second tag,
<code>pattern</code>, basically says: "apply this rule if this
pattern is found". In this example the pattern consists of a single
noun (defined by the category item <code>nom</code>). Note that
patterns are matched longest-match first. So if you have three
rules, where the first catches "&lt;prn&gt;&lt;vblex&gt;&lt;n&gt;", the
second catches "&lt;prn&gt;&lt;vblex&gt;" and the third catches
"&lt;n&gt;", then for input that fits all three, the pattern matched and
the rule executed will be the first.</p>
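<p>The three competing rules could be sketched as follows. The
category names <code>prn</code> and <code>vblex</code> are
assumptions made for this illustration (only <code>nom</code> has
been defined so far), and the actions are left empty because only
the patterns matter here.</p>
<pre>
&lt;!-- in section-def-cats, next to "nom"; both names are assumed --&gt;
&lt;def-cat n="prn"&gt;&lt;cat-item tags="prn.*"/&gt;&lt;/def-cat&gt;
&lt;def-cat n="vblex"&gt;&lt;cat-item tags="vblex.*"/&gt;&lt;/def-cat&gt;

&lt;!-- in section-rules --&gt;
&lt;rule&gt; &lt;!-- first rule: pronoun, verb, noun --&gt;
  &lt;pattern&gt;
    &lt;pattern-item n="prn"/&gt;
    &lt;pattern-item n="vblex"/&gt;
    &lt;pattern-item n="nom"/&gt;
  &lt;/pattern&gt;
  &lt;action&gt;&lt;!-- omitted --&gt;&lt;/action&gt;
&lt;/rule&gt;
&lt;rule&gt; &lt;!-- second rule: pronoun, verb --&gt;
  &lt;pattern&gt;
    &lt;pattern-item n="prn"/&gt;
    &lt;pattern-item n="vblex"/&gt;
  &lt;/pattern&gt;
  &lt;action&gt;&lt;!-- omitted --&gt;&lt;/action&gt;
&lt;/rule&gt;
&lt;rule&gt; &lt;!-- third rule: noun only --&gt;
  &lt;pattern&gt;
    &lt;pattern-item n="nom"/&gt;
  &lt;/pattern&gt;
  &lt;action&gt;&lt;!-- omitted --&gt;&lt;/action&gt;
&lt;/rule&gt;
</pre>
<p>Each <code>pattern-item</code> matches one whole lexical unit, so
for a pronoun followed by a verb and a noun the first rule covers
the longest stretch of input and is the one applied.</p>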
637 <!--
638 <div class="comment">I would rather use an example in which rules
639 match whole lexical units, because that is the way the transfer
640 works. One cannot write patterns matching parts of lexical units,
641 as far as I know</div>
643 :: Not sure I understand, could you re-write this part? ~fran
-->
<p>For each pattern, there is an associated action, which produces
an associated output, <code>out</code>. The output is a lexical
unit (<code>lu</code>).</p>
648 <p>The <code>clip</code> tag allows a user to select and manipulate
649 attributes and parts of the source language
650 (<code>side="sl"</code>), or target language
651 (<code>side="tl"</code>) lexical item.</p>
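<p>Putting the pieces from this section together, the complete
<code>apertium-sh-en.trules-sh-en.xml</code> file would look roughly
as follows. This is simply the fragments above assembled in one
place; the order of the sections is an assumption based on how they
were introduced.</p>
<pre>
&lt;?xml version="1.0" encoding="ISO-8859-1"?&gt;
&lt;transfer&gt;
  &lt;section-def-cats&gt;
    &lt;def-cat n="nom"&gt;
      &lt;cat-item tags="n.*"/&gt;
    &lt;/def-cat&gt;
  &lt;/section-def-cats&gt;
  &lt;section-def-attrs&gt;
    &lt;def-attr n="nbr"&gt;
      &lt;attr-item tags="sg"/&gt;
      &lt;attr-item tags="pl"/&gt;
    &lt;/def-attr&gt;
    &lt;def-attr n="a_nom"&gt;
      &lt;attr-item tags="n"/&gt;
    &lt;/def-attr&gt;
  &lt;/section-def-attrs&gt;
  &lt;section-def-vars&gt;
    &lt;def-var n="number"/&gt;
  &lt;/section-def-vars&gt;
  &lt;section-rules&gt;
    &lt;!-- a single noun is copied out: lemma, noun tag, number --&gt;
    &lt;rule&gt;
      &lt;pattern&gt;
        &lt;pattern-item n="nom"/&gt;
      &lt;/pattern&gt;
      &lt;action&gt;
        &lt;out&gt;
          &lt;lu&gt;
            &lt;clip pos="1" side="tl" part="lem"/&gt;
            &lt;clip pos="1" side="tl" part="a_nom"/&gt;
            &lt;clip pos="1" side="tl" part="nbr"/&gt;
          &lt;/lu&gt;
        &lt;/out&gt;
      &lt;/action&gt;
    &lt;/rule&gt;
  &lt;/section-rules&gt;
&lt;/transfer&gt;
</pre>
<p>Each <code>clip</code> picks one part of the first (and only)
matched item, <code>pos="1"</code>, from its target-language
side.</p>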
652 <p>Let's compile it and test it. Transfer rules are compiled
653 with:</p>
654 <pre>
655 $ apertium-preprocess-transfer apertium-sh-en.trules-sh-en.xml trules-sh-en.bin
656 </pre>
657 <p>Which will generate a trules-sh-en.bin file.</p>
<p>Now we're ready to test our machine translation system. One
crucial part is still missing, the part-of-speech (PoS) tagger, but
that will be explained shortly. In the meantime we can test the
system as it is.</p>
<p>First, let's analyse a word, <em>gramofoni</em>:</p>
663 <pre>
664 $ echo "gramofoni" | lt-proc sh-en.automorf.bin
665 ^gramofon/gramofon&lt;n&gt;&lt;pl&gt;$
666 </pre>
<p>Normally the PoS tagger would now choose the right analysis
based on the part of speech, but we don't have a PoS tagger yet, so
we can use this little Perl script, which just outputs the first
analysis retrieved.</p>
671 <!--
672 <div class="comment">I am not sure the reader would understand what
673 <em>version</em> means here unless we have an example in which the
674 morphological analyser derives more than one lexical form for a
675 given source form... Perhaps we should explain that, and give an
676 example...</div>
678 :: Good idea. I'll try and work an example in. ~fran
-->
<pre>
$ echo "gramofoni" | lt-proc sh-en.automorf.bin | \
  perl -ne 's,^([^/]*/)(.*)$,\^\2,; s,^(.*\$\s\^)[^/]+/(.*)$,\1\2,; print'
^gramofon&lt;n&gt;&lt;pl&gt;$
</pre>
686 <!--
687 <div class="comment">We are writing the pipelines ourselves, but
688 Apertium uses <code>modes.xml</code> file to control this; perhaps
689 a later version of this howto can explain how to write
690 <code>modes.xml</code> files... if you look at
691 <code>apertium-en-ca</code>, you will see that the
692 <code>modes.xml</code> file contains many modes, to use parts of
693 the pipeline for diagnostics and testsing.</div>
695 :: I'll add something on modes.xml at the end. ~fran
-->
698 <p>Now let's process that with the transfer rule:</p>
699 <pre>
700 $ echo "gramofoni" | lt-proc sh-en.automorf.bin | \
701 perl -ne 's,^([^/]*/)(.*)$,\^\2,; s,^(.*\$\s\^)[^/]+/(.*)$,\1\2,; print' | \
702 apertium-transfer apertium-sh-en.trules-sh-en.xml trules-sh-en.bin sh-en.autobil.bin
703 </pre>
704 <p>It will output:</p>
705 <pre>
706 ^gramophone&lt;n&gt;&lt;pl&gt;$^@
707 </pre>
708 <!--
709 <div class="comment">What's the "^@"?</div>
711 :: I think that its because we don't have any
712 punctuation specified. I'll check it and
713 put a note in. ~fran
-->
716 <ul>
717 <li>'gramophone' is the target language (<code>side="tl"</code>)
718 lemma (<code>lem</code>) at position 1 (<code>pos="1"</code>).</li>
719 <li>'&lt;n&gt;' is the target language <code>a_nom</code> at
720 position 1.</li>
721 <li>'&lt;pl&gt;' is the target language attribute of
722 <em>number</em> (<code>nbr</code>) at position 1.</li>
723 </ul>
724 <p>Try commenting out one of these clip statements, recompiling and
725 seeing what happens.</p>
<p>So, now we have the output from the transfer. The only thing
that remains is to generate the target-language inflected forms.
For this we use <code>lt-proc</code> again, but in generation mode
(<code>-g</code>) rather than analysis mode.</p>
730 <pre>
731 $ echo "gramofoni" | lt-proc sh-en.automorf.bin | \
732 perl -ne 's,^([^/]*/)(.*)$,\^\2,; s,^(.*\$\s\^)[^/]+/(.*)$,\1\2,; print' | \
733 apertium-transfer apertium-sh-en.trules-sh-en.xml trules-sh-en.bin sh-en.autobil.bin | \
734 lt-proc -g sh-en.autogen.bin
736 gramophones\@
737 </pre>
738 <!--
739 <div class="comment">What is the "\@"?</div>
741 :: I think it is as a result of not having any
742 punctuation defined. ~fran
-->
<p>And <i>c'est ça</i>. You now have a machine translation system
that translates a Serbo-Croatian noun into an English noun.
Obviously this isn't very useful, but we'll get onto the more
complex stuff soon. Oh, and don't worry about the '@' symbol, I'll
explain that soon too.</p>
<p>Think of a few other words that inflect the same way as
<em>gramofon</em>. How about adding those? We don't need to add any
paradigms, just the entries in the main section of the monolingual
and bilingual dictionaries.</p>
753 <h3>Bring on the verbs</h3>
<p>OK, so we have a system that translates nouns, but that's pretty
useless. We want to translate verbs too, and even whole sentences!
How about we start with the verb <cite>to see</cite>. In
Serbo-Croatian this is <cite>videti</cite>. Serbo-Croatian is a
null-subject language, which means that it doesn't typically use
personal pronouns before the conjugated form of the verb. English
is not. So, for example, <cite>I see</cite> in English would be
translated as <cite>vidim</cite> in Serbo-Croatian.</p>
762 <pre>
763 * Vidim
764 * see&lt;p1&gt;&lt;sg&gt;
765 * I see
766 </pre>
767 <p>Note: &lt;p1&gt; denotes <em>first person</em></p>
<p>This will be important when we come to write the transfer rule
for verbs. Other examples of null-subject languages include
Spanish, Romanian and Polish. This also has the effect that, while
we only need to add the verb to the Serbo-Croatian morphological
dictionary, we need to add both the verb and the personal pronouns
to the English morphological dictionary. We'll go through both of
these.</p>
775 <p>The other forms of the verb <cite>videti</cite> are:
776 <cite>vidiš</cite>, <cite>vidi</cite>, <cite>vidimo</cite>,
777 <cite>vidite</cite>, and <cite>vide</cite>; which correspond to:
778 <cite>you see</cite> (singular), <cite>he sees</cite>, <cite>we
779 see</cite>, <cite>you see</cite> (plural), and <cite>they
780 see</cite>.</p>
<p>There are two forms of <cite>you see</cite>: one is plural and
formal singular (<cite>vidite</cite>) and the other is singular and
informal (<cite>vidiš</cite>).</p>
<p>We're going to try to translate the sentence <cite>Vidim
gramofoni</cite> into <cite>I see gramophones</cite>. In the
interests of space, we'll just add enough information to do the
translation and will leave filling out the paradigms (adding the
other conjugations of the verb) as an exercise for the reader.</p>
<p>The astute reader will have realised by this point that we can't
just translate <cite>vidim gramofoni</cite>, because it is not a
grammatically correct sentence in Serbo-Croatian. The correct
sentence would be <cite>vidim gramofone</cite>, as the noun takes
the accusative case. We'll have to add that form too. There is no
need to add the case information for now though; we just add it as
another option for the plural. So, in the noun paradigm, copy the
<code>&lt;e&gt;</code> block for '<code>i</code>' and change the
'<code>i</code>' to '<code>e</code>' in the copy.</p>
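<p>As a sketch of what that copy might look like: the layout of
the existing '<code>i</code>' entry is inferred here from the
analyser output <code>gramofon&lt;n&gt;&lt;pl&gt;</code> shown
earlier, so check it against your own noun paradigm before
copying.</p>

```xml
<!-- Existing entry: plural ending -i -->
<e>
  <p>
    <l>i</l>
    <r><s n="n"/><s n="pl"/></r>
  </p>
</e>
<!-- Copy with the ending changed to -e (the accusative plural).
     No case symbol is added yet, so both endings give the same analysis. -->
<e>
  <p>
    <l>e</l>
    <r><s n="n"/><s n="pl"/></r>
  </p>
</e>
```

<p>After recompiling, analysing <cite>gramofone</cite> should then
give the same plural noun analysis as <cite>gramofoni</cite>.</p>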
798 <!--
799 <div class="comment">Mikel's comments so far end here.</div>
-->
<p>The first thing we need to do is add some more symbols. We need
a symbol for 'verb', which we'll call "vblex" (this means lexical
verb, as opposed to modal verbs and other types). Verbs have
'person' and 'tense' along with number, so let's add a couple of
those as well. We need to translate "I see", so for person we
should add "p1", or 'first person', and for tense "pri", or
'present indicative'.</p>
808 <pre>
809 &lt;sdef n="vblex"/&gt;
810 &lt;sdef n="p1"/&gt;
811 &lt;sdef n="pri"/&gt;
812 </pre>
<p>After we've done this, as with the nouns, we add a paradigm for
the verb conjugation. The first line will be:</p>
<p><tt>&lt;pardef n="vid/eti__vblex"&gt;</tt></p>
<p>The '/' marks the point where the stem ends: the endings (the
parts between the &lt;l&gt; &lt;/l&gt; tags) are added after
it.</p>
818 <p>Then the inflection for first person singular:</p>
819 <p><tt>&lt;e&gt;<br />
820 &nbsp;&nbsp; &lt;p&gt;<br />
821 &nbsp;&nbsp;&nbsp;&nbsp; &lt;l&gt;im&lt;/l&gt;<br />
822 &nbsp;&nbsp;&nbsp;&nbsp; &lt;r&gt;eti&lt;s n="vblex"/&gt;&lt;s
823 n="pri"/&gt;&lt;s n="p1"/&gt;&lt;s n="sg"/&gt;&lt;/r&gt;<br />
824 &nbsp;&nbsp; &lt;/p&gt;<br />
825 &lt;/e&gt;</tt></p>
<p>The 'im' denotes the ending (as in 'vidim'). It is necessary to
add 'eti' to the &lt;r&gt; section, because the main-section entry
only supplies the root 'vid' and the lemma 'videti' has to be
rebuilt from it. The rest is fairly straightforward: 'vblex' is
lexical verb, 'pri' is present indicative tense, 'p1' is first
person and 'sg' is singular. We can also add the plural, which will
be the same except with 'imo' instead of 'im' and 'pl' instead of
'sg'.</p>
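<p>Putting the two first-person entries together, the paradigm
might look like the sketch below. It contains only what the text
above describes; the remaining persons are not filled in.</p>

```xml
<pardef n="vid/eti__vblex">
  <!-- vid + im  ->  videti<vblex><pri><p1><sg> -->
  <e>
    <p>
      <l>im</l>
      <r>eti<s n="vblex"/><s n="pri"/><s n="p1"/><s n="sg"/></r>
    </p>
  </e>
  <!-- vid + imo ->  videti<vblex><pri><p1><pl> -->
  <e>
    <p>
      <l>imo</l>
      <r>eti<s n="vblex"/><s n="pri"/><s n="p1"/><s n="pl"/></r>
    </p>
  </e>
</pardef>
```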
<p>After this we need to add a lemma-to-paradigm mapping to the
main section:</p>
835 <p><tt>&lt;e lm="videti"&gt;&lt;i&gt;vid&lt;/i&gt;&lt;par
836 n="vid/eti__vblex"/&gt;&lt;/e&gt;</tt></p>
837 <p>Note: the content of &lt;i&gt; &lt;/i&gt; is the root, not the
838 lemma.</p>
<p>That's the work on the Serbo-Croatian dictionary done for now.
Let's compile it and then test it.</p>
841 <pre>
842 $ lt-comp lr apertium-sh-en.sh.dix sh-en.automorf.bin
843 main@standard 23 25
844 $ echo "vidim" | lt-proc sh-en.automorf.bin
845 ^vidim/videti&lt;vblex&gt;&lt;pri&gt;&lt;p1&gt;&lt;sg&gt;$
846 $ echo "vidimo" | lt-proc sh-en.automorf.bin
847 ^vidimo/videti&lt;vblex&gt;&lt;pri&gt;&lt;p1&gt;&lt;pl&gt;$
848 </pre>
849 <p>Ok, so now we do the same for the English dictionary (remember
850 to add the same symbol definitions here as you added to the
851 Serbo-Croatian one).</p>
852 <p>The paradigm is:</p>
853 <p><tt>&lt;pardef n="s/ee__vblex"&gt;</tt></p>
<p>The stem is just 's' because the past tense is 'saw'. Now, we
can do one of two things. We could add separate entries for first
and second person, but they are the same form. In fact, all
present-tense forms of the verb 'to see' (except third person
singular) are 'see'. So instead we make one entry for 'see' and
give it only the 'pri' symbol.</p>
859 <p><tt>&lt;e&gt;<br />
860 &nbsp;&nbsp; &lt;p&gt;<br />
861 &nbsp;&nbsp;&nbsp;&nbsp; &lt;l&gt;ee&lt;/l&gt;<br />
862 &nbsp;&nbsp;&nbsp;&nbsp; &lt;r&gt;ee&lt;s n="vblex"/&gt;&lt;s
863 n="pri"/&gt;&lt;/r&gt;<br />
864 &nbsp;&nbsp; &lt;/p&gt;<br />
865 &lt;/e&gt;</tt></p>
866 <p>and as always, an entry in the main section:</p>
867 <p><tt>&lt;e lm="see"&gt;&lt;i&gt;s&lt;/i&gt;&lt;par
868 n="s/ee__vblex"/&gt;&lt;/e&gt;</tt></p>
<p>Then let's save, recompile and test:</p>
870 <pre>
871 $ lt-comp lr apertium-sh-en.en.dix en-sh.automorf.bin
872 main@standard 18 19
874 $ echo "see" | lt-proc en-sh.automorf.bin
875 ^see/see&lt;vblex&gt;&lt;pri&gt;$
876 </pre>
877 <p>Now for the obligatory entry in the bilingual dictionary:</p>
878 <pre>
879 &lt;e&gt;&lt;p&gt;&lt;l&gt;videti&lt;s n="vblex"/&gt;&lt;/l&gt;&lt;r&gt;see&lt;s n="vblex"/&gt;&lt;/r&gt;&lt;/p&gt;&lt;/e&gt;
880 </pre>
881 <p>(again, don't forget to add the sdefs from earlier)</p>
882 <p>And recompile:</p>
883 <pre>
884 $ lt-comp lr apertium-sh-en.sh-en.dix sh-en.autobil.bin
885 main@standard 18 18
886 $ lt-comp rl apertium-sh-en.sh-en.dix en-sh.autobil.bin
887 main@standard 18 18
888 </pre>
889 <p>Now to test:</p>
890 <pre>
891 $ echo "vidim" | lt-proc sh-en.automorf.bin | \
892 perl -ne 's,^([^/]*/)(.*)$,\^\2,; s,^(.*\$\s\^)[^/]+/(.*)$,\1\2,; print' | \
893 apertium-transfer apertium-sh-en.trules-sh-en.xml trules-sh-en.bin sh-en.autobil.bin
895 ^see&lt;vblex&gt;&lt;pri&gt;&lt;p1&gt;&lt;sg&gt;$^@
896 </pre>
897 <p>We get the analysis passed through correctly, but when we try
898 and generate a surface form from this, we get a '#', like
899 below:</p>
900 <pre>
901 $ echo "vidim" | lt-proc sh-en.automorf.bin | \
902 perl -ne 's,^([^/]*/)(.*)$,\^\2,; s,^(.*\$\s\^)[^/]+/(.*)$,\1\2,; print' | \
903 apertium-transfer apertium-sh-en.trules-sh-en.xml trules-sh-en.bin sh-en.autobil.bin | \
904 lt-proc -g sh-en.autogen.bin
905 #see\@
906 </pre>
<p>This '#' means that the generator cannot generate the surface
form, because its dictionary does not contain the lexical form it
was given. Why is this?</p>
<p>Basically the analyses don't match: the 'see' in the dictionary
is see&lt;vblex&gt;&lt;pri&gt;, but the 'see' delivered by the
transfer is see&lt;vblex&gt;&lt;pri&gt;&lt;p1&gt;&lt;sg&gt;. The
Serbo-Croatian side has more information than the English side
requires. You can test this by adding the missing symbols to the
English dictionary, and then recompiling and testing again.</p>
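<p>For that quick experiment, the English entry could be extended
roughly as follows. This is only a sketch covering first person
singular; every other person and number would need its own entry,
which is why it is not the approach taken below.</p>

```xml
<pardef n="s/ee__vblex">
  <e>
    <p>
      <l>ee</l>
      <!-- p1 and sg added so the entry matches what the transfer delivers -->
      <r>ee<s n="vblex"/><s n="pri"/><s n="p1"/><s n="sg"/></r>
    </p>
  </e>
</pardef>
```

<p>Remember that the binary used by <code>lt-proc -g</code> has to
be recompiled from the English dictionary for the change to take
effect.</p>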
<p>However, the more usual way of taking care of this is to write a
rule. So, we open up the rules file
(<code>apertium-sh-en.trules-sh-en.xml</code>, in case you forgot).</p>
918 <p>We need to add a new category for 'verb'.</p>
919 <p><tt>&lt;def-cat n="vrb"&gt;<br />
920 &nbsp;&nbsp; &lt;cat-item tags="vblex.*"/&gt;<br />
921 &lt;/def-cat&gt;</tt></p>
<p>We also need to add attributes for tense and for person. We'll
keep it really simple for now; you can add p2 and p3, but I won't,
in order to save space.</p>
925 <p><tt>&lt;def-attr n="temps"&gt;<br />
926 &nbsp;&nbsp; &lt;attr-item tags="pri"/&gt;<br />
927 &lt;/def-attr&gt;<br />
928 <br />
929 &lt;def-attr n="pers"&gt;<br />
930 &nbsp;&nbsp; &lt;attr-item tags="p1"/&gt;<br />
931 &lt;/def-attr&gt;</tt></p>
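<p>If you do want the other persons, the <code>pers</code>
attribute might be extended as in this sketch. It assumes you have
also added <code>&lt;sdef n="p2"/&gt;</code> and
<code>&lt;sdef n="p3"/&gt;</code> to the dictionaries, which this
HOWTO does not do.</p>

```xml
<def-attr n="pers">
  <attr-item tags="p1"/>
  <!-- Assumed additions; they need matching sdef symbols in the dictionaries -->
  <attr-item tags="p2"/>
  <attr-item tags="p3"/>
</def-attr>
```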
932 <p>We should also add an attribute for verbs.</p>
933 <p><tt>&lt;def-attr n="a_verb"&gt;<br />
934 &nbsp;&nbsp; &lt;attr-item tags="vblex"/&gt;<br />
935 &lt;/def-attr&gt;</tt></p>
936 <p>Now onto the rule:</p>
937 <p><tt>&lt;rule&gt;<br />
938 &nbsp;&nbsp; &lt;pattern&gt;<br />
939 &nbsp;&nbsp;&nbsp;&nbsp; &lt;pattern-item n="vrb"/&gt;<br />
940 &nbsp;&nbsp; &lt;/pattern&gt;<br />
941 &nbsp;&nbsp; &lt;action&gt;<br />
942 &nbsp;&nbsp;&nbsp;&nbsp; &lt;out&gt;<br />
943 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;lu&gt;<br />
944 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;clip pos="1"
945 side="tl" part="lem"/&gt;<br />
946 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;clip pos="1"
947 side="tl" part="a_verb"/&gt;<br />
948 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;clip pos="1"
949 side="tl" part="temps"/&gt;<br />
950 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;/lu&gt;<br />
951 &nbsp;&nbsp;&nbsp;&nbsp; &lt;/out&gt;<br />
952 &nbsp;&nbsp; &lt;/action&gt;<br />
953 &lt;/rule&gt;<br /></tt></p>
<p>Remember when you tried commenting out the 'clip' tags in the
previous rule example and the corresponding tags disappeared from
the transfer output? Well, that's pretty much what we're doing here.
We take in a verb with a full analysis, but only output a partial
analysis (lemma + verb tag + tense tag).</p>
959 <p>So now, if we recompile that, we get:</p>
960 <pre>
961 $ echo "vidim" | lt-proc sh-en.automorf.bin | \
962 perl -ne 's,^([^/]*/)(.*)$,\^\2,; s,^(.*\$\s\^)[^/]+/(.*)$,\1\2,; print' | \
963 apertium-transfer apertium-sh-en.trules-sh-en.xml trules-sh-en.bin sh-en.autobil.bin
964 ^see&lt;vblex&gt;&lt;pri&gt;$^@
965 </pre>
966 <p>and:</p>
967 <pre>
968 $ echo "vidim" | lt-proc sh-en.automorf.bin | \
969 perl -ne 's,^([^/]*/)(.*)$,\^\2,; s,^(.*\$\s\^)[^/]+/(.*)$,\1\2,; print' | \
970 apertium-transfer apertium-sh-en.trules-sh-en.xml trules-sh-en.bin sh-en.autobil.bin | \
971 lt-proc -g sh-en.autogen.bin
972 see\@
973 </pre>
974 <p>Try it with 'vidimo' (we see) to see if you get the correct
975 output.</p>
<p>Now try it with "vidim gramofone":</p>
<pre>
$ echo "vidim gramofone" | lt-proc sh-en.automorf.bin | \
  perl -ne 's,^([^/]*/)(.*)$,\^\2,; s,^(.*\$\s\^)[^/]+/(.*)$,\1\2,; print' | \
  apertium-transfer apertium-sh-en.trules-sh-en.xml trules-sh-en.bin sh-en.autobil.bin | \
  lt-proc -g sh-en.autogen.bin
see gramophones\@
</pre>
984 <h3>But what about personal pronouns?</h3>
<p>Well, that's great, but we're still missing the personal pronoun
that is necessary in English. In order to add it, we first need
to edit the English morphological dictionary.</p>
988 <p>As before, the first thing to do is add the necessary
989 symbols:</p>
990 <p><tt>&lt;sdef n="prn"/&gt;<br />
991 &lt;sdef n="subj"/&gt;</tt></p>
<p>Of the two symbols, prn is pronoun and subj is subject (as in
the subject of a sentence).</p>
<p>Because there is no root, or 'lemma', for personal subject
pronouns, we just add the pardef as follows:</p>
996 <p><tt>&lt;pardef n="prsubj__prn"&gt;<br />
997 &nbsp;&nbsp; &lt;e&gt;<br />
998 &nbsp;&nbsp;&nbsp;&nbsp; &lt;p&gt;<br />
999 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;l&gt;I&lt;/l&gt;<br />
1000 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;r&gt;prpers&lt;s
1001 n="prn"/&gt;&lt;s n="subj"/&gt;&lt;s n="p1"/&gt;&lt;s
1002 n="sg"/&gt;&lt;/r&gt;<br />
1003 &nbsp;&nbsp;&nbsp;&nbsp; &lt;/p&gt;<br />
1004 &nbsp;&nbsp; &lt;/e&gt;<br />
1005 &lt;/pardef&gt;</tt></p>
<p>Here 'prsubj' stands for 'personal subject'. The rest of them
(you, we, etc.) are left as an exercise for the reader.</p>
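<p>As a starting point for that exercise, here is a sketch of the
paradigm with one more pronoun added. The 'we' entry is an
addition that simply follows the pattern of the 'I' entry, using
the <code>pl</code> symbol that the nouns already use.</p>

```xml
<pardef n="prsubj__prn">
  <!-- First person singular, as given above -->
  <e>
    <p>
      <l>I</l>
      <r>prpers<s n="prn"/><s n="subj"/><s n="p1"/><s n="sg"/></r>
    </p>
  </e>
  <!-- First person plural, added by analogy -->
  <e>
    <p>
      <l>we</l>
      <r>prpers<s n="prn"/><s n="subj"/><s n="p1"/><s n="pl"/></r>
    </p>
  </e>
</pardef>
```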
1008 <p>We can add an entry to the main section as follows:</p>
1009 <pre>
1010 &lt;e lm="personal subject pronouns"&gt;&lt;i/&gt;&lt;par n="prsubj__prn"/&gt;&lt;/e&gt;
1011 </pre>
1012 <p>So, save, recompile and test, and we should get something
1013 like:</p>
1014 <pre>
1015 $ echo "I" | lt-proc en-sh.automorf.bin
1016 ^I/PRPERS&lt;prn&gt;&lt;subj&gt;&lt;p1&gt;&lt;sg&gt;$
1017 </pre>
<p>(Note: it's in capitals because 'I' is in capitals).</p>
1019 <p>Now we need to amend the 'verb' rule to output the subject
1020 personal pronoun along with the correct verb form.</p>
1021 <p>First, add a category (this must be getting pretty pedestrian by
1022 now):</p>
1023 <p><tt>&lt;def-cat n="prpers"&gt;<br />
1024 &nbsp;&nbsp; &lt;cat-item lemma="prpers" tags="prn.*"/&gt;<br />
1025 &lt;/def-cat&gt;</tt></p>
<p>Now add the types of pronoun as attributes. We might as well add
the 'obj' type while we're at it, although we won't need to use it
for now:</p>
1029 <p><tt>&lt;def-attr n="tipus_prn"&gt;<br />
1030 &nbsp;&nbsp; &lt;attr-item tags="prn.subj"/&gt;<br />
1031 &nbsp;&nbsp; &lt;attr-item tags="prn.obj"/&gt;<br />
1032 &lt;/def-attr&gt;</tt></p>
1033 <p>And now to input the rule:</p>
1034 <p><tt>&lt;rule&gt;<br />
1035 &nbsp;&nbsp; &lt;pattern&gt;<br />
1036 &nbsp;&nbsp;&nbsp;&nbsp; &lt;pattern-item n="vrb"/&gt;<br />
1037 &nbsp;&nbsp; &lt;/pattern&gt;<br />
1038 &nbsp;&nbsp; &lt;action&gt;<br />
1039 &nbsp;&nbsp;&nbsp;&nbsp; &lt;out&gt;<br />
1040 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;lu&gt;<br />
1041 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;lit
1042 v="prpers"/&gt;<br />
1043 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;lit-tag
1044 v="prn"/&gt;<br />
1045 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;lit-tag
1046 v="subj"/&gt;<br />
1047 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;clip pos="1"
1048 side="tl" part="pers"/&gt;<br />
1049 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;clip pos="1"
1050 side="tl" part="nbr"/&gt;<br />
1051 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;/lu&gt;<br />
1052 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;b/&gt;<br />
1053 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;lu&gt;<br />
1054 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;clip pos="1"
1055 side="tl" part="lem"/&gt;<br />
1056 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;clip pos="1"
1057 side="tl" part="a_verb"/&gt;<br />
1058 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;clip pos="1"
1059 side="tl" part="temps"/&gt;<br />
1060 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;/lu&gt;<br />
1061 &nbsp;&nbsp;&nbsp;&nbsp; &lt;/out&gt;<br />
1062 &nbsp;&nbsp; &lt;/action&gt;<br />
1063 &lt;/rule&gt;</tt></p>
1064 <p>This is pretty much the same rule as before, only we made a
1065 couple of small changes.</p>
1066 <p>We needed to output:</p>
1067 <p><tt>^prpers&lt;prn&gt;&lt;subj&gt;&lt;p1&gt;&lt;sg&gt;$
1068 ^see&lt;vblex&gt;&lt;pri&gt;$</tt></p>
1069 <p>so that the generator could choose the right pronoun and the
1070 right form of the verb.</p>
1071 <p>So, a quick rundown:</p>
1072 <ul>
<li>&lt;lit&gt; prints a literal string, in this case
"prpers".</li>
<li>&lt;lit-tag&gt; prints a literal tag. Because we can't get
these tags from the verb, we add them ourselves: "prn" for pronoun
and "subj" for subject.</li>
<li>&lt;b/&gt; prints a blank (a space).</li>
1079 </ul>
<p>Note that we retrieved the information for person and number
for the pronoun directly from the verb.</p>
1082 <p>So, now if we recompile and test that again:</p>
1083 <pre>
1084 $ echo "vidim gramofone" | lt-proc sh-en.automorf.bin | \
1085 perl -ne 's,^([^/]*/)(.*)$,\^\2,; s,^(.*\$\s\^)[^/]+/(.*)$,\1\2,; print' | \
1086 apertium-transfer apertium-sh-en.trules-sh-en.xml trules-sh-en.bin sh-en.autobil.bin | \
1087 lt-proc -g sh-en.autogen.bin
1088 I see gramophones
1089 </pre>
1090 <p>Which, while it isn't exactly prize-winning prose (much like
1091 this HOWTO), is a fairly accurate translation.</p>
1092 <h3>So tell me about the record player</h3>
<p>While gramophone is an English word, it isn't the best
translation. Gramophone is typically used for the very old kind,
you know, with the needle instead of the stylus, and no
amplification. A better translation would be 'record player'.
Although this is more than one word, we can treat it as if it were
one word by using multiword (<i>multipalabra</i>)
constructions.</p>
<p>We don't need to touch the Serbo-Croatian dictionary, just the
English one and the bilingual one, so open them up.</p>
1102 <p>The plural of 'record player' is 'record players', so it takes
1103 the same paradigm as gramophone (gramophone__n) — in that we just
1104 add 's'. All we need to do is add a new element to the main
1105 section.</p>
1106 <p><tt>&lt;e lm="record
1107 player"&gt;&lt;i&gt;record&lt;b/&gt;player&lt;/i&gt;&lt;par
1108 n="gramophone__n"/&gt;&lt;/e&gt;</tt></p>
1109 <p>The only thing different about this is the use of the &lt;b/&gt;
1110 tag, although this isn't entirely new as we saw it in use in the
1111 rules file.</p>
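<p>The bilingual dictionary needs a matching change, which is not
shown above. A sketch of what the entry might become, assuming the
existing noun entry has the same shape as the verb entry added
earlier and currently pairs <cite>gramofon</cite> with
<cite>gramophone</cite>:</p>

```xml
<!-- The right-hand side now names the multiword lemma; <b/> stands for the space -->
<e><p><l>gramofon<s n="n"/></l><r>record<b/>player<s n="n"/></r></p></e>
```

<p>After editing it, recompile the bilingual dictionary in both
directions as before.</p>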
1112 <p>So, recompile and test in the orthodox fashion:</p>
1113 <pre>
1114 $ echo "vidim gramofone" | lt-proc sh-en.automorf.bin | \
1115 perl -ne 's,^([^/]*/)(.*)$,\^\2,; s,^(.*\$\s\^)[^/]+/(.*)$,\1\2,; print' | \
1116 apertium-transfer apertium-sh-en.trules-sh-en.xml trules-sh-en.bin sh-en.autobil.bin | \
1117 lt-proc -g sh-en.autogen.bin
1118 I see record players
1119 </pre>
1120 <p>Perfect. A big benefit of using multiwords is that you can
1121 translate idiomatic expressions verbatim, without having to do
1122 word-by-word translation.</p>
1123 </body>
1124 </html>