   1
   2 # This document contains text in Perl "POD" format.
   3 # Use a POD viewer like perldoc or perlman to render it.
   4
   5 # This corrects some typos in the previous release.
   6
   7 =head1 NAME
   8
   9 Locale::Maketext::TPJ13 -- article about software localization
  10
  11 =head1 SYNOPSIS
  12
  13   # This is an article, not a module.
  14
  15 =head1 DESCRIPTION
  16
  17 The following article by Sean M. Burke and Jordan Lachler
  18 first appeared in I<The Perl
  19 Journal> #13 and is copyright 1999 The Perl Journal. It appears
  20 courtesy of Jon Orwant and The Perl Journal.  This document may be
  21 distributed under the same terms as Perl itself.
  22
  23 =head1 Localization and Perl: gettext breaks, Maketext fixes
  24
  25 by Sean M. Burke and Jordan Lachler
  26
  27 This article points out cases where gettext (a common system for
  28 localizing software interfaces -- i.e., making them work in the user's
  29 language of choice) fails because of basic differences between human
  30 languages.  This article then describes Maketext, a new system capable
  31 of correctly treating these differences.
  32
  33 =head2 A Localization Horror Story: It Could Happen To You
  34
  35 =over
  36
  37 "There are a number of languages spoken by human beings in this
  38 world."
  39
  40 -- Harald Tveit Alvestrand, in RFC 1766, "Tags for the
  41 Identification of Languages"
  42
  43 =back
  44
  45 Imagine that your task for the day is to localize a piece of software
  46 -- and luckily for you, the only output the program emits is two
  47 messages, like this:
  48
  49   I scanned 12 directories.
  50
  51   Your query matched 10 files in 4 directories.
  52
  53 So how hard could that be?  You look at the code that
  54 produces the first item, and it reads:
  55
  56   printf("I scanned %g directories.",
  57          $directory_count);
  58
  59 You think about that, and realize that it doesn't even work right for
  60 English, as it can produce this output:
  61
  62   I scanned 1 directories.
  63
  64 So you rewrite it to read:
  65
  66   printf("I scanned %g %s.",
  67          $directory_count,
  68          $directory_count == 1 ?
  69            "directory" : "directories",
  70   );
  71
  72 ...which does the Right Thing.  (In case you don't recall, "%g" is for
  73 locale-specific number interpolation, and "%s" is for string
  74 interpolation.)
  75
  76 But you still have to localize it for all the languages you're
  77 producing this software for, so you pull Locale::gettext off of CPAN
  78 so you can access the C<gettext> C functions you've heard are standard
  79 for localization tasks.
  80
  81 And you write:
  82
  83   printf(gettext("I scanned %g %s."),
  84          $dir_scan_count,
  85          $dir_scan_count == 1 ?
  86            gettext("directory") : gettext("directories"),
  87   );
  88
  89 But you then read in the gettext manual (Drepper, Miller, and Pinard 1995)
  90 that this is not a good idea, since how a single word like "directory"
  91 or "directories" is translated may depend on context -- and this is
  92 true, since in a case language like German or Russian, you may need
  93 these words with a different case ending in the first instance (where the
  94 word is the object of a verb) than in the second instance, which you haven't even
  95 gotten to yet (where the word is the object of a preposition, "in %g
  96 directories") -- assuming these keep the same syntax when translated
  97 into those languages.
  98
  99 So, on the advice of the gettext manual, you rewrite:
 100
 101   printf( $dir_scan_count == 1 ?
 102            gettext("I scanned %g directory.") :
 103            gettext("I scanned %g directories."),
 104          $dir_scan_count );
 105
 106 So, you email your various translators (the boss decides that the
 107 languages du jour are Chinese, Arabic, Russian, and Italian, so you
 108 have one translator for each), asking for translations for "I scanned
 109 %g directory." and "I scanned %g directories.".  When they reply,
 110 you'll put that in the lexicons for gettext to use when it localizes
 111 your software, so that when the user is running under the "zh"
 112 (Chinese) locale, gettext("I scanned %g directory.") will return the
 113 appropriate Chinese text, with a "%g" in there where printf can then
 114 interpolate $dir_scan_count.
 115
 116 Your Chinese translator emails right back -- he says both of these
 117 phrases translate to the same thing in Chinese, because, in linguistic
 118 jargon, Chinese "doesn't have number as a grammatical category" --
 119 whereas English does.  That is, English has grammatical rules that
 120 refer to "number", i.e., whether something is grammatically singular
 121 or plural; and one of these rules is the one that forces nouns to take
 122 a plural suffix (generally "s") when in a plural context, as they are when
 123 they follow a number other than "one" (including, oddly enough, "zero").
 124 Chinese has no such rules, and so has just the one phrase where English
 125 has two.  But, no problem, you can have this one Chinese phrase appear
 126 as the translation for the two English phrases in the "zh" gettext
 127 lexicon for your program.
 128
 129 Emboldened by this, you dive into the second phrase that your software
 130 needs to output: "Your query matched 10 files in 4 directories.".  You notice
 131 that if you want to treat phrases as indivisible, as the gettext
 132 manual wisely advises, you need four cases now, instead of two, to
 133 cover the permutations of singular and plural on the two items,
 134 $directory_count and $file_count.  So you try this:
 135
 136   printf( $file_count == 1 ?
 137     ( $directory_count == 1 ?
 138      gettext("Your query matched %g file in %g directory.") :
 139      gettext("Your query matched %g file in %g directories.") ) :
 140     ( $directory_count == 1 ?
 141      gettext("Your query matched %g files in %g directory.") :
 142      gettext("Your query matched %g files in %g directories.") ),
 143    $file_count, $directory_count,
 144   );
 145
 146 (The case of "1 file in 2 [or more] directories" could, I suppose,
 147 occur in the case of symlinking or something of the sort.)
 148
 149 It occurs to you that this is not the prettiest code you've ever
 150 written, but this seems the way to go.  You mail off to the
 151 translators asking for translations for these four cases.  The
 152 Chinese guy replies with the one phrase that these all translate to in
 153 Chinese, and that phrase has two "%g"s in it, as it should -- but
 154 there's a problem.  He translates it word-for-word back: "In %g
 155 directories contains %g files match your query."  The %g
 156 slots are in an order reverse to what they are in English.  You wonder
 157 how you'll get gettext to handle that.
 158
 159 But you put it aside for the moment, and optimistically hope that the
 160 other translators won't have this problem, and that their languages
 161 will be better behaved -- i.e., that they will be just like English.
 162
 163 But the Arabic translator is the next to write back.  First off, your
 164 code for "I scanned %g directory." or "I scanned %g directories."
 165 assumes there's only singular or plural.  But, to use linguistic
 166 jargon again, Arabic has grammatical number, like English (but unlike
 167 Chinese), but it's a three-term category: singular, dual, and plural.
 168 In other words, the way you say "directory" depends on whether there's
 169 one directory, or I<two> of them, or I<more than two> of them.  Your
 170 test of C<($directory_count == 1)> no longer does the job.  And it means
 171 that where English's grammatical category of number necessitates
 172 only the two permutations of the first sentence based on "directory
 173 [singular]" and "directories [plural]", Arabic has three -- and,
 174 worse, in the second sentence ("Your query matched %g file in %g
 175 directory."), where English has four, Arabic has nine.  You sense
 176 an unwelcome, exponential trend taking shape.
 177
 178 Your Italian translator emails you back and says that "I scanned 0
 179 directories" (a possible English output of your program) is stilted,
 180 and if you think that's fine English, that's your problem, but that
 181 I<just will not do> in the language of Dante.  He insists that where
 182 $directory_count is 0, your program should produce the Italian text
 183 for "I I<didn't> scan I<any> directories.".  And ditto for "I didn't
 184 match any files in any directories", although he says the last part
 185 about "in any directories" should probably just be left off.
 186
 187 You wonder how you'll get gettext to handle this; to accommodate the
 188 ways Arabic, Chinese, and Italian deal with numbers in just these few
 189 very simple phrases, you need to write code that will ask gettext for
 190 different queries depending on whether the numerical values in
 191 question are 1, 2, more than 2, or in some cases 0, and you still haven't
 192 figured out the problem with the different word order in Chinese.
 193
 194 Then your Russian translator calls on the phone, to I<personally> tell
 195 you the bad news about how really unpleasant your life is about to
 196 become:
 197
 198 Russian, like German or Latin, is an inflectional language; that is, nouns
 199 and adjectives have to take endings that depend on their case
 200 (i.e., nominative, accusative, genitive, etc...) -- which is roughly a matter of
 201 what role they have in the syntax of the sentence --
 202 as well as on the grammatical gender (i.e., masculine, feminine, neuter)
 203 and number (i.e., singular or plural) of the noun, as well as on the
 204 declension class of the noun.  But unlike with most other inflected languages,
 205 putting a number-phrase (like "ten" or "forty-three", or their Arabic
 206 numeral equivalents) in front of a noun in Russian can change the case and
 207 number that noun is, and therefore the endings you have to put on it.
 208
 209 He elaborates:  In "I scanned %g directories", you'd I<expect>
 210 "directories" to be in the accusative case (since it is the direct
 211 object in the sentence) and the plural number,
 212 except where $directory_count is 1, then you'd expect the singular, of
 213 course.  Just like Latin or German.  I<But!>  Where $directory_count %
 214 10 is 1 ("%" for modulo, remember), assuming $directory_count is an
 215 integer, and except where $directory_count % 100 is 11, "directories"
 216 is forced to become grammatically singular, which means it gets the
 217 ending for the accusative singular...  You begin to visualize the code
 218 it'd take to test for the problem so far, I<and still work for Chinese
 219 and Arabic and Italian>, and how many gettext items that'd take, but
 220 he keeps going...  But where $directory_count % 10 is 2, 3, or 4
 221 (except where $directory_count % 100 is 12, 13, or 14), the word for
 222 "directories" is forced to be genitive singular -- which means another
 223 ending... The room begins to spin around you, slowly at first...  But
 224 with I<all other> integer values, since "directory" is an inanimate
 225 noun, when preceded by a number and in the nominative or accusative
 226 cases (as it is here, just your luck!), it does stay plural, but it is
 227 forced into the genitive case -- yet another ending...  And
 228 you never hear him get to the part about how you're going to run into
 229 similar (but maybe subtly different) problems with other Slavic
 230 languages like Polish, because the floor comes up to meet you, and you
 231 fade into unconsciousness.
 232
 233
 234 The above cautionary tale relates how an attempt at localization can
 235 lead from programmer consternation, to program obfuscation, to a need
 236 for sedation.  But careful evaluation shows that your choice of tools
 237 merely needed further consideration.
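
The number rules that the translators describe in the story can each be
written down as one small function.  The following is only a sketch of the
rules as stated above: the function names and the labels they return are
made up for illustration, the Arabic helper follows nothing more than the
three-way split mentioned in the story, and the Russian helper covers only
the case discussed there (an inanimate noun as direct object).

```perl
use strict;
use warnings;

# Hypothetical helpers, one per language in the story.  Each takes a
# non-negative integer and returns a label for the noun form required.

sub noun_form_zh { return 'only form' }    # no grammatical number

sub noun_form_en { return $_[0] == 1 ? 'singular' : 'plural' }

sub noun_form_ar {                         # singular, dual, plural
    my $n = shift;
    return $n == 1 ? 'singular' : $n == 2 ? 'dual' : 'plural';
}

sub noun_form_ru {                         # direct object, inanimate noun
    my $n = shift;
    my ($mod10, $mod100) = ($n % 10, $n % 100);
    return 'accusative singular' if $mod10 == 1 && $mod100 != 11;
    return 'genitive singular'
        if $mod10 >= 2 && $mod10 <= 4 && !($mod100 >= 12 && $mod100 <= 14);
    return 'genitive plural';
}

for my $n (0, 1, 2, 5, 11, 21, 22, 112) {
    printf "%3d  zh=%s  en=%s  ar=%s  ru=%s\n", $n,
        noun_form_zh($n), noun_form_en($n),
        noun_form_ar($n), noun_form_ru($n);
}
```

With two quantified nouns in one sentence, a language with k noun forms
needs k * k prewritten sentences under the one-string-per-case approach:
4 for English and 9 for Arabic, as the story notes.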
 238
 239 =head2 The Linguistic View
 240
 241 =over
 242
 243 "It is more complicated than you think."
 244
 245 -- The Eighth Networking Truth, from RFC 1925
 246
 247 =back
 248
 249 The field of Linguistics has expended a great deal of effort over the
 250 past century trying to find grammatical patterns which hold across
 251 languages; it's been a constant process
 252 of people making generalizations that should apply to all languages,
 253 only to find out that, all too often, these generalizations fail --
 254 sometimes failing for just a few languages, sometimes whole classes of
 255 languages, and sometimes nearly every language in the world except
 256 English.  Broad statistical trends are evident in what the "average
 257 language" is like as far as what its rules can look like, must look
 258 like, and cannot look like.  But the "average language" is just as
 259 unreal a concept as the "average person" -- it runs up against the
 260 fact that no language (or person) is, in fact, average.  The wisdom of past
 261 experience leads us to believe that any given language can do whatever
 262 it wants, in any order, with appeal to any kind of grammatical
 263 categories it wants -- case, number, tense, real or metaphoric
 264 characteristics of the things that words refer to, arbitrary or
 265 predictable classifications of words based on what endings or prefixes
 266 they can take, degree or means of certainty about the truth of
 267 statements expressed, and so on, ad infinitum.
 268
 269 Mercifully, most localization tasks are a matter of finding ways to
 270 translate whole phrases, generally sentences, where the context is
 271 relatively set, and where the only variation in content is I<usually>
 272 in a number being expressed -- as in the example sentences above.
 273 Translating specific, fully-formed sentences is, in practice, fairly
 274 foolproof -- which is good, because that's what's in the phrasebooks
 275 that so many tourists rely on.  Now, a given phrase (whether in a
 276 phrasebook or in a gettext lexicon) in one language I<might> have a
 277 greater or lesser applicability than that phrase's translation into
 278 another language -- for example, strictly speaking, in Arabic, the
 279 "your" in "Your query matched..." would take a different form
 280 depending on whether the user is male or female; so the Arabic
 281 translation "your[feminine] query" is applicable in fewer cases than
 282 the corresponding English phrase, which doesn't distinguish the user's
 283 gender.  (In practice, it's not feasible to have a program know the
 284 user's gender, so the masculine "you" in Arabic is usually used, by
 285 default.)
 286
 287 But in general, such surprises are rare when entire sentences are
 288 being translated, especially when the functional context is restricted
 289 to that of a computer interacting with a user either to convey a fact
 290 or to prompt for a piece of information.  So, for purposes of
 291 localization, translation by phrase (generally by sentence) is both the
 292 simplest and the least problematic.
 293
 294 =head2 Breaking gettext
 295
 296 =over
 297
 298 "It Has To Work."
 299
 300 -- First Networking Truth, RFC 1925
 301
 302 =back
 303
 304 Consider that sentences in a tourist phrasebook are of two types: ones
 305 like "How do I get to the marketplace?" that don't have any blanks to
 306 fill in, and ones like "How much do these ___ cost?", where there's
 307 one or more blanks to fill in (and these are usually linked to a
 308 list of words that you can put in that blank: "fish", "potatoes",
 309 "tomatoes", etc.)  The ones with no blanks are no problem, but the
 310 fill-in-the-blank ones may not be really straightforward. If it's a
 311 Swahili phrasebook, for example, the authors probably didn't bother to
 312 tell you the complicated ways that the verb "cost" changes its
 313 inflectional prefix depending on the noun you're putting in the blank.
 314 The trader in the marketplace will still understand what you're saying if
 315 you say "how much do these potatoes cost?" with the wrong
 316 inflectional prefix on "cost".  After all, I<you> can't speak proper Swahili,
 317 I<you're> just a tourist.  But while tourists can be stupid, computers
 318 are supposed to be smart; the computer should be able to fill in the
 319 blank, and still have the results be grammatical.
 320
 321 In other words, a phrasebook entry takes some values as parameters
 322 (the things that you fill in the blank or blanks), and provides a value
 323 based on these parameters, where the way you get that final value from
 324 the given values can, properly speaking, involve an arbitrarily
 325 complex series of operations.  (In the case of Chinese, it'd be not at
 326 all complex, at least in cases like the examples at the beginning of
 327 this article; whereas in the case of Russian it'd be a rather complex
 328 series of operations.  And in some languages, the
 329 complexity could be spread around differently: while the act of
 330 putting a number-expression in front of a noun phrase might not be
 331 complex by itself, it may change how you have to, for example, inflect
 332 a verb elsewhere in the sentence.  This is what in syntax is called
 333 "long-distance dependencies".)
 334
 335 This talk of parameters and arbitrary complexity is just another way
 336 to say that an entry in a phrasebook is what in a programming language
 337 would be called a "function".  Just so you don't miss it, this is the
 338 crux of this article: I<A phrase is a function; a phrasebook is a
 339 bunch of functions.>
 340
 341 The reason that using gettext runs into walls (as in the above
 342 second-person horror story) is that you're trying to use a string (or
 343 worse, a choice among a bunch of strings) to do what you really need a
 344 function for -- which is futile.  Performing (s)printf interpolation
 345 on the strings which you get back from gettext does allow you to do I<some>
 346 common things passably well... sometimes... sort of; but, to paraphrase
 347 what some people say about C<csh> script programming, "it fools you
 348 into thinking you can use it for real things, but you can't, and you
 349 don't discover this until you've already spent too much time trying,
 350 and by then it's too late."
 351
 352 =head2 Replacing gettext
 353
 354 So, what needs to replace gettext is a system that supports lexicons
 355 of functions instead of lexicons of strings.  An entry in a lexicon
 356 from such a system should I<not> look like this:
 357
 358   "J'ai trouv\xE9 %g fichiers dans %g r\xE9pertoires"
 359
 360 [\xE9 is e-acute in Latin-1.  Some pod renderers would
 361 scream if I used the actual character here. -- SB]
 362
 363 but instead like this, bearing in mind that this is just a first stab:
 364
 365   sub I_found_X1_files_in_X2_directories {
 366     my( $files, $dirs ) = @_[0,1];
 367     $files = sprintf("%g %s", $files,
 368       $files == 1 ? 'fichier' : 'fichiers');
 369     $dirs = sprintf("%g %s", $dirs,
 370       $dirs == 1 ? "r\xE9pertoire" : "r\xE9pertoires");
 371     return "J'ai trouv\xE9 $files dans $dirs.";
 372   }
 373
 374 Now, there's no particularly obvious way to store anything but strings
 375 in a gettext lexicon; so it looks like we just have to start over and
 376 make something better, from scratch.  I call my shot at a
 377 gettext-replacement system "Maketext", or, in CPAN terms,
 378 Locale::Maketext.
 379
 380 When designing Maketext, I chose to plan its main features in terms of
 381 "buzzword compliance".  And here are the buzzwords:
 382
 383 =head2 Buzzwords: Abstraction and Encapsulation
 384
 385 The complexity of the language you're trying to output a phrase in is
 386 entirely abstracted inside (and encapsulated within) the Maketext module
 387 for that interface.  When you call:
 388
 389   print $lang->maketext("You have [quant,_1,piece] of new mail.",
 390                        scalar(@messages));
 391
 392 you don't know (and in fact can't easily find out) whether this will
 393 involve lots of figuring, as in Russian (if $lang is a handle to the
 394 Russian module), or relatively little, as in Chinese.  That kind of
 395 abstraction and encapsulation may encourage other pleasant buzzwords
 396 like modularization and stratification, depending on what design
 397 decisions you make.
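
For concreteness, here is a minimal sketch of what can sit behind a call
like the one above, using the Locale::Maketext module as distributed with
Perl.  The class names C<MyApp::L10N> and C<MyApp::L10N::en> are
placeholders chosen for this example, and a real project would normally
keep each language class in its own file.

```perl
use strict;
use warnings;

# Project base class (placeholder name); all language classes inherit from it.
package MyApp::L10N;
use Locale::Maketext;
our @ISA = ('Locale::Maketext');

# English class.  _AUTO makes any unknown key serve as its own entry,
# which is convenient for the language the program was written in.
package MyApp::L10N::en;
our @ISA = ('MyApp::L10N');
our %Lexicon = ( _AUTO => 1 );

package main;

my $lang = MyApp::L10N->get_handle('en')
    or die "Could not get a language handle";

my @messages = ('first', 'second', 'third');
print $lang->maketext("You have [quant,_1,piece] of new mail.",
                      scalar(@messages)), "\n";
```

The caller only ever sees C<get_handle> and C<maketext>; how the noun is
quantified is decided by whatever C<quant> method the language class
provides or inherits.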
 398
 399 =head2 Buzzword: Isomorphism
 400
 401 "Isomorphism" means "having the same structure or form"; in discussions
 402 of program design, the word takes on the special, specific meaning that
 403 your implementation of a solution to a problem I<has the same
 404 structure> as, say, an informal verbal description of the solution, or
 405 maybe of the problem itself.  Isomorphism is, all things considered,
 406 a good thing -- it's what problem-solving (and solution-implementing)
 407 should look like.
 408
 409 What's wrong with gettext-using code like this...
 410
 411   printf( $file_count == 1 ?
 412     ( $directory_count == 1 ?
 413      "Your query matched %g file in %g directory." :
 414      "Your query matched %g file in %g directories." ) :
 415     ( $directory_count == 1 ?
 416      "Your query matched %g files in %g directory." :
 417      "Your query matched %g files in %g directories." ),
 418    $file_count, $directory_count,
 419   );
 420
 421 is first off that it's not well abstracted -- these ways of testing
 422 for grammatical number (as in the expressions like C<foo == 1 ?
 423 singular_form : plural_form>) should be abstracted to each language
 424 module, since how you get grammatical number is language-specific.
 425
 426 But second off, it's not isomorphic -- the "solution" (i.e., the
 427 phrasebook entries) for Chinese maps from these four English phrases to
 428 the one Chinese phrase that fits for all of them.  In other words, the
 429 informal solution would be "The way to say what you want in Chinese is
 430 with the one phrase 'For your question, in Y directories you would
 431 find X files'" -- and so the implemented solution should be,
 432 isomorphically, just a straightforward way to spit out that one
 433 phrase, with numerals properly interpolated.  It shouldn't have to map
 434 from the complexity of other languages to the simplicity of this one.
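
As a sketch of how direct that solution can be, the following uses
Locale::Maketext's bracket notation, in which C<[_1]> and C<[_2]> stand for
the first and second arguments and may appear in any order.  The class
names are placeholders, and the "Chinese" entry is the English gloss from
above standing in for real Chinese text.

```perl
use strict;
use warnings;

package Demo::L10N;
use Locale::Maketext;
our @ISA = ('Locale::Maketext');

package Demo::L10N::en;
our @ISA = ('Demo::L10N');
our %Lexicon = ( _AUTO => 1 );      # English keys are their own entries

package Demo::L10N::zh;
our @ISA = ('Demo::L10N');
our %Lexicon = (
    # One phrase, no number distinctions, arguments in reverse order.
    'Your query matched [quant,_1,file] in [quant,_2,directory,directories].'
        => 'For your question, in [_2] directories you would find [_1] files.',
);

package main;

my $key =
    'Your query matched [quant,_1,file] in [quant,_2,directory,directories].';

for my $tag ('en', 'zh') {
    my $lang = Demo::L10N->get_handle($tag)
        or die "No language handle for $tag";
    print $lang->maketext($key, 10, 4), "\n";
}
```

The calling code is the same for both languages; only the lexicon entry
differs, and the entry for the simpler language is itself simpler.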
 435
 436 =head2 Buzzword: Inheritance
 437
 438 There's a great deal of reuse possible for sharing of phrases between
 439 modules for related dialects, or for sharing of auxiliary functions
 440 between related languages.  (By "auxiliary functions", I mean
 441 functions that don't produce phrase-text, but which, say, return an
 442 answer to "does this number require a plural noun after it?".  Such
 443 auxiliary functions would be used in the internal logic of functions
 444 that actually do produce phrase-text.)
 445
 446 In the case of sharing phrases, consider that you have an interface
 447 already localized for American English (probably by having been
 448 written with that as the native locale, but that's incidental).
 449 Localizing it for UK English should, in practical terms, be just a
 450 matter of running it past a British person with the instructions to
 451 indicate what few phrases would benefit from a change in spelling or
 452 possibly minor rewording.  In that case, you should be able to put in
 453 the UK English localization module I<only> those phrases that are
 454 UK-specific, and for all the rest, I<inherit> from the American
 455 English module.  (And I expect this same situation would apply with
 456 Brazilian and Continental Portuguese, possibly with some I<very>
 457 closely related languages like Czech and Slovak, and possibly with the
 458 slightly different "versions" of written Mandarin Chinese, as I hear exist in
 459 Taiwan and mainland China.)
 460
 461 As to sharing of auxiliary functions, consider the problem of Russian
 462 numbers from the beginning of this article; obviously, you'd want to
 463 write only once the hairy code that, given a numeric value, would
 464 return some specification of which case and number a given quantified
 465 noun should use.  But suppose that you discover, while localizing an
 466 interface for, say, Ukrainian (a Slavic language related to Russian,
 467 spoken by several million people, many of whom would be relieved to
 468 find that your Web site's or software's interface is available in
 469 their language), that the rules in Ukrainian are the same as in Russian
 470 for quantification, and probably for many other grammatical functions.
 471 While there may well be no phrases in common between Russian and
 472 Ukrainian, you could still choose to have the Ukrainian module inherit
 473 from the Russian module, just for the sake of inheriting all the
 474 various grammatical methods.  Or, probably better organizationally,
 475 you could move those functions to a module called C<_E_Slavic> or
 476 something, which Russian and Ukrainian could inherit useful functions
 477 from, but which would (presumably) provide no lexicon.
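
The phrase-sharing case might be sketched like this with Locale::Maketext,
which looks up a key in the handle's own lexicon first and then in the
lexicons of its parent classes.  The class names and phrases here are
invented for the example.

```perl
use strict;
use warnings;

package Shop::L10N;
use Locale::Maketext;
our @ISA = ('Locale::Maketext');

# American English: the full set of phrases.
package Shop::L10N::en_us;
our @ISA = ('Shop::L10N');
our %Lexicon = (
    'Pick a color' => 'Pick a color',
    'Cancel order' => 'Cancel order',
);

# UK English: inherits from en_us and lists only what differs.
package Shop::L10N::en_gb;
our @ISA = ('Shop::L10N::en_us');
our %Lexicon = (
    'Pick a color' => 'Pick a colour',
);

package main;

my $gb = Shop::L10N->get_handle('en-GB')
    or die "Could not get a language handle";

print $gb->maketext('Pick a color'), "\n";    # UK-specific entry
print $gb->maketext('Cancel order'), "\n";    # inherited from en_us
```

Auxiliary methods are shared the same way, since they are ordinary Perl
methods found through C<@ISA>.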
 478
 479 =head2 Buzzword: Concision
 480
 481 Okay, concision isn't a buzzword.  But it should be, so I decree that
 482 as a new buzzword, "concision" means that simple common things should
 483 be expressible in very few lines (or maybe even just a few characters)
 484 of code -- call it a special case of "making simple things easy and
 485 hard things possible", and see also the role it played in the
 486 MIDI::Simple language, discussed elsewhere in this issue [TPJ#13].
 487
 488 Consider our first stab at an entry in our "phrasebook of functions":
 489
 490   sub I_found_X1_files_in_X2_directories {
 491     my( $files, $dirs ) = @_[0,1];
 492     $files = sprintf("%g %s", $files,
 493       $files == 1 ? 'fichier' : 'fichiers');
 494     $dirs = sprintf("%g %s", $dirs,
 495       $dirs == 1 ? "r\xE9pertoire" : "r\xE9pertoires");
 496     return "J'ai trouv\xE9 $files dans $dirs.";
 497   }
 498
 499 You may sense that a lexicon (to use a non-committal catch-all term for a
 500 collection of things you know how to say, regardless of whether they're
 501 phrases or words) consisting of functions I<expressed> as above would
 502 make for rather long-winded and repetitive code -- even if you wisely
 503 rewrote this to have quantification (as we call adding a number
 504 expression to a noun phrase) be a function called like:
 505
 506   sub I_found_X1_files_in_X2_directories {
 507     my( $files, $dirs ) = @_[0,1];
 508     $files = quant($files, "fichier");
 509     $dirs =  quant($dirs,  "r\xE9pertoire");
 510     return "J'ai trouv\xE9 $files dans $dirs.";
 511   }
 512
 513 And you may also sense that you do not want to bother your translators
 514 with having to write Perl code -- you'd much rather that they spend
 515 their I<very costly time> on just translation.  And this is to say
 516 nothing of the near impossibility of finding a commercial translator
 517 who would know even simple Perl.
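
The C<quant> helper called in the rewritten function above is never
spelled out.  One minimal way it might look, assuming the same "exactly 1
is singular" test as the first stab and a plain "s" plural (irregular
plurals are deliberately ignored), is:

```perl
use strict;
use warnings;

# Hypothetical helper: join a number to a noun, adding "s" to the noun
# for any count other than exactly 1.
sub quant {
    my ($count, $noun) = @_;
    return sprintf '%g %s', $count, ($count == 1 ? $noun : $noun . 's');
}

sub I_found_X1_files_in_X2_directories {
    my ($files, $dirs) = @_[0, 1];
    $files = quant($files, 'fichier');
    $dirs  = quant($dirs,  "r\xE9pertoire");
    return "J'ai trouv\xE9 $files dans $dirs.";
}

# Encode the Latin-1 e-acute properly on output.
binmode STDOUT, ':encoding(UTF-8)';
print I_found_X1_files_in_X2_directories(10, 4), "\n";
print I_found_X1_files_in_X2_directories(1, 1), "\n";
```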
 518
 519 In a first-hack implementation of Maketext, each language-module's
 520 lexicon looked like this:
 521
 522  %Lexicon = (
 523    "I found %g files in %g directories"
 524    => sub {
 525       my( $files, $dirs ) = @_[0,1];
 526       $files = quant($files, "fichier");
 527       $dirs =  quant($dirs,  "r\xE9pertoire");
 528       return "J'ai trouv\xE9 $files dans $dirs.";
 529     },
 530   ... and so on with other phrase => sub mappings ...
 531  );
 532
 533 but I immediately went looking for some more concise way to basically
 534 denote the same phrase-function -- a way that would also serve to
 535 concisely denote I<most> phrase-functions in the lexicon for I<most>
 536 languages.  After much time and even some actual thought, I decided on
 537 this system:
 538
 539 * Where a value in a %Lexicon hash is a contentful string instead of
 540 an anonymous sub (or, conceivably, a coderef), it would be interpreted
 541 as a sort of shorthand expression of what the sub does.  When accessed
 542 for the first time in a session, it is parsed, turned into Perl code,
 543 and then eval'd into an anonymous sub; then that sub replaces the
 544 original string in that lexicon.  (That way, the work of parsing and
 545 evaling the shorthand form for a given phrase is done no more than
 546 once per session.)
 547
 548 * Calls to C<maketext> (as Maketext's main function is called) happen
 549 thru a "language session handle", notionally very much like an IO
 550 handle, in that you open one at the start of the session, and use it
 551 for "sending signals" to an object in order to have it return the text
 552 you want.
 553
 554 So, this:
 555
 556   $lang->maketext("You have [quant,_1,piece] of new mail.",
 557                  scalar(@messages));
 558
 559 basically means this: look in the lexicon for $lang (which may inherit
 560 from any number of other lexicons), and find the function that we
 561 happen to associate with the string "You have [quant,_1,piece] of new
 562 mail" (which is, and should be, a functioning "shorthand" for this
 563 function in the native locale -- English in this case).  If you find
 564 such a function, call it with $lang as its first parameter (as if it
 565 were a method), and then a copy of scalar(@messages) as its second,
 566 and then return that value.  If that function was found, but was in
 567 string shorthand instead of being a fully specified function, parse it
 568 and make it into a function before calling it the first time.
 569
 570 * The shorthand uses code in brackets to indicate method calls that
 571 should be performed.  A full explanation is not in order here, but a
 572 few examples will suffice:
 573
 574   "You have [quant,_1,piece] of new mail."
 575
 576 The above code is shorthand for, and will be interpreted as,
 577 this:
 578
 579   sub {
 580     my $handle = $_[0];
 581     my(@params) = @_;
 582     return join '',
 583       "You have ",
 584       $handle->quant($params[1], 'piece'),
 585       " of new mail.";
 586   }
 587
 588 where "quant" is the name of a method you're using to quantify the
 589 noun "piece" with the number $params[1].
 590
 591 A string with no brackety calls, like this:
 592
 593   "Your search expression was malformed."
 594
 595 is somewhat of a degenerate case, and just gets turned into:
 596
 597   sub { return "Your search expression was malformed." }
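
The scheme described above (shorthand strings in the lexicon, turned into
subs on first use and then stored back) can be imitated in a few dozen
lines.  This is a toy, not Maketext's actual implementation: the package
name C<Toy::Handle> is made up, the parser handles only the simplest
bracket groups, and it builds closures directly rather than generating
Perl source and eval'ing it.

```perl
use strict;
use warnings;

package Toy::Handle;

our %Lexicon = (
    'You have [quant,_1,piece] of new mail.'
        => 'You have [quant,_1,piece] of new mail.',
    'Your search expression was malformed.'
        => 'Your search expression was malformed.',
);

sub new { return bless {}, shift }

# Naive English quantifier, just enough for the demonstration.
sub quant {
    my ($self, $count, $noun) = @_;
    return "$count " . ($count == 1 ? $noun : $noun . 's');
}

# Turn a shorthand string into an anonymous sub.
sub compile {
    my ($text) = @_;
    my @pieces;
    for my $chunk (grep { length } split /(\[[^\]]*\])/, $text) {
        if ($chunk =~ /^\[(.*)\]$/) {
            my ($method, @args) = split /,/, $1;
            push @pieces, sub {
                my ($handle, @params) = @_;
                # "_N" stands for the Nth argument passed to maketext
                my @real = map {
                    my $arg = $_;
                    $arg =~ /^_(\d+)$/ ? $params[$1 - 1] : $arg;
                } @args;
                return $handle->$method(@real);
            };
        }
        else {
            push @pieces, sub { return $chunk };    # literal text
        }
    }
    return sub { my @in = @_; return join '', map { $_->(@in) } @pieces };
}

sub maketext {
    my ($self, $key, @params) = @_;
    my $entry = $Lexicon{$key};
    die "No lexicon entry for '$key'\n" unless defined $entry;
    # Compile on first use, and store the sub back into the lexicon.
    $entry = $Lexicon{$key} = compile($entry) unless ref $entry;
    return $entry->($self, @params);
}

package main;

my $lang = Toy::Handle->new;
print $lang->maketext('You have [quant,_1,piece] of new mail.', 3), "\n";
print $lang->maketext('Your search expression was malformed.'), "\n";
```

After the first call for a given key, the lexicon holds a code reference
in place of the string, so the parsing cost is paid once per session.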
 598
 599 However, not everything you can write in Perl code can be written in
 600 the above shorthand system -- not by a long shot.  For example, consider
 601 the Italian translator from the beginning of this article, who wanted
 602 the Italian for "I didn't find any files" as a special case, instead
 603 of "I found 0 files".  That couldn't be specified (at least not easily
 604 or simply) in our shorthand system, and it would have to be written
 605 out in full, like this:
 606
 607   sub {  # pretend the English strings are in Italian
 608     my($handle, $files, $dirs) = @_[0,1,2];
 609     return "I didn't find any files" unless $files;
 610     return join '',
 611       "I found ",
 612       $handle->quant($files, 'file'),
 613       " in ",
 614       $handle->quant($dirs,  'directory'),
 615       ".";
 616   }
 617
 618 Next to a lexicon full of shorthand code, that sort of sticks out like a
 619 sore thumb -- but this I<is> a special case, after all; and at least
 620 it's possible, if not as concise as usual.
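
In Locale::Maketext as distributed with Perl, such a written-out sub can
sit in the same lexicon as shorthand strings, since a lexicon value that is
a code reference is called with the handle and the arguments.  The sketch
below assumes placeholder class names and, as above, uses English text
where the Italian would go.

```perl
use strict;
use warnings;

package Finder::L10N;
use Locale::Maketext;
our @ISA = ('Locale::Maketext');

package Finder::L10N::it;
our @ISA = ('Finder::L10N');
our %Lexicon = (
    # Ordinary entry, in shorthand.
    'Your search expression was malformed.'
        => 'Your search expression was malformed.',

    # Special case, written out in full.
    'I found [quant,_1,file] in [quant,_2,directory,directories].' => sub {
        my ($handle, $files, $dirs) = @_;
        return "I didn't find any files." unless $files;
        return join '',
            'I found ',
            $handle->quant($files, 'file'),
            ' in ',
            $handle->quant($dirs, 'directory', 'directories'),
            '.';
    },
);

package main;

my $lang = Finder::L10N->get_handle('it')
    or die "Could not get a language handle";
my $key = 'I found [quant,_1,file] in [quant,_2,directory,directories].';

print $lang->maketext($key, 0, 0), "\n";
print $lang->maketext($key, 10, 4), "\n";
```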
 621
 622 As to how you'd implement the Russian example from the beginning of
 623 the article, well, There's More Than One Way To Do It, but it could be
 624 something like this (using English words for Russian, just so you know
 625 what's going on):
 626
 627   "I [quant,_1,directory,accusative] scanned."
 628
 629 This shifts the burden of complexity off to the quant method.  That
 630 method's parameters are: the numeric value it's going to use to
 631 quantify something; the Russian word it's going to quantify; and the
 632 parameter "accusative", which you're using to mean that this
 633 sentence's syntax wants a noun in the accusative case there, although
 634 the quantification method may have to overrule that, for grammatical
 635 reasons you may recall from the beginning of this article.
 636
 637 Now, the Russian quant method here is responsible not only for
 638 implementing the strange logic necessary for figuring out how Russian
 639 number-phrases impose case and number on their noun-phrases, but also
 640 for inflecting the Russian word for "directory".  How that inflection
 641 is to be carried out is no small issue, and among the solutions I've
 642 seen, some (like variations on a simple lookup in a hash where all
 643 possible forms are provided for all necessary words) are
 644 straightforward but I<can> become cumbersome when you need to inflect
 645 more than a few dozen words; and other solutions (like using
 646 algorithms to model the inflections, storing only root forms and
 647 irregularities) I<can> involve more overhead than is justifiable for
 648 all but the largest lexicons.
 649
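Here is a rough sketch of the first kind of solution, a plain hash
lookup, wired into a quant method.  Everything in it is an assumption
made for illustration: the class names, the key "scanned", the %Forms
table, the placeholder strings standing in for real Russian forms, and
the simplified rule (numbers ending in 1 but not 11 take the requested
case in the singular; numbers ending in 2 through 4 but not 12 through
14 take the genitive singular; everything else takes the genitive
plural), which ignores fractions, negative numbers, and animate nouns:

  use strict;
  use warnings;
  use Locale::Maketext;

  {
    package BogoQuery::Locale;        # hypothetical project base class
    our @ISA = ('Locale::Maketext');
  }
  {
    package BogoQuery::Locale::ru;    # hypothetical Russian class
    our @ISA = ('BogoQuery::Locale');

    # Placeholder forms; a real table would hold the Russian words.
    my %Forms = (
      directory => {
        'accusative singular' => 'directory(acc.sg)',
        'genitive singular'   => 'directory(gen.sg)',
        'genitive plural'     => 'directory(gen.pl)',
      },
    );

    our %Lexicon = (
      'scanned' => 'I [quant,_1,directory,accusative] scanned.',
    );

    # Overrides the inherited quant: picks a form by number and case.
    sub quant {
      my($handle, $num, $word, $case) = @_;
      my($ones, $tens) = ($num % 10, $num % 100);
      my $form;
      if ($ones == 1 && $tens != 11) {
        $form = "$case singular";     # the syntax gets its way
      }
      elsif ($ones >= 2 && $ones <= 4 && ($tens < 12 || $tens > 14)) {
        $form = 'genitive singular';  # the number overrules it
      }
      else {
        $form = 'genitive plural';
      }
      return "$num $Forms{$word}{$form}";
    }
  }

  my $lh = BogoQuery::Locale->get_handle('ru')
    or die "Can't get a language handle";
  print $lh->maketext('scanned', $_), "\n" for 1, 3, 5, 12, 21;

Every new noun means another entry in %Forms, which is exactly where
this approach starts to get cumbersome.
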
 650 Mercifully, this design decision becomes crucial only in the hairiest
 651 of inflected languages, of which Russian is by no means the I<worst> case
 652 scenario, but is worse than most.  Most languages have simpler
 653 inflection systems; for example, in English or Swahili, there are
 654 generally no more than two possible inflected forms for a given noun
 655 ("error/errors"; "kosa/makosa"), and the
 656 rules for producing these forms are fairly simple -- or at least,
 657 simple rules can be formulated that work for most words, and you can
 658 then treat the exceptions as just "irregular", at least relative to
 659 your ad hoc rules.  A simpler inflection system (simpler rules, fewer
 660 forms) means that design decisions are less crucial to maintaining
 661 sanity, whereas the same decisions could incur
 662 overhead-versus-scalability problems in languages like Russian.  It
 663 may I<also> be the case that code (possibly in Perl, as with
 664 Lingua::EN::Inflect, for English nouns) has already
 665 been written for the language in question, whether simple or complex.
 666
 667 Moreover, a third possibility may be even simpler than anything
 668 discussed above: just require that all possible (or at least
 669 applicable) forms be provided in the call to the given language's quant
 670 method, as in:
 671
 672   "I found [quant,_1,file,files]."
 673
 674 That way, quant just has to choose which form it needs, without having
 675 to look up or generate anything.  While possibly not optimal for
 676 Russian, this should work well for most other languages, where
 677 quantification is not as complicated an operation.
 678
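As a minimal sketch of this third approach, the quant method that
Locale::Maketext itself provides can be used in just this way.  The
class names and the key "found" are illustrative assumptions:

  use strict;
  use warnings;
  use Locale::Maketext;

  {
    package BogoQuery::Locale;        # hypothetical project base class
    our @ISA = ('Locale::Maketext');
  }
  {
    package BogoQuery::Locale::en;    # hypothetical language class
    our @ISA = ('BogoQuery::Locale');
    our %Lexicon = (
      # Both forms are spelled out; quant only has to pick one.
      'found' => 'I found [quant,_1,file,files].',
    );
  }

  my $lh = BogoQuery::Locale->get_handle('en')
    or die "Can't get a language handle";
  print $lh->maketext('found', $_), "\n" for 0, 1, 2;
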
 679 =head2 The Devil in the Details
 680
 681 There's plenty more to Maketext than described above -- for example,
 682 there are the details of how language tags ("en-US", "i-pwn", "fi",
 683 etc.) or locale IDs ("en_US") interact with actual module naming
 684 ("BogoQuery/Locale/en_us.pm"), and what magic can ensue; there are the
 685 details of how to record (and possibly negotiate) what character
 686 encoding Maketext will return text in (UTF-8? Latin-1? KOI8?).  There's
 687 the interesting fact that Maketext is for localization, but doesn't
 688 actually have a "C<use locale;>" anywhere in it.  For the curious,
 689 there are the somewhat frightening details of how I actually
 690 implement something like data inheritance so that searches across
 691 modules' %Lexicon hashes can parallel how Perl implements method
 692 inheritance.
 693
 694 And, most importantly, there are all the practical details of how to
 695 actually go about deriving from Maketext so you can use it for your
 696 interfaces, and the various tools and conventions for starting out and
 697 maintaining individual language modules.
 698
 699 That is all covered in the documentation for Locale::Maketext and the
 700 modules that come with it, available on CPAN.  Once you have read this
 701 article, which covers the whys of Maketext, the documentation,
 702 which covers the hows of it, should be quite straightforward.
 703
 704 =head2 The Proof in the Pudding: Localizing Web Sites
 705
 706 Maketext and gettext have a notable difference: gettext is in C,
 707 accessible through C library calls, whereas Maketext is in Perl, and
 708 really can't work without a Perl interpreter (although I suppose
 709 something like it could be written for C).  Accidents of history (and
 710 not necessarily lucky ones) have made C++ the most common language for
 711 the implementation of applications like word processors, Web browsers,
 712 and even many in-house applications like custom query systems.  Current
 713 conditions make it somewhat unlikely that the next one of any of these
 714 kinds of applications will be written in Perl, albeit clearly more for
 715 reasons of custom and inertia than out of consideration of what is the
 716 right tool for the job.
 717
 718 However, other accidents of history have made Perl a well-accepted
 719 language for design of server-side programs (generally in CGI form)
 720 for Web site interfaces.  Localization of static pages in Web sites is
 721 trivial, feasible either with simple language-negotiation features in
 722 servers like Apache, or with some kind of server-side inclusions of
 723 language-appropriate text into layout templates.  However, I think
 724 that the localization of Perl-based search systems (or other kinds of
 725 dynamic content) in Web sites, be they public or access-restricted,
 726 is where Maketext will see the greatest use.
 727
 728 I presume that it would be only the exceptional Web site that gets
 729 localized for English I<and> Chinese I<and> Italian I<and> Arabic
 730 I<and> Russian, to recall the languages from the beginning of this
 731 article -- to say nothing of German, Spanish, French, Japanese,
 732 Finnish, and Hindi, to name a few languages that benefit from large
 733 numbers of programmers or Web viewers or both.
 734
 735 However, the ever-increasing internationalization of the Web (whether
 736 measured in terms of amount of content, of numbers of content writers
 737 or programmers, or of size of content audiences) makes it increasingly
 738 likely that the interface to the average Web-based dynamic content
 739 service will be localized for two or maybe three languages.  It is my
 740 hope that Maketext will make that task as simple as possible, and will
 741 remove previous barriers to localization for languages dissimilar to
 742 English.
 743
 744  __END__
 745
 746 Sean M. Burke (sburkeE<64>cpan.org) has a Master's in linguistics
 747 from Northwestern University; he specializes in language technology.
 748 Jordan Lachler (lachlerE<64>unm.edu) is a PhD student in the Department of
 749 Linguistics at the University of New Mexico; he specializes in
 750 morphology and pedagogy of North American native languages.
 751
 752 =head2 References
 753
 754 Alvestrand, Harald Tveit.  1995.  I<RFC 1766: Tags for the
 755 Identification of Languages.>
 756 C<ftp://ftp.isi.edu/in-notes/rfc1766.txt>
 757 [Now see RFC 3066.]
 758
 759 Callon, Ross, editor.  1996.  I<RFC 1925: The Twelve
 760 Networking Truths.>
 761 C<ftp://ftp.isi.edu/in-notes/rfc1925.txt>
 762
 763 Drepper, Ulrich, Peter Miller,
 764 and FranE<ccedil>ois Pinard.  1995-2001.  GNU
 765 C<gettext>.  Available in C<ftp://prep.ai.mit.edu/pub/gnu/>, with
 766 extensive docs in the distribution tarball.  [Since
 767 I wrote this article in 1998, the gettext docs have
 768 started trying harder to come to terms with
 769 plurality.  Whether useful conclusions have come from it
 770 is another question altogether. -- SMB, May 2001]
 771
 772 Forbes, Nevill.  1964.  I<Russian Grammar.>  Third Edition, revised
 773 by J. C. Dumbreck.  Oxford University Press.
 774
 775 =cut
 776
 777 #End
 778