mingw/html/lib/Unicode/UCD.html

   1 <?xml version="1.0" ?>
   2 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
   3 <html xmlns="http://www.w3.org/1999/xhtml">
   4 <head>
   5 <title>Unicode::UCD - Unicode character database</title>
   6 <meta http-equiv="content-type" content="text/html; charset=utf-8" />
   7 <link rev="made" href="mailto:" />
   8 </head>
   9
  10 <body style="background-color: white">
  11 <table border="0" width="100%" cellspacing="0" cellpadding="3">
  12 <tr><td class="block" style="background-color: #cccccc" valign="middle">
  13 <big><strong><span class="block">&nbsp;Unicode::UCD - Unicode character database</span></strong></big>
  14 </td></tr>
  15 </table>
  16
  17 <p><a name="__index__"></a></p>
  18 <!-- INDEX BEGIN -->
  19
  20 <ul>
  21
  22         <li><a href="#name">NAME</a></li>
  23         <li><a href="#synopsis">SYNOPSIS</a></li>
  24         <li><a href="#description">DESCRIPTION</a></li>
  25         <ul>
  26
  27                 <li><a href="#charinfo">charinfo</a></li>
  28                 <li><a href="#charblock">charblock</a></li>
  29                 <li><a href="#charscript">charscript</a></li>
  30                 <li><a href="#charblocks">charblocks</a></li>
  31                 <li><a href="#charscripts">charscripts</a></li>
  32                 <li><a href="#blocks_versus_scripts">Blocks versus Scripts</a></li>
  33                 <li><a href="#matching_scripts_and_blocks">Matching Scripts and Blocks</a></li>
  34                 <li><a href="#code_point_arguments">Code Point Arguments</a></li>
  35                 <li><a href="#charinrange">charinrange</a></li>
  36                 <li><a href="#compexcl">compexcl</a></li>
  37                 <li><a href="#casefold">casefold</a></li>
  38                 <li><a href="#casespec">casespec</a></li>
  39                 <li><a href="#namedseq__"><code>namedseq()</code></a></li>
  40                 <li><a href="#unicode__ucd__unicodeversion">Unicode::UCD::UnicodeVersion</a></li>
  41                 <li><a href="#implementation_note">Implementation Note</a></li>
  42         </ul>
  43
  44         <li><a href="#bugs">BUGS</a></li>
  45         <li><a href="#author">AUTHOR</a></li>
  46 </ul>
  47 <!-- INDEX END -->
  48
  49 <hr />
  50 <p>
  51 </p>
  52 <h1><a name="name">NAME</a></h1>
  53 <p>Unicode::UCD - Unicode character database</p>
  54 <p>
  55 </p>
  56 <hr />
  57 <h1><a name="synopsis">SYNOPSIS</a></h1>
  58 <pre>
  59     use Unicode::UCD 'charinfo';
  60     my $charinfo   = charinfo($codepoint);</pre>
  61 <pre>
  62     use Unicode::UCD 'charblock';
  63     my $charblock  = charblock($codepoint);</pre>
  64 <pre>
  65     use Unicode::UCD 'charscript';
  66     my $charscript = charscript($codepoint);</pre>
  67 <pre>
  68     use Unicode::UCD 'charblocks';
  69     my $charblocks = charblocks();</pre>
  70 <pre>
  71     use Unicode::UCD 'charscripts';
  72     my $charscripts = charscripts();</pre>
  73 <pre>
  74     use Unicode::UCD qw(charscript charinrange);
  75     my $range = charscript($script);
  76     print &quot;looks like $script\n&quot; if charinrange($range, $codepoint);</pre>
  77 <pre>
  78     use Unicode::UCD 'compexcl';
  79     my $compexcl = compexcl($codepoint);</pre>
  80 <pre>
  81     use Unicode::UCD 'namedseq';
  82     my $namedseq = namedseq($named_sequence_name);</pre>
  83 <pre>
  84     my $unicode_version = Unicode::UCD::UnicodeVersion();</pre>
  85 <p>
  86 </p>
  87 <hr />
  88 <h1><a name="description">DESCRIPTION</a></h1>
  89 <p>The Unicode::UCD module offers a simple interface to the Unicode
  90 Character Database.</p>
  91 <p>
  92 </p>
  93 <h2><a name="charinfo">charinfo</a></h2>
  94 <pre>
  95     use Unicode::UCD 'charinfo';</pre>
  96 <pre>
  97     my $charinfo = charinfo(0x41);</pre>
  98 <p><code>charinfo()</code> returns a reference to a hash that has the following fields
  99 as defined by the Unicode standard:</p>
 100 <pre>
 101     key</pre>
 102 <pre>
 103     code             code point with at least four hexdigits
 104     name             name of the character IN UPPER CASE
 105     category         general category of the character
 106     combining        classes used in the Canonical Ordering Algorithm
 107     bidi             bidirectional category
 108     decomposition    character decomposition mapping
 109     decimal          if a decimal digit, the integer numeric value
 110     digit            if a digit, the numeric value
 111     numeric          if numeric, the integer or rational numeric value
 112     mirrored         if mirrored in bidirectional text
 113     unicode10        Unicode 1.0 name, if one existed and differs
 114     comment          ISO 10646 comment field
 115     upper            uppercase equivalent mapping
 116     lower            lowercase equivalent mapping
 117     title            titlecase equivalent mapping</pre>
 118 <pre>
 119     block            block the character belongs to (used in \p{In...})
 120     script           script the character belongs to</pre>
 121 <p>If no match is found, a reference to an empty hash is returned.</p>
 122 <p>The <code>block</code> property is the same as returned by charblock().  It is
 123 not defined in the Unicode Character Database proper (Chapter 4 of the
 124 Unicode 3.0 Standard, aka TUS3) but instead in an auxiliary database
 125 (Chapter 14 of TUS3).  The same holds for the <code>script</code> property and charscript().</p>
 126 <p>Note that you cannot do (de)composition and casing based solely on the
 127 above <code>decomposition</code>, <code>lower</code>, <code>upper</code>, and <code>title</code> properties;
 128 you will also need the compexcl(), casefold(), and <code>casespec()</code> functions.</p>
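     <p>As a minimal sketch of reading the fields listed above: the selection of
     fields printed here is arbitrary, and the guard is deliberately defensive
     about what comes back for an unknown code point.</p>
     <pre>
         use strict;
         use warnings;
         use Unicode::UCD 'charinfo';

         my $info = charinfo(0x41);    # U+0041

         # Treat both undef and an empty hash as &quot;no match&quot;.
         if ($info &amp;&amp; %$info) {
             printf &quot;U+%s %s (category %s, block %s, script %s)\n&quot;,
                 @{$info}{qw(code name category block script)};
         }</pre>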
 129 <p>
 130 </p>
 131 <h2><a name="charblock">charblock</a></h2>
 132 <pre>
 133     use Unicode::UCD 'charblock';</pre>
 134 <pre>
 135     my $charblock = charblock(0x41);
 136     my $charblock = charblock(1234);
 137     my $charblock = charblock(&quot;0x263a&quot;);
 138     my $charblock = charblock(&quot;U+263a&quot;);</pre>
 139 <pre>
 140     my $range     = charblock('Armenian');</pre>
 141 <p>With a <strong>code point argument</strong> <code>charblock()</code> returns the <em>block</em> the character
 142 belongs to, e.g.  <code>Basic Latin</code>.  Note that not all the character
 143 positions within all blocks are defined.</p>
 144 <p>See also <a href="#blocks_versus_scripts">Blocks versus Scripts</a>.</p>
 145 <p>If supplied with an argument that can't be a code point, <code>charblock()</code> tries
 146 to do the opposite and interpret the argument as a character block. The
 147 return value is a <em>range</em>: an anonymous list of lists that contain
 148 <em>start-of-range</em>, <em>end-of-range</em> code point pairs. You can test whether
 149 a code point is in a range using the <a href="#charinrange">charinrange</a> function. If the
 150 argument is not a known character block, <a href="file://C|\msysgit\mingw\html/pod/perlfunc.html#item_undef"><code>undef</code></a> is returned.</p>
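     <p>Both directions can be combined in one short script.  This is only a
     sketch; the code points U+263A and U+0531 are arbitrary examples, and only
     the first two elements of each range entry are relied upon.</p>
     <pre>
         use strict;
         use warnings;
         use Unicode::UCD qw(charblock charinrange);

         # Code point to block name.
         my $name = charblock(0x263a);
         print &quot;U+263A is in block: $name\n&quot; if defined $name;

         # Block name to range, then a membership test with charinrange().
         my $range = charblock('Armenian');
         if (defined $range) {
             for my $pair (@$range) {
                 printf &quot;Armenian covers U+%04X..U+%04X\n&quot;, $pair-&gt;[0], $pair-&gt;[1];
             }
             print &quot;U+0531 is in the Armenian block\n&quot;
                 if charinrange($range, 0x531);
         }</pre>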
 151 <p>
 152 </p>
 153 <h2><a name="charscript">charscript</a></h2>
 154 <pre>
 155     use Unicode::UCD 'charscript';</pre>
 156 <pre>
 157     my $charscript = charscript(0x41);
 158     my $charscript = charscript(1234);
 159     my $charscript = charscript(&quot;U+263a&quot;);</pre>
 160 <pre>
 161     my $range      = charscript('Thai');</pre>
 162 <p>With a <strong>code point argument</strong> <code>charscript()</code> returns the <em>script</em> the
 163 character belongs to, e.g.  <code>Latin</code>, <code>Greek</code>, <code>Han</code>.</p>
 164 <p>See also <a href="#blocks_versus_scripts">Blocks versus Scripts</a>.</p>
 165 <p>If supplied with an argument that can't be a code point, <code>charscript()</code> tries
 166 to do the opposite and interpret the argument as a character script. The
 167 return value is a <em>range</em>: an anonymous list of lists that contain
 168 <em>start-of-range</em>, <em>end-of-range</em> code point pairs. You can test whether a
 169 code point is in a range using the <a href="#charinrange">charinrange</a> function. If the
 170 argument is not a known character script, <a href="file://C|\msysgit\mingw\html/pod/perlfunc.html#item_undef"><code>undef</code></a> is returned.</p>
 171 <p>
 172 </p>
 173 <h2><a name="charblocks">charblocks</a></h2>
 174 <pre>
 175     use Unicode::UCD 'charblocks';</pre>
 176 <pre>
 177     my $charblocks = charblocks();</pre>
 178 <p><code>charblocks()</code> returns a reference to a hash with the known block names
 179 as the keys, and the code point ranges (see <a href="#charblock">charblock</a>) as the values.</p>
 180 <p>See also <a href="#blocks_versus_scripts">Blocks versus Scripts</a>.</p>
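     <p>A possible way to walk the returned hash is sketched here.  It assumes, as
     described for <a href="#charblock">charblock</a>, that each value is a list of ranges whose
     first two elements are the start and end code points; printing only five
     blocks is an arbitrary choice.</p>
     <pre>
         use strict;
         use warnings;
         use Unicode::UCD 'charblocks';

         my $blocks = charblocks();

         # Order the block names by the start of their first range.
         my @names = sort {
             $blocks-&gt;{$a}[0][0] &lt;=&gt; $blocks-&gt;{$b}[0][0]
         } keys %$blocks;

         for my $name (@names[0 .. 4]) {
             my ($start, $end) = @{ $blocks-&gt;{$name}[0] };
             printf &quot;%-28s U+%04X..U+%04X\n&quot;, $name, $start, $end;
         }</pre>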
 181 <p>
 182 </p>
 183 <h2><a name="charscripts">charscripts</a></h2>
 184 <pre>
 185     use Unicode::UCD 'charscripts';</pre>
 186 <pre>
 187     my $charscripts = charscripts();</pre>
 188 <p><code>charscripts()</code> returns a reference to a hash with the known script
 189 names as the keys, and the code point ranges (see <a href="#charscript">charscript</a>) as the values.</p>
 190 <p>See also <a href="#blocks_versus_scripts">Blocks versus Scripts</a>.</p>
 191 <p>
 192 </p>
 193 <h2><a name="blocks_versus_scripts">Blocks versus Scripts</a></h2>
 194 <p>The difference between a block and a script is that scripts are closer
 195 to the linguistic notion of a set of characters required to write
 196 languages, while a block is more of an artifact of the Unicode character
 197 numbering and its separation into blocks of (mostly) 256 characters.</p>
 198 <p>For example, the Latin <strong>script</strong> is spread over several <strong>blocks</strong>, such
 199 as <code>Basic Latin</code>, <code>Latin-1 Supplement</code>, <code>Latin Extended-A</code>, and
 200 <code>Latin Extended-B</code>.  On the other hand, the Latin script does not
 201 contain all the characters of the <code>Basic Latin</code> block (also known as
 202 ASCII): it includes only the letters, and not, for example, the digits
 203 or the punctuation.</p>
 204 <p>For blocks see <a href="http://www.unicode.org/Public/UNIDATA/Blocks.txt">http://www.unicode.org/Public/UNIDATA/Blocks.txt</a></p>
 205 <p>For scripts see UTR #24: <a href="http://www.unicode.org/unicode/reports/tr24/">http://www.unicode.org/unicode/reports/tr24/</a></p>
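     <p>The distinction can be observed directly.  In this sketch the letter
     <code>A</code> and the digit <code>1</code> are arbitrary picks from the <code>Basic Latin</code> block, and
     the exact script name reported for the digit depends on the Unicode data
     shipped with your Perl.</p>
     <pre>
         use strict;
         use warnings;
         use Unicode::UCD qw(charblock charscript);

         # Same block for both; only the letter belongs to the Latin script.
         for my $code (0x41, 0x31) {
             my $script = charscript($code);
             $script = '(none)' unless defined $script;
             printf &quot;U+%04X block=%s script=%s\n&quot;,
                 $code, charblock($code), $script;
         }</pre>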
 206 <p>
 207 </p>
 208 <h2><a name="matching_scripts_and_blocks">Matching Scripts and Blocks</a></h2>
 209 <p>Scripts are matched with the regular-expression construct
 210 <code>\p{...}</code> (e.g. <code>\p{Tibetan}</code> matches characters of the Tibetan script),
 211 while <code>\p{In...}</code> is used for blocks (e.g. <code>\p{InTibetan}</code> matches
 212 any of the 256 code points in the Tibetan block).</p>
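     <p>A small sketch of the difference, assuming that U+0F48 is an unassigned
     position inside the Tibetan block (true of the Unicode versions this was
     checked against, but worth verifying for yours):</p>
     <pre>
         use strict;
         use warnings;

         my $ka  = chr(0x0F40);    # TIBETAN LETTER KA
         my $gap = chr(0x0F48);    # in the block, assumed unassigned

         for my $char ($ka, $gap) {
             printf &quot;U+%04X script match: %s, block match: %s\n&quot;, ord($char),
                 ($char =~ /\p{Tibetan}/   ? 'yes' : 'no'),
                 ($char =~ /\p{InTibetan}/ ? 'yes' : 'no');
         }</pre>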
 213 <p>
 214 </p>
 215 <h2><a name="code_point_arguments">Code Point Arguments</a></h2>
 216 <p>A <em>code point argument</em> is either a decimal or a hexadecimal scalar
 217 designating a Unicode character, or <code>U+</code> followed by hexadecimal digits
 218 designating a Unicode character.  In other words, if you want a code
 219 point to be interpreted as a hexadecimal number, you must prefix it
 220 with either <code>0x</code> or <code>U+</code>, because a string such as <code>123</code> will
 221 be interpreted as a decimal code point.  Also note that Unicode is
 222 <strong>not</strong> limited to 16 bits (code points run from U+0000 up to
 223 U+10FFFF): you may have more than 4 hexdigits.</p>
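     <p>The following sketch contrasts the accepted spellings.  The value 263 is
     chosen only because its decimal and hexadecimal readings fall in different
     blocks.</p>
     <pre>
         use strict;
         use warnings;
         use Unicode::UCD 'charblock';

         # The bare string &quot;263&quot; is read as decimal 263 (U+0107);
         # the other three spellings all name U+0263.
         for my $arg (0x263, &quot;0x263&quot;, &quot;U+263&quot;, &quot;263&quot;) {
             printf &quot;%-6s =&gt; %s\n&quot;, $arg, charblock($arg);
         }</pre>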
 224 <p>
 225 </p>
 226 <h2><a name="charinrange">charinrange</a></h2>
 227 <p>In addition to using the <code>\p{In...}</code> and <code>\P{In...}</code> constructs, you
 228 can also test whether a code point is in the <em>range</em> as returned by
 229 <a href="#charblock">charblock</a> and <a href="#charscript">charscript</a> or as the values of the hash returned
 230 by <a href="#charblocks">charblocks</a> and <a href="#charscripts">charscripts</a> by using charinrange():</p>
 231 <pre>
 232     use Unicode::UCD qw(charscript charinrange);</pre>
 233 <pre>
 234     $range = charscript('Hiragana');
 235     print &quot;looks like hiragana\n&quot; if charinrange($range, $codepoint);</pre>
 236 <p>
 237 </p>
 238 <h2><a name="compexcl">compexcl</a></h2>
 239 <pre>
 240     use Unicode::UCD 'compexcl';</pre>
 241 <pre>
 242     my $compexcl = compexcl(&quot;09dc&quot;);</pre>
 243 <p><code>compexcl()</code> returns the composition exclusion (that is, whether the
 244 character should not be produced during precomposition) of the
 245 character specified by a <strong>code point argument</strong>.</p>
 246 <p>If there is a composition exclusion for the character, true is
 247 returned.  Otherwise, false is returned.</p>
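     <p>For illustration, a sketch that checks two code points; U+00E9 is an
     arbitrary precomposed letter added here for contrast with U+09DC.</p>
     <pre>
         use strict;
         use warnings;
         use Unicode::UCD 'compexcl';

         for my $code (0x09dc, 0x00e9) {
             printf &quot;U+%04X %s\n&quot;, $code,
                 compexcl($code) ? 'has a composition exclusion'
                                 : 'has no composition exclusion';
         }</pre>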
 248 <p>
 249 </p>
 250 <h2><a name="casefold">casefold</a></h2>
 251 <pre>
 252     use Unicode::UCD 'casefold';</pre>
 253 <pre>
 254     my $casefold = casefold(&quot;00DF&quot;);</pre>
 255 <p><code>casefold()</code> returns the locale-independent case folding of the
 256 character specified by a <strong>code point argument</strong>.</p>
 257 <p>If there is a case folding for that character, a reference to a hash
 258 with the following fields is returned:</p>
 259 <pre>
 260     key</pre>
 261 <pre>
 262     code             code point with at least four hexdigits
 263     status           &quot;C&quot;, &quot;F&quot;, &quot;S&quot;, or &quot;I&quot;
 264     mapping          one or more codes separated by spaces</pre>
 265 <p>The meaning of the <em>status</em> is as follows:</p>
 266 <pre>
 267    C                 common case folding, common mappings shared
 268                      by both simple and full mappings
 269    F                 full case folding, mappings that cause strings
 270                      to grow in length. Multiple characters are separated
 271                      by spaces
 272    S                 simple case folding, mappings to single characters
 273                      where different from F
 274    I                 special case for dotted uppercase I and
 275                      dotless lowercase i
 276                      - If this mapping is included, the result is
 277                        case-insensitive, but dotless and dotted I's
 278                        are not distinguished
 279                      - If this mapping is excluded, the result is not
 280                        fully case-insensitive, but dotless and dotted
 281                        I's are distinguished</pre>
 282 <p>If there is no case folding for that character, <a href="file://C|\msysgit\mingw\html/pod/perlfunc.html#item_undef"><code>undef</code></a> is returned.</p>
 283 <p>For more information about case mappings see
 284 <a href="http://www.unicode.org/unicode/reports/tr21/">http://www.unicode.org/unicode/reports/tr21/</a></p>
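     <p>A sketch of consuming the returned hash, including turning the
     space-separated <code>mapping</code> field back into a string.  The three code points
     are arbitrary examples chosen so that the output stays within ASCII.</p>
     <pre>
         use strict;
         use warnings;
         use Unicode::UCD 'casefold';

         for my $code (0x41, 0xDF, 0x61) {
             my $fold = casefold($code);
             if (defined $fold) {
                 # &quot;0073 0073&quot; becomes the two-character string &quot;ss&quot;.
                 my $text = join '', map { chr hex $_ } split ' ', $fold-&gt;{mapping};
                 printf &quot;U+%s status=%s mapping=%s (\&quot;%s\&quot;)\n&quot;,
                     @{$fold}{qw(code status mapping)}, $text;
             }
             else {
                 printf &quot;U+%04X has no case folding\n&quot;, $code;
             }
         }</pre>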
 285 <p>
 286 </p>
 287 <h2><a name="casespec">casespec</a></h2>
 288 <pre>
 289     use Unicode::UCD 'casespec';</pre>
 290 <pre>
 291     my $casespec = casespec(&quot;FB00&quot;);</pre>
 292 <p><code>casespec()</code> returns the potentially locale-dependent case mapping
 293 of the character specified by a <strong>code point argument</strong>.  The mapping
 294 may change the length of the string (which the basic Unicode case
 295 mappings as returned by <code>charinfo()</code> never do).</p>
 296 <p>If there is a special case mapping for that character, a reference to a hash
 297 with the following fields is returned:</p>
 298 <pre>
 299     key</pre>
 300 <pre>
 301     code             code point with at least four hexdigits
 302     lower            lowercase
 303     title            titlecase
 304     upper            uppercase
 305     condition        condition list (may be undef)</pre>
 306 <p>The <code>condition</code> is optional.  Where present, it consists of one or
 307 more <em>locales</em> or <em>contexts</em>, separated by spaces (other than as
 308 used to separate elements, spaces are to be ignored).  A condition
 309 list overrides the normal behavior if all of the listed conditions are
 310 true.  Case distinctions in the condition list are not significant.
 311 Conditions preceded by &quot;NON_&quot; represent the negation of the condition.</p>
 312 <p>Note that when there are multiple case mapping definitions for a
 313 single code point because of different locales, the value returned by
 314 <code>casespec()</code> is a hash reference which has the locales as the keys and
 315 hash references as described above as the values.</p>
 316 <p>A <em>locale</em> is defined as a 2-letter ISO 639 language code, possibly
 317 followed by a &quot;_&quot; and a 2-letter ISO 3166 country code (possibly followed
 318 by a &quot;_&quot; and a variant code).  For lists of those codes,
 319 see <a href="file://C|\msysgit\mingw\html/lib/Locale/Language.html">the Locale::Language manpage</a> and <a href="file://C|\msysgit\mingw\html/lib/Locale/Country.html">the Locale::Country manpage</a>.</p>
 320 <p>A <em>context</em> is one of the following choices:</p>
 321 <pre>
 322     FINAL            The letter is not followed by a letter of
 323                      general category L (i.e. Ll, Lt, Lu, Lm, or Lo)
 324     MODERN           The mapping is only used for modern text
 325     AFTER_i          The last base character was &quot;i&quot; (U+0069)</pre>
 326 <p>For more information about case mappings see
 327 <a href="http://www.unicode.org/unicode/reports/tr21/">http://www.unicode.org/unicode/reports/tr21/</a></p>
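     <p>Because the return value has two possible shapes, calling code has to
     tell them apart.  The sketch below does so by looking for the <code>code</code> key,
     which is a heuristic rather than a documented test; <code>show_entry</code> is a
     hypothetical helper introduced only for this example.</p>
     <pre>
         use strict;
         use warnings;
         use Unicode::UCD 'casespec';

         # Hypothetical helper: print one mapping entry.
         sub show_entry {
             my ($label, $entry) = @_;
             my $cond = defined $entry-&gt;{condition} ? $entry-&gt;{condition} : '(none)';
             print &quot;$label: lower=$entry-&gt;{lower} title=$entry-&gt;{title} &quot;,
                   &quot;upper=$entry-&gt;{upper} condition=$cond\n&quot;;
         }

         my $spec = casespec(0xFB00);
         if (!defined $spec) {
             print &quot;no special case mapping\n&quot;;
         }
         elsif (exists $spec-&gt;{code}) {
             show_entry('U+FB00', $spec);              # a single entry
         }
         else {
             # Keyed by locale, one entry per locale.
             show_entry(&quot;U+FB00 [$_]&quot;, $spec-&gt;{$_}) for sort keys %$spec;
         }</pre>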
 328 <p>
 329 </p>
 330 <h2><a name="namedseq__"><code>namedseq()</code></a></h2>
 331 <pre>
 332     use Unicode::UCD 'namedseq';</pre>
 333 <pre>
 334     my $namedseq = namedseq(&quot;KATAKANA LETTER AINU P&quot;);
 335     my @namedseq = namedseq(&quot;KATAKANA LETTER AINU P&quot;);
 336     my %namedseq = namedseq();</pre>
 337 <p>If used with a single argument in a scalar context, returns the string
 338 consisting of the code points of the named sequence, or <a href="file://C|\msysgit\mingw\html/pod/perlfunc.html#item_undef"><code>undef</code></a> if no
 339 named sequence by that name exists.  If used with a single argument in
 340 a list context, returns a list of the code points.  If used with no
 341 arguments in a list context, returns a hash with the names of the
 342 named sequences as the keys and the named sequences as strings as
 343 the values.  Otherwise, returns <a href="file://C|\msysgit\mingw\html/pod/perlfunc.html#item_undef"><code>undef</code></a> or empty list depending
 344 on the context.</p>
 345 <p>(New from Unicode 4.1.0)</p>
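     <p>The three calling styles can be exercised as follows.  This is a sketch
     only; it prints code point numbers and counts rather than the characters
     themselves, to avoid output-encoding concerns.</p>
     <pre>
         use strict;
         use warnings;
         use Unicode::UCD 'namedseq';

         my $name = &quot;KATAKANA LETTER AINU P&quot;;

         # One argument, list context: the code points.
         my @codes = namedseq($name);
         print join(' ', map { sprintf 'U+%04X', $_ } @codes), &quot;\n&quot;;

         # One argument, scalar context: the sequence as a string.
         my $string = namedseq($name);
         printf &quot;string of %d characters\n&quot;, length $string if defined $string;

         # No arguments, list context: all names and their strings.
         my %all = namedseq();
         printf &quot;%d named sequences known\n&quot;, scalar keys %all;</pre>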
 346 <p>
 347 </p>
 348 <h2><a name="unicode__ucd__unicodeversion">Unicode::UCD::UnicodeVersion</a></h2>
 349 <p>Unicode::UCD::UnicodeVersion() returns the version of the Unicode
 350 Character Database, in other words, the version of the Unicode
 351 standard the database implements.  The version is a string
 352 of numbers delimited by dots (<code>'.'</code>).</p>
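     <p>One possible use, sketched here, is to gate a feature on the database
     version.  The 4.1 threshold is taken from the note on <code>namedseq()</code>.</p>
     <pre>
         use strict;
         use warnings;
         use Unicode::UCD;

         my $version = Unicode::UCD::UnicodeVersion();
         my ($major, $minor) = split /\./, $version;

         print &quot;Unicode Character Database version $version\n&quot;;
         print &quot;named sequences should be available\n&quot;
             if $major &gt; 4 || ($major == 4 &amp;&amp; $minor &gt;= 1);</pre>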
 353 <p>
 354 </p>
 355 <h2><a name="implementation_note">Implementation Note</a></h2>
 356 <p>The first use of <code>charinfo()</code> opens a read-only filehandle to the Unicode
 357 Character Database (the database is included in the Perl distribution).
 358 The filehandle is then kept open for further queries.  In other words,
 359 if you are wondering where one of your filehandles went, that's where.</p>
 360 <p>
 361 </p>
 362 <hr />
 363 <h1><a name="bugs">BUGS</a></h1>
 364 <p>Does not yet support EBCDIC platforms.</p>
 365 <p>
 366 </p>
 367 <hr />
 368 <h1><a name="author">AUTHOR</a></h1>
 369 <p>Jarkko Hietaniemi</p>
 375
 376 </body>
 377
 378 </html>