<?xml version="1.0" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Unicode::UCD - Unicode character database</title>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<link rev="made" href="mailto:" />
</head>
<body style="background-color: white">
<table border="0" width="100%" cellspacing="0" cellpadding="3">
<tr><td class="block" style="background-color: #cccccc" valign="middle">
<big><strong><span class="block">&nbsp;Unicode::UCD - Unicode character database</span></strong></big>
</td></tr>
</table>
<p><a name="__index__"></a></p>
<!-- INDEX BEGIN -->
<ul>
<li><a href="#name">NAME</a></li>
<li><a href="#synopsis">SYNOPSIS</a></li>
<li><a href="#description">DESCRIPTION</a></li>
<ul>
<li><a href="#charinfo">charinfo</a></li>
<li><a href="#charblock">charblock</a></li>
<li><a href="#charscript">charscript</a></li>
<li><a href="#charblocks">charblocks</a></li>
<li><a href="#charscripts">charscripts</a></li>
<li><a href="#blocks_versus_scripts">Blocks versus Scripts</a></li>
<li><a href="#matching_scripts_and_blocks">Matching Scripts and Blocks</a></li>
<li><a href="#code_point_arguments">Code Point Arguments</a></li>
<li><a href="#charinrange">charinrange</a></li>
<li><a href="#compexcl">compexcl</a></li>
<li><a href="#casefold">casefold</a></li>
<li><a href="#casespec">casespec</a></li>
<li><a href="#namedseq__"><code>namedseq()</code></a></li>
<li><a href="#unicode__ucd__unicodeversion">Unicode::UCD::UnicodeVersion</a></li>
<li><a href="#implementation_note">Implementation Note</a></li>
</ul>
<li><a href="#bugs">BUGS</a></li>
<li><a href="#author">AUTHOR</a></li>
</ul>
<!-- INDEX END -->
<hr />
<p>
</p>
<h1><a name="name">NAME</a></h1>
<p>Unicode::UCD - Unicode character database</p>
<p>
</p>
<hr />
<h1><a name="synopsis">SYNOPSIS</a></h1>
<pre>
    use Unicode::UCD 'charinfo';
    my $charinfo   = charinfo($codepoint);</pre>
<pre>
    use Unicode::UCD 'charblock';
    my $charblock  = charblock($codepoint);</pre>
<pre>
    use Unicode::UCD 'charscript';
    my $charscript = charscript($codepoint);</pre>
<pre>
    use Unicode::UCD 'charblocks';
    my $charblocks = charblocks();</pre>
<pre>
    use Unicode::UCD 'charscripts';
    my $charscripts = charscripts();</pre>
<pre>
    use Unicode::UCD qw(charscript charinrange);
    my $range = charscript($script);
    print &quot;looks like $script\n&quot; if charinrange($range, $codepoint);</pre>
<pre>
    use Unicode::UCD 'compexcl';
    my $compexcl = compexcl($codepoint);</pre>
<pre>
    use Unicode::UCD 'namedseq';
    my $namedseq = namedseq($named_sequence_name);</pre>
<pre>
    my $unicode_version = Unicode::UCD::UnicodeVersion();</pre>
<p>
</p>
<hr />
<h1><a name="description">DESCRIPTION</a></h1>
<p>The Unicode::UCD module offers a simple interface to the Unicode
Character Database.</p>
<p>
</p>
<h2><a name="charinfo">charinfo</a></h2>
<pre>
    use Unicode::UCD 'charinfo';</pre>
<pre>
    my $charinfo = charinfo(0x41);</pre>
<p><code>charinfo()</code> returns a reference to a hash that has the following fields
as defined by the Unicode standard:</p>
<pre>
    key</pre>
<pre>
    code             code point with at least four hexdigits
    name             name of the character IN UPPER CASE
    category         general category of the character
    combining        classes used in the Canonical Ordering Algorithm
    bidi             bidirectional category
    decomposition    character decomposition mapping
    decimal          if decimal digit, this is the integer numeric value
    digit            if digit, this is the numeric value
    numeric          if numeric, this is the integer or rational numeric value
    mirrored         if mirrored in bidirectional text
    unicode10        Unicode 1.0 name if existed and different
    comment          ISO 10646 comment field
    upper            uppercase equivalent mapping
    lower            lowercase equivalent mapping
    title            titlecase equivalent mapping</pre>
<pre>
    block            block the character belongs to (used in \p{In...})
    script           script the character belongs to</pre>
<p>If no match is found, a reference to an empty hash is returned.</p>
<p>The <code>block</code> property is the same as returned by <code>charblock()</code>. It is
not defined in the Unicode Character Database proper (Chapter 4 of the
Unicode 3.0 Standard, aka TUS3) but instead in an auxiliary database
(Chapter 14 of TUS3). The same applies to the <code>script</code> property.</p>
<p>Note that you cannot do (de)composition and casing based solely on the
above <code>decomposition</code>, <code>lower</code>, <code>upper</code>, and <code>title</code> properties;
you will also need the <code>compexcl()</code>, <code>casefold()</code>, and <code>casespec()</code> functions.</p>
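<p>As an illustration of the hash described above, the following minimal sketch
looks up one character and reads a few of its fields. The variable names are
arbitrary, and the values printed depend on the Unicode data shipped with the
Perl in use; newer releases of the module may return additional keys.</p>

```perl
use strict;
use warnings;
use Unicode::UCD 'charinfo';

# Look up U+0041 (LATIN CAPITAL LETTER A) and read a few fields.
my $info = charinfo(0x41);

printf "U+%s %s (category %s)\n",
    $info->{code}, $info->{name}, $info->{category};

# Case mappings are hexadecimal code point strings, not characters,
# so they have to be converted before use.
my $lower_char = chr(hex($info->{lower}));
print "lowercase: U+$info->{lower} ($lower_char)\n";

print "block: $info->{block}, script: $info->{script}\n";
```

<p>Note that the simple mappings shown here always map one code point to one
code point; mappings that change the length of a string are only available
through <code>casespec()</code>.</p>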
<p>
</p>
<h2><a name="charblock">charblock</a></h2>
<pre>
    use Unicode::UCD 'charblock';</pre>
<pre>
    my $charblock = charblock(0x41);
    $charblock    = charblock(1234);
    $charblock    = charblock(&quot;0x263a&quot;);
    $charblock    = charblock(&quot;U+263a&quot;);</pre>
<pre>
    my $range     = charblock('Armenian');</pre>
<p>With a <strong>code point argument</strong>, <code>charblock()</code> returns the <em>block</em> the character
belongs to, e.g. <code>Basic Latin</code>. Note that not all the character
positions within all blocks are defined.</p>
<p>See also <a href="#blocks_versus_scripts">Blocks versus Scripts</a>.</p>
<p>If supplied with an argument that can't be a code point, <code>charblock()</code> tries
to do the opposite and interpret the argument as a character block. The
return value is a <em>range</em>: an anonymous list of lists that contain
<em>start-of-range</em>, <em>end-of-range</em> code point pairs. You can test whether
a code point is in a range using the <a href="#charinrange">charinrange</a> function. If the
argument is not a known character block, <a href="file://C|\msysgit\mingw\html/pod/perlfunc.html#item_undef"><code>undef</code></a> is returned.</p>
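<p>The two calling conventions can be sketched as follows. This is only an
illustration: the variable names are made up, and the code reads just the
first two elements of each range entry because some versions of the module
appear to store extra elements after the start and end code points.</p>

```perl
use strict;
use warnings;
use Unicode::UCD qw(charblock charinrange);

# Code point argument: the result is a block name.
my $block_name = charblock(0x263a);
print "U+263A is in block: $block_name\n";

# Block name argument: the result is a reference to a list of
# [start, end, ...] entries.
my $range = charblock('Armenian');
for my $subrange (@$range) {
    printf "Armenian covers U+%04X..U+%04X\n",
        $subrange->[0], $subrange->[1];
}

# The range can be passed straight to charinrange().
my $inside  = charinrange($range, 0x0531) ? 1 : 0;  # ARMENIAN CAPITAL LETTER AYB
my $outside = charinrange($range, 0x0041) ? 1 : 0;  # LATIN CAPITAL LETTER A
print "U+0531 in Armenian block: $inside\n";
print "U+0041 in Armenian block: $outside\n";
```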
<p>
</p>
<h2><a name="charscript">charscript</a></h2>
<pre>
    use Unicode::UCD 'charscript';</pre>
<pre>
    my $charscript = charscript(0x41);
    $charscript    = charscript(1234);
    $charscript    = charscript(&quot;U+263a&quot;);</pre>
<pre>
    my $range      = charscript('Thai');</pre>
<p>With a <strong>code point argument</strong>, <code>charscript()</code> returns the <em>script</em> the
character belongs to, e.g. <code>Latin</code>, <code>Greek</code>, <code>Han</code>.</p>
<p>See also <a href="#blocks_versus_scripts">Blocks versus Scripts</a>.</p>
<p>If supplied with an argument that can't be a code point, <code>charscript()</code> tries
to do the opposite and interpret the argument as a character script. The
return value is a <em>range</em>: an anonymous list of lists that contain
<em>start-of-range</em>, <em>end-of-range</em> code point pairs. You can test whether a
code point is in a range using the <a href="#charinrange">charinrange</a> function. If the
argument is not a known character script, <a href="file://C|\msysgit\mingw\html/pod/perlfunc.html#item_undef"><code>undef</code></a> is returned.</p>
<p>
</p>
<h2><a name="charblocks">charblocks</a></h2>
<pre>
    use Unicode::UCD 'charblocks';</pre>
<pre>
    my $charblocks = charblocks();</pre>
<p><code>charblocks()</code> returns a reference to a hash with the known block names
as the keys, and the code point ranges (see <a href="#charblock">charblock</a>) as the values.</p>
<p>See also <a href="#blocks_versus_scripts">Blocks versus Scripts</a>.</p>
<p>
</p>
<h2><a name="charscripts">charscripts</a></h2>
<pre>
    use Unicode::UCD 'charscripts';</pre>
<pre>
    my $charscripts = charscripts();</pre>
<p><code>charscripts()</code> returns a reference to a hash with the known script names
as the keys, and the code point ranges (see <a href="#charscript">charscript</a>) as the values.</p>
<p>See also <a href="#blocks_versus_scripts">Blocks versus Scripts</a>.</p>
<p>
</p>
<h2><a name="blocks_versus_scripts">Blocks versus Scripts</a></h2>
<p>The difference between a block and a script is that scripts are closer
to the linguistic notion of a set of characters required to present
languages, while a block is more of an artifact of the Unicode character
numbering and separation into blocks of (mostly) 256 characters.</p>
<p>For example, the Latin <strong>script</strong> is spread over several <strong>blocks</strong>, such
as <code>Basic Latin</code>, <code>Latin 1 Supplement</code>, <code>Latin Extended-A</code>, and
<code>Latin Extended-B</code>. On the other hand, the Latin script does not
contain all the characters of the <code>Basic Latin</code> block (also known as
ASCII): it includes only the letters, and not, for example, the digits
or the punctuation.</p>
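<p>The distinction can be checked directly with <code>charblock()</code> and
<code>charscript()</code>. The sketch below is illustrative only; the number of
ranges reported for the Latin script depends on the Unicode version that
ships with the Perl in use.</p>

```perl
use strict;
use warnings;
use Unicode::UCD qw(charblock charscript);

# Both characters sit in the same block, but only the letter belongs
# to the Latin script; the digit is shared by many scripts.
for my $code (0x41, 0x35) {    # "A" and "5"
    printf "U+%04X block=%s script=%s\n",
        $code, charblock($code), charscript($code);
}

# A block is one contiguous range; a script is usually many ranges.
my $block_ranges  = charblock('Basic Latin');
my $script_ranges = charscript('Latin');
printf "Basic Latin block: %d range(s)\n", scalar @$block_ranges;
printf "Latin script: %d range(s)\n",      scalar @$script_ranges;
```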
<p>For blocks see <a href="http://www.unicode.org/Public/UNIDATA/Blocks.txt">http://www.unicode.org/Public/UNIDATA/Blocks.txt</a></p>
<p>For scripts see UTR #24: <a href="http://www.unicode.org/unicode/reports/tr24/">http://www.unicode.org/unicode/reports/tr24/</a></p>
<p>
</p>
<h2><a name="matching_scripts_and_blocks">Matching Scripts and Blocks</a></h2>
<p>Scripts are matched with the regular-expression construct
<code>\p{...}</code> (e.g. <code>\p{Tibetan}</code> matches characters of the Tibetan script),
while <code>\p{In...}</code> is used for blocks (e.g. <code>\p{InTibetan}</code> matches
any of the 256 code points in the Tibetan block).</p>
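<p>A small sketch of the difference between the two constructs follows. It
assumes that U+0F48 is a reserved (unassigned) position inside the Tibetan
block, which is believed to hold for current Unicode versions but should be
checked against the data in use.</p>

```perl
use strict;
use warnings;

my $letter     = chr(0x0F40);   # TIBETAN LETTER KA
my $unassigned = chr(0x0F48);   # assumed reserved position in the Tibetan block

# The letter belongs to both the Tibetan script and the Tibetan block.
my $letter_in_script = $letter =~ /\p{Tibetan}/   ? 1 : 0;
my $letter_in_block  = $letter =~ /\p{InTibetan}/ ? 1 : 0;

# An unassigned position is still part of the block, but of no script.
my $unassigned_in_script = $unassigned =~ /\p{Tibetan}/   ? 1 : 0;
my $unassigned_in_block  = $unassigned =~ /\p{InTibetan}/ ? 1 : 0;

print "U+0F40: script=$letter_in_script block=$letter_in_block\n";
print "U+0F48: script=$unassigned_in_script block=$unassigned_in_block\n";
```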
<p>
</p>
<h2><a name="code_point_arguments">Code Point Arguments</a></h2>
<p>A <em>code point argument</em> is either a decimal or a hexadecimal scalar
designating a Unicode character, or <code>U+</code> followed by hexadecimal digits
designating a Unicode character. In other words, if you want a code
point to be interpreted as a hexadecimal number, you must prefix it
with either <code>0x</code> or <code>U+</code>, because a string such as <code>123</code> will
be interpreted as a decimal code point. Also note that Unicode is
<strong>not</strong> limited to 16 bits (the number of Unicode characters is
open-ended, in theory unlimited): you may have more than 4 hexdigits.</p>
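<p>The effect of the prefix can be seen by passing the same digits in
different forms. The following is a minimal sketch; the <code>%seen</code> hash is
just a hypothetical place to keep the results.</p>

```perl
use strict;
use warnings;
use Unicode::UCD 'charinfo';

# The same digits designate different characters depending on the prefix,
# and code points beyond four hexdigits are accepted as well.
my %seen;
for my $arg ("123", "0x123", "U+123", "U+1D11E") {
    my $info = charinfo($arg);
    $seen{$arg} = $info->{code};
    print "$arg => U+$info->{code} $info->{name}\n";
}
```

<p>Here the bare string <code>123</code> is read as decimal 123 (U+007B), while
the prefixed forms are read as hexadecimal.</p>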
<p>
</p>
<h2><a name="charinrange">charinrange</a></h2>
<p>In addition to using the <code>\p{In...}</code> and <code>\P{In...}</code> constructs, you
can also test whether a code point is in the <em>range</em> as returned by
<a href="#charblock">charblock</a> and <a href="#charscript">charscript</a> or as the values of the hash returned
by <a href="#charblocks">charblocks</a> and <a href="#charscripts">charscripts</a> by using <code>charinrange()</code>:</p>
<pre>
    use Unicode::UCD qw(charscript charinrange);</pre>
<pre>
    my $range = charscript('Hiragana');
    print &quot;looks like hiragana\n&quot; if charinrange($range, $codepoint);</pre>
<p>
</p>
<h2><a name="compexcl">compexcl</a></h2>
<pre>
    use Unicode::UCD 'compexcl';</pre>
<pre>
    my $compexcl = compexcl(&quot;09dc&quot;);</pre>
<p><code>compexcl()</code> returns the composition exclusion (that is, whether the
character should not be produced during a precomposition) of the
character specified by a <strong>code point argument</strong>.</p>
<p>If there is a composition exclusion for the character, true is
returned. Otherwise, false is returned.</p>
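<p>As a brief sketch of the true/false result, the following compares the
character from the example above with an ordinary letter. The message strings
are made up for the illustration.</p>

```perl
use strict;
use warnings;
use Unicode::UCD 'compexcl';

# U+09DC BENGALI LETTER RRA is listed as a composition exclusion;
# U+0041 LATIN CAPITAL LETTER A is not.
for my $code (0x09dc, 0x0041) {
    printf "U+%04X is %s\n", $code,
        compexcl($code) ? "excluded from composition" : "not excluded";
}
```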
<p>
</p>
<h2><a name="casefold">casefold</a></h2>
<pre>
    use Unicode::UCD 'casefold';</pre>
<pre>
    my $casefold = casefold(&quot;00DF&quot;);</pre>
<p><code>casefold()</code> returns the locale-independent case folding of the
character specified by a <strong>code point argument</strong>.</p>
<p>If there is a case folding for that character, a reference to a hash
with the following fields is returned:</p>
<pre>
    key</pre>
<pre>
    code             code point with at least four hexdigits
    status           &quot;C&quot;, &quot;F&quot;, &quot;S&quot;, or &quot;I&quot;
    mapping          one or more codes separated by spaces</pre>
<p>The meaning of the <em>status</em> is as follows:</p>
<pre>
   C                 common case folding, common mappings shared
                     by both simple and full mappings
   F                 full case folding, mappings that cause strings
                     to grow in length. Multiple characters are separated
                     by spaces
   S                 simple case folding, mappings to single characters
                     where different from F
   I                 special case for dotted uppercase I and
                     dotless lowercase i
                     - If this mapping is included, the result is
                       case-insensitive, but dotless and dotted I's
                       are not distinguished
                     - If this mapping is excluded, the result is not
                       fully case-insensitive, but dotless and dotted
                       I's are distinguished</pre>
<p>If there is no case folding for that character, <a href="file://C|\msysgit\mingw\html/pod/perlfunc.html#item_undef"><code>undef</code></a> is returned.</p>
<p>For more information about case mappings see
<a href="http://www.unicode.org/unicode/reports/tr21/">http://www.unicode.org/unicode/reports/tr21/</a></p>
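<p>The following sketch shows one way to turn the <code>mapping</code> field into
an actual string and to handle the <code>undef</code> case. It is an illustration
only; newer releases of the module may return additional keys beyond the three
documented here.</p>

```perl
use strict;
use warnings;
use Unicode::UCD 'casefold';

# U+00DF (sharp s) has a full folding, U+0041 ("A") a common folding,
# and U+0061 ("a") no folding at all.
for my $code (0x00DF, 0x0041, 0x0061) {
    my $fold = casefold($code);
    if (defined $fold) {
        # mapping is a space-separated list of hexadecimal code points
        my $folded = join '', map { chr(hex($_)) } split ' ', $fold->{mapping};
        printf "U+%04X status=%s mapping=%s (%s)\n",
            $code, $fold->{status}, $fold->{mapping}, $folded;
    }
    else {
        printf "U+%04X has no case folding\n", $code;
    }
}
```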
<p>
</p>
<h2><a name="casespec">casespec</a></h2>
<pre>
    use Unicode::UCD 'casespec';</pre>
<pre>
    my $casespec = casespec(&quot;FB00&quot;);</pre>
<p><code>casespec()</code> returns the potentially locale-dependent case mapping
of the character specified by a <strong>code point argument</strong>. The mapping
may change the length of the string (which the basic Unicode case
mappings as returned by <code>charinfo()</code> never do).</p>
<p>If there is a case mapping for that character, a reference to a hash
with the following fields is returned:</p>
<pre>
    key</pre>
<pre>
    code             code point with at least four hexdigits
    lower            lowercase
    title            titlecase
    upper            uppercase
    condition        condition list (may be undef)</pre>
<p>The <code>condition</code> is optional. Where present, it consists of one or
more <em>locales</em> or <em>contexts</em>, separated by spaces (other than as
used to separate elements, spaces are to be ignored). A condition
list overrides the normal behavior if all of the listed conditions are
true. Case distinctions in the condition list are not significant.
Conditions preceded by &quot;NON_&quot; represent the negation of the condition.</p>
<p>Note that when there are multiple case mapping definitions for a
single code point because of different locales, the value returned by
<code>casespec()</code> is a hash reference which has the locales as the keys and
hash references as described above as the values.</p>
<p>A <em>locale</em> is defined as a 2-letter ISO 3166 country code, possibly
followed by a &quot;_&quot; and a 2-letter ISO language code (possibly followed
by a &quot;_&quot; and a variant code). For the lists of those codes,
see <a href="file://C|\msysgit\mingw\html/lib/Locale/Country.html">the Locale::Country manpage</a> and <a href="file://C|\msysgit\mingw\html/lib/Locale/Language.html">the Locale::Language manpage</a>.</p>
<p>A <em>context</em> is one of the following choices:</p>
<pre>
    FINAL            The letter is not followed by a letter of
                     general category L (e.g. Ll, Lt, Lu, Lm, or Lo)
    MODERN           The mapping is only used for modern text
    AFTER_i          The last base character was &quot;i&quot; (U+0069)</pre>
<p>For more information about case mappings see
<a href="http://www.unicode.org/unicode/reports/tr21/">http://www.unicode.org/unicode/reports/tr21/</a></p>
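<p>A minimal sketch of reading the returned hash, using the ligature from the
example above, might look like this. It only covers the simple case of a single
hash; code points with locale-specific entries return the nested form described
above and would need an extra level of lookup.</p>

```perl
use strict;
use warnings;
use Unicode::UCD 'casespec';

# U+FB00 LATIN SMALL LIGATURE FF expands to two letters when it is
# uppercased or titlecased.
my $spec = casespec(0xFB00);

for my $case (qw(lower title upper)) {
    my @codes = split ' ', $spec->{$case};
    printf "%-5s %s (%d code point%s)\n",
        $case, $spec->{$case}, scalar @codes, @codes == 1 ? '' : 's';
}

my $condition = defined $spec->{condition} ? $spec->{condition} : '(none)';
print "condition: $condition\n";
```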
<p>
</p>
<h2><a name="namedseq__"><code>namedseq()</code></a></h2>
<pre>
    use Unicode::UCD 'namedseq';</pre>
<pre>
    my $namedseq = namedseq(&quot;KATAKANA LETTER AINU P&quot;);
    my @namedseq = namedseq(&quot;KATAKANA LETTER AINU P&quot;);
    my %namedseq = namedseq();</pre>
<p>If used with a single argument in a scalar context, returns the string
consisting of the code points of the named sequence, or <a href="file://C|\msysgit\mingw\html/pod/perlfunc.html#item_undef"><code>undef</code></a> if no
named sequence by that name exists. If used with a single argument in
a list context, returns a list of the code points. If used with no
arguments in a list context, returns a hash with the names of the
named sequences as the keys and the named sequences as strings as
the values. Otherwise, returns <a href="file://C|\msysgit\mingw\html/pod/perlfunc.html#item_undef"><code>undef</code></a> or an empty list depending
on the context.</p>
<p>(New from Unicode 4.1.0)</p>
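<p>The three calling contexts can be sketched as follows. The variable names
are arbitrary, and the number of known sequences depends on the Unicode
version in use.</p>

```perl
use strict;
use warnings;
use Unicode::UCD 'namedseq';

my $name = "KATAKANA LETTER AINU P";

# List context: the individual code points.
my @codes = namedseq($name);
print join(' ', map { sprintf 'U+%04X', $_ } @codes), "\n";

# Scalar context: a string made of those characters.
my $string = namedseq($name);
printf "string length: %d\n", length $string;

# No arguments, list context: every known sequence, keyed by name.
my %all = namedseq();
printf "%d named sequences known\n", scalar keys %all;
```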
<p>
</p>
<h2><a name="unicode__ucd__unicodeversion">Unicode::UCD::UnicodeVersion</a></h2>
<p>Unicode::UCD::UnicodeVersion() returns the version of the Unicode
Character Database, in other words, the version of the Unicode
standard the database implements. The version is a string
of numbers delimited by dots (<code>'.'</code>).</p>
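<p>Because the version is a dotted string, it has to be split before it can
be compared numerically. The sketch below shows one way to do that; the
<code>$has_named_sequences</code> check is a hypothetical use, based on the note
that named sequences are new in Unicode 4.1.0.</p>

```perl
use strict;
use warnings;
use Unicode::UCD;

my $version = Unicode::UCD::UnicodeVersion();
my ($major, $minor) = split /\./, $version;
print "Unicode $version (major $major, minor $minor)\n";

# Hypothetical feature check: named sequences need Unicode 4.1.0 or later.
my $has_named_sequences
    = ($major > 4 || ($major == 4 && $minor >= 1)) ? 1 : 0;
print "named sequences available: $has_named_sequences\n";
```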
<p>
</p>
<h2><a name="implementation_note">Implementation Note</a></h2>
<p>The first use of <code>charinfo()</code> opens a read-only filehandle to the Unicode
Character Database (the database is included in the Perl distribution).
The filehandle is then kept open for further queries. In other words,
if you are wondering where one of your filehandles went, that's where.</p>
<p>
</p>
<hr />
<h1><a name="bugs">BUGS</a></h1>
<p>Does not yet support EBCDIC platforms.</p>
<p>
</p>
<hr />
<h1><a name="author">AUTHOR</a></h1>
<p>Jarkko Hietaniemi</p>
</body>
</html>