<?xml version="1.0" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>utf8 - Perl pragma to enable/disable UTF-8 in source code</title>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<link rev="made" href="mailto:" />
</head>
<body style="background-color: white">
<table border="0" width="100%" cellspacing="0" cellpadding="3">
<tr><td class="block" style="background-color: #cccccc" valign="middle">
<big><strong><span class="block">&nbsp;utf8 - Perl pragma to enable/disable UTF-8 in source code</span></strong></big>
</td></tr>
</table>
<p><a name="__index__"></a></p>
<!-- INDEX BEGIN -->
<ul>
<li><a href="#name">NAME</a></li>
<li><a href="#synopsis">SYNOPSIS</a></li>
<li><a href="#description">DESCRIPTION</a></li>
<ul>
<li><a href="#utility_functions">Utility functions</a></li>
</ul>
<li><a href="#bugs">BUGS</a></li>
<li><a href="#see_also">SEE ALSO</a></li>
</ul>
<!-- INDEX END -->
<hr />
<p>
</p>
<h1><a name="name">NAME</a></h1>
<p>utf8 - Perl pragma to enable/disable UTF-8 (or UTF-EBCDIC) in source code</p>
<p>
</p>
<hr />
<h1><a name="synopsis">SYNOPSIS</a></h1>
<pre>
use utf8;
no utf8;</pre>
<pre>
# Convert a Perl scalar to/from UTF-8.
$num_octets = utf8::upgrade($string);
$success = utf8::downgrade($string[, FAIL_OK]);</pre>
<pre>
# Change the native bytes of a Perl scalar to/from UTF-8 bytes.
utf8::encode($string);
utf8::decode($string);</pre>
<pre>
$flag = utf8::is_utf8(STRING); # since Perl 5.8.1
$flag = utf8::valid(STRING);</pre>
<p>
</p>
<hr />
<h1><a name="description">DESCRIPTION</a></h1>
<p>The <code>use utf8</code> pragma tells the Perl parser to allow UTF-8 in the
program text in the current lexical scope (allow UTF-EBCDIC on EBCDIC based
platforms). The <code>no utf8</code> pragma tells Perl to switch back to treating
the source text as literal bytes in the current lexical scope.</p>
<p>This pragma is primarily a compatibility device. Perl versions
earlier than 5.6 allowed arbitrary bytes in source code, whereas
in the future we would like to standardize on the UTF-8 encoding for
source text.</p>
<p><strong>Do not use this pragma for anything other than telling Perl that your
script is written in UTF-8.</strong> The utility functions described below are
useful for their own purposes, but they are not really part of the
&quot;pragmatic&quot; effect.</p>
<p>Until UTF-8 becomes the default format for source text, either this
pragma or the <a href="file://C|\msysgit\mingw\html/lib/encoding.html">encoding</a> pragma should be used to recognize UTF-8
in the source. When UTF-8 becomes the standard source format, this
pragma will effectively become a no-op. For convenience, in what
follows the term <em>UTF-X</em> is used to refer to UTF-8 on ASCII and ISO
Latin based platforms and UTF-EBCDIC on EBCDIC based platforms.</p>
<p>See also the effects of the <code>-C</code> switch and its cousin, the
<code>$ENV{PERL_UNICODE}</code> environment variable, in <a href="file://C|\msysgit\mingw\html/pod/perlrun.html">the perlrun manpage</a>.</p>
<p>Enabling the <code>utf8</code> pragma has the following effect:</p>
<ul>
<li>
<p>Bytes in the source text that have their high bit set will be treated
as being part of a literal UTF-8 character. This includes most
literals such as identifier names, string constants, and constant
regular expression patterns.</p>
<p>On EBCDIC platforms, characters in the Latin 1 character set are
treated as being part of a literal UTF-EBCDIC character.</p>
</li>
</ul>
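<p>A minimal sketch of this effect, assuming the script file itself is saved
as UTF-8: the same one-character literal is counted as one character under
<code>use utf8</code> and as two bytes under <code>no utf8</code>. The variable
names are illustrative only.</p>
<pre>
use strict;
use warnings;

my ($with_pragma, $without_pragma);
{
    use utf8;                        # literals in this block are decoded as UTF-8
    $with_pragma = length('é');      # counted as one character
}
{
    no utf8;                         # literals in this block stay raw bytes
    $without_pragma = length('é');   # counted as two bytes (0xC3 0xA9)
}
print "$with_pragma $without_pragma\n";   # prints "1 2"</pre>
<p>Because the pragma is lexically scoped, each setting ends at the closing
brace of its block.</p>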
<p>Note that if you have bytes with the eighth bit on in your script
(for example embedded Latin-1 in your string literals), <code>use utf8</code>
will be unhappy since the bytes are most probably not well-formed
UTF-8. If you want to have such bytes and use utf8, you can disable
utf8 until the end of the block (or file, if at top level) with <code>no utf8;</code>.</p>
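<p>One way to see why Latin-1 bytes are usually not well-formed UTF-8 is to
test them with <code>utf8::decode</code>, which returns false for octets that
are not valid UTF-8. This is only a sketch for ASCII-based platforms;
<code>is_wellformed_utf8</code> is a hypothetical helper name, not part of the
<code>utf8::</code> package.</p>
<pre>
use strict;
use warnings;

# Returns 1 if the octets in $bytes form well-formed UTF-8, 0 otherwise.
# (Hypothetical helper, introduced only for this illustration.)
sub is_wellformed_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;                  # utf8::decode works in place, so test a copy
    return utf8::decode($copy) ? 1 : 0;
}

my $latin1_octets = "stra\xDFe";        # sharp s as the single Latin-1 byte 0xDF
my $utf8_octets   = "stra\xC3\x9Fe";    # the same word encoded as UTF-8

print is_wellformed_utf8($latin1_octets), "\n";   # prints 0
print is_wellformed_utf8($utf8_octets), "\n";     # prints 1</pre>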
<p>If you want to automatically upgrade your 8-bit legacy bytes to UTF-8,
use the <a href="file://C|\msysgit\mingw\html/lib/encoding.html">encoding</a> pragma instead of this pragma. For example, if
you want to implicitly upgrade your ISO 8859-1 (Latin-1) bytes to UTF-8
as used in e.g. <a href="file://C|\msysgit\mingw\html/pod/perlfunc.html#item_chr"><code>chr()</code></a> and <code>\x{...}</code>, try this:</p>
<pre>
use encoding &quot;latin-1&quot;;
my $c = chr(0xc4);
my $x = &quot;\x{c5}&quot;;</pre>
<p>In case you are wondering: yes, <code>use encoding 'utf8';</code> works much
the same as <code>use utf8;</code>.</p>
<p>
</p>
<h2><a name="utility_functions">Utility functions</a></h2>
<p>The following functions are defined in the <code>utf8::</code> package by the
Perl core. You do not need to say <code>use utf8</code> to use these and in fact
you should not say that unless you really want to have UTF-8 source code.</p>
<ul>
<li><strong><a name="item_upgrade">$num_octets = utf8::upgrade($string)</a></strong>
<p>Converts in-place the octet sequence in the native encoding
(Latin-1 or EBCDIC) to the equivalent character sequence in <em>UTF-X</em>.
Calling it on a <em>$string</em> that is already encoded as characters does no harm.
Returns the number of octets necessary to represent the string as <em>UTF-X</em>.
Can be used to make sure that the UTF-8 flag is on,
so that <code>\w</code> or <a href="file://C|\msysgit\mingw\html/pod/perlfunc.html#item_lc"><code>lc()</code></a> work as Unicode on strings
containing characters in the range 0x80-0xFF (on ASCII and
derivatives).</p>
<p><strong>Note that this function does not handle arbitrary encodings.</strong>
Therefore <em>Encode.pm</em> is recommended for general-purpose use.</p>
<p>Affected by the encoding pragma.</p>
</li>
<li><strong><a name="item_downgrade">$success = utf8::downgrade($string[, FAIL_OK])</a></strong>
<p>Converts in-place the character sequence in <em>UTF-X</em>
to the equivalent octet sequence in the native encoding (Latin-1 or EBCDIC).
Calling it on a <em>$string</em> that is already encoded as octets does no harm.
Returns true on success. On failure it dies or, if the value of
<code>FAIL_OK</code> is true, returns false.
Can be used to make sure that the UTF-8 flag is off,
e.g. when you want to make sure that the <a href="file://C|\msysgit\mingw\html/pod/perlvar.html#item_substr"><code>substr()</code></a> or <a href="file://C|\msysgit\mingw\html/pod/perlfunc.html#item_length"><code>length()</code></a> function
works with the usually faster byte algorithm.</p>
<p><strong>Note that this function does not handle arbitrary encodings.</strong>
Therefore <em>Encode.pm</em> is recommended for general-purpose use.</p>
<p><strong>Not</strong> affected by the encoding pragma.</p>
<p><strong>NOTE:</strong> this function is experimental and may change
or be removed without notice.</p>
</li>
<li><strong><a name="item_encode">utf8::encode($string)</a></strong>
<p>Converts in-place the character sequence to the corresponding octet sequence
in <em>UTF-X</em>. The UTF-8 flag is turned off. Returns nothing.</p>
<p><strong>Note that this function does not handle arbitrary encodings.</strong>
Therefore <em>Encode.pm</em> is recommended for general-purpose use.</p>
</li>
<li><strong><a name="item_decode">utf8::decode($string)</a></strong>
<p>Attempts to convert in-place the octet sequence in <em>UTF-X</em>
to the corresponding character sequence. The UTF-8 flag is turned on
only if the source string contains multiple-byte <em>UTF-X</em> characters.
If <em>$string</em> is invalid as <em>UTF-X</em>, returns false; otherwise returns true.</p>
<p><strong>Note that this function does not handle arbitrary encodings.</strong>
Therefore <em>Encode.pm</em> is recommended for general-purpose use.</p>
<p><strong>NOTE:</strong> this function is experimental and may change
or be removed without notice.</p>
</li>
<li><strong><a name="item_is_utf8">$flag = utf8::is_utf8(STRING)</a></strong>
<p>(Since Perl 5.8.1) Tests whether STRING is in UTF-8, that is, whether
its UTF-8 flag is on. Functionally the same as Encode::is_utf8().</p>
</li>
<li><strong><a name="item_valid">$flag = utf8::valid(STRING)</a></strong>
<p>[INTERNAL] Tests whether STRING is in a consistent state regarding
UTF-8. Returns true if STRING is well-formed UTF-8 and has the UTF-8 flag
on <strong>or</strong> if it is held as bytes (both these states are 'consistent').
The main reason for this routine is to allow Perl's test suite to check
that operations have left strings in a consistent state. You most
probably want to use utf8::is_utf8() instead.</p>
</li>
</ul>
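<p>The following sketch exercises <code>utf8::upgrade</code>,
<code>utf8::downgrade</code> and <code>utf8::is_utf8</code> together. It assumes
an ASCII-based platform where the native encoding is Latin-1; the variable
names and sample strings are illustrative only.</p>
<pre>
use strict;
use warnings;

my $string = "caf\xE9";                           # Latin-1 octets; UTF-8 flag off
my $before = utf8::is_utf8($string) ? 1 : 0;      # 0

my $num_octets = utf8::upgrade($string);          # 5: "caf" (3) plus e-acute as 2 octets
my $after  = utf8::is_utf8($string) ? 1 : 0;      # 1
my $length = length($string);                     # still 4 characters

my $success = utf8::downgrade($string) ? 1 : 0;   # 1; the flag is off again
my $final   = utf8::is_utf8($string) ? 1 : 0;     # 0

my $wide    = "\x{263A}";                         # a character with no Latin-1 equivalent
my $fail_ok = utf8::downgrade($wide, 1) ? 1 : 0;  # 0; FAIL_OK makes it return false, not die

print "$before $num_octets $after $length $success $final $fail_ok\n";   # prints "0 5 1 4 1 0 0"</pre>
<p>Note that upgrading and downgrading change only the internal representation;
the string compares equal to the original at every step.</p>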
<p><code>utf8::encode</code> is like <code>utf8::upgrade</code>, but the UTF-8 flag is
cleared. See <a href="file://C|\msysgit\mingw\html/pod/perlunicode.html">the perlunicode manpage</a> for more on the UTF-8 flag and the C API
functions <code>sv_utf8_upgrade</code>, <code>sv_utf8_downgrade</code>, <code>sv_utf8_encode</code>,
and <code>sv_utf8_decode</code>, which are wrapped by the Perl functions
<code>utf8::upgrade</code>, <code>utf8::downgrade</code>, <code>utf8::encode</code> and
<code>utf8::decode</code>. Note that in the Perl 5.8.0 and 5.8.1 implementations
the functions utf8::is_utf8, utf8::valid, utf8::encode, utf8::decode,
utf8::upgrade, and utf8::downgrade are always available, without a
<code>require utf8</code> statement; this may change in future releases.</p>
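<p>The difference between <code>utf8::encode</code> and <code>utf8::upgrade</code>
can be sketched as follows, again assuming an ASCII-based platform. Both
produce the same internal octets, but because <code>utf8::encode</code> clears
the flag, Perl afterwards sees two separate octets rather than one character.
The variable names are illustrative only.</p>
<pre>
use strict;
use warnings;

my $upgraded = "\xE9";                     # one character, U+00E9
my $encoded  = "\xE9";

utf8::upgrade($upgraded);                  # internal form changes; still 1 character
utf8::encode($encoded);                    # now the 2 octets 0xC3 0xA9; flag off

my $len_upgraded = length($upgraded);      # 1
my $len_encoded  = length($encoded);       # 2
my $unchanged = ($upgraded eq "\xE9") ? 1 : 0;        # 1; same string value as before

my $ok = utf8::decode($encoded) ? 1 : 0;              # 1; the octets are valid UTF-8
my $round_trip = ($encoded eq $upgraded) ? 1 : 0;     # 1; back to the single character

print "$len_upgraded $len_encoded $unchanged $ok $round_trip\n";   # prints "1 2 1 1 1"</pre>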
<p>
</p>
<hr />
<h1><a name="bugs">BUGS</a></h1>
<p>One can have Unicode in identifier names, but not in package/class or
subroutine names. While some limited functionality towards this does
exist as of Perl 5.8.0, that is more accidental than designed; use of
Unicode for the said purposes is unsupported.</p>
<p>One reason this support is unfinished is its (currently) inherent
lack of portability: since both package names and subroutine names may need
to be mapped to file and directory names, the Unicode capability of
the filesystem becomes important, and unfortunately there are no
portable answers.</p>
<p>
</p>
<hr />
<h1><a name="see_also">SEE ALSO</a></h1>
<p><a href="file://C|\msysgit\mingw\html/pod/perluniintro.html">the perluniintro manpage</a>, <a href="file://C|\msysgit\mingw\html/lib/encoding.html">the encoding manpage</a>, <a href="file://C|\msysgit\mingw\html/pod/perlrun.html">the perlrun manpage</a>, <a href="file://C|\msysgit\mingw\html/lib/bytes.html">the bytes manpage</a>, <a href="file://C|\msysgit\mingw\html/pod/perlunicode.html">the perlunicode manpage</a></p>
</body>
</html>