<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>utf8 - Perl pragma to enable/disable UTF-8 in source code</title>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<link rev="made" href="mailto:" />
</head>
<body style="background-color: white">
<table border="0" width="100%" cellspacing="0" cellpadding="3">
<tr><td class="block" style="background-color: #cccccc" valign="middle">
<big><strong><span class="block"> utf8 - Perl pragma to enable/disable UTF-8 in source code</span></strong></big>
</td></tr>
</table>
<p><a name="__index__"></a></p>
<ul>
<li><a href="#name">NAME</a></li>
<li><a href="#synopsis">SYNOPSIS</a></li>
<li><a href="#description">DESCRIPTION</a>
<ul>
<li><a href="#utility_functions">Utility functions</a></li>
</ul>
</li>
<li><a href="#bugs">BUGS</a></li>
<li><a href="#see_also">SEE ALSO</a></li>
</ul>
<h1><a name="name">NAME</a></h1>
<p>utf8 - Perl pragma to enable/disable UTF-8 (or UTF-EBCDIC) in source code</p>
<h1><a name="synopsis">SYNOPSIS</a></h1>
<pre>
    # Convert a Perl scalar to/from UTF-8.
    $num_octets = utf8::upgrade($string);
    $success    = utf8::downgrade($string[, FAIL_OK]);
</pre>
<pre>
    # Change the native bytes of a Perl scalar to/from UTF-8 bytes.
    utf8::encode($string);
    utf8::decode($string);
</pre>
<pre>
    $flag = utf8::is_utf8(STRING); # since Perl 5.8.1
    $flag = utf8::valid(STRING);
</pre>
<h1><a name="description">DESCRIPTION</a></h1>
<p>The <code>use utf8</code> pragma tells the Perl parser to allow UTF-8 in the
program text in the current lexical scope (or UTF-EBCDIC on EBCDIC based
platforms). The <code>no utf8</code> pragma tells Perl to switch back to treating
the source text as literal bytes in the current lexical scope.</p>
<p>This pragma is primarily a compatibility device. Perl versions
earlier than 5.6 allowed arbitrary bytes in source code, whereas
in the future we would like to standardize on the UTF-8 encoding for
source text.</p>
<p><strong>Do not use this pragma for anything other than telling Perl that your
script is written in UTF-8.</strong> The utility functions described below are
useful in their own right, but they are not really part of the
&quot;pragmatic&quot; effect.</p>
<p>Until UTF-8 becomes the default format for source text, either this
pragma or the <a href="file://C|\msysgit\mingw\html/lib/encoding.html">encoding</a> pragma should be used to recognize UTF-8
in the source. When UTF-8 becomes the standard source format, this
pragma will effectively become a no-op. For convenience, in what
follows the term <em>UTF-X</em> refers to UTF-8 on ASCII and ISO
Latin based platforms and to UTF-EBCDIC on EBCDIC based platforms.</p>
<p>See also the effects of the <code>-C</code> switch and its cousin, the
<code>$ENV{PERL_UNICODE}</code>, in <a href="file://C|\msysgit\mingw\html/pod/perlrun.html">the perlrun manpage</a>.</p>
<p>Enabling the <code>utf8</code> pragma has the following effect:</p>
<ul>
<li>
<p>Bytes in the source text that have their high bit set will be treated
as being part of a literal UTF-8 character. This includes most
literals such as identifier names, string constants, and constant
regular expression patterns.</p>
<p>On EBCDIC platforms, characters in the Latin 1 character set are
treated as being part of a literal UTF-EBCDIC character.</p>
</li>
</ul>
<p>Note that if your script contains bytes with the eighth bit set
(for example, Latin-1 embedded in your string literals), <code>use utf8</code>
will be unhappy, since the bytes are most probably not well-formed
UTF-8. If you want to have such bytes and still use utf8, you can disable
utf8 until the end of the block (or file, if at top level) with <code>no utf8;</code>.</p>
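<p>The lexical scoping described above can be sketched as follows. This is only
an illustration, and it assumes that the script file itself is saved as UTF-8 and
that the accented letter is stored as the single precomposed code point U+00E9;
the variable names are made up for the example.</p>
<pre>
    use strict;
    use warnings;

    my ($chars, $bytes);
    {
        use utf8;          # literals in this block are read as UTF-8 characters
        $chars = "é";      # one character, U+00E9
    }
    {
        no utf8;           # literals in this block are kept as the raw source bytes
        $bytes = "é";      # two bytes (0xC3 0xA9) when the file is saved as UTF-8
    }

    printf "use utf8: length %d\n", length $chars;
    printf "no utf8:  length %d\n", length $bytes;
</pre>
<p>Under those assumptions the first line should report a length of 1 and the
second a length of 2, because the same source bytes are read as one character
in the first block and as two separate bytes in the second.</p>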
<p>If you want to automatically upgrade your 8-bit legacy bytes to UTF-8,
use the <a href="file://C|\msysgit\mingw\html/lib/encoding.html">encoding</a> pragma instead of this pragma. For example, if
you want to implicitly upgrade your ISO 8859-1 (Latin-1) bytes to UTF-8
as used in e.g. <a href="file://C|\msysgit\mingw\html/pod/perlfunc.html#item_chr"><code>chr()</code></a> and <code>\x{...}</code>, try this:</p>
<pre>
    use encoding "latin-1";
    my $x = "\x{c5}";
</pre>
<p>In case you are wondering: yes, <code>use encoding 'utf8';</code> works much
the same as <code>use utf8;</code>.</p>
<h2><a name="utility_functions">Utility functions</a></h2>
<p>The following functions are defined in the <code>utf8::</code> package by the
Perl core. You do not need to say <code>use utf8</code> to use them, and in fact
you should not say that unless you really want to have UTF-8 source code.</p>
<li><strong><a name="item_upgrade">$num_octets = utf8::upgrade($string)</a></strong>
<p>Converts in-place the octet sequence in the native encoding
(Latin-1 or EBCDIC) to the equivalent character sequence in <em>UTF-X</em>.
Calling it on a <em>$string</em> that is already encoded as characters does no harm.
Returns the number of octets necessary to represent the string as <em>UTF-X</em>.
Can be used to make sure that the UTF-8 flag is on,
so that <code>\w</code> or <a href="file://C|\msysgit\mingw\html/pod/perlfunc.html#item_lc"><code>lc()</code></a> work as Unicode on strings
containing characters in the range 0x80-0xFF (on ASCII and
derivatives).</p>
<p><strong>Note that this function does not handle arbitrary encodings.</strong>
Therefore <em>Encode.pm</em> is recommended for general use.</p>
<p>Affected by the encoding pragma.</p>
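<p>A minimal sketch of <code>utf8::upgrade</code>, assuming an ASCII-based platform
and Perl 5.8.1 or later (for <code>utf8::is_utf8</code>). The variable names are
illustrative only.</p>
<pre>
    use strict;
    use warnings;

    my $string = "caf\xE9";                        # four Latin-1 octets, UTF-8 flag off
    my $before = utf8::is_utf8($string) ? 1 : 0;

    my $num_octets = utf8::upgrade($string);       # converts in place
    my $after = utf8::is_utf8($string) ? 1 : 0;

    # The string still holds four characters; only the internal form changed.
    printf "flag %d to %d, length %d, octets as UTF-8: %d\n",
        $before, $after, length($string), $num_octets;
</pre>
<p>Under those assumptions this should print
<code>flag 0 to 1, length 4, octets as UTF-8: 5</code>: the character count is
unchanged, while the internal representation needs two octets for the
<code>\xE9</code> character.</p>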
<li><strong><a name="item_downgrade">$success = utf8::downgrade($string[, FAIL_OK])</a></strong>
<p>Converts in-place the character sequence in <em>UTF-X</em>
to the equivalent octet sequence in the native encoding (Latin-1 or EBCDIC).
Calling it on a <em>$string</em> that is already encoded as octets does no harm.
Returns true on success. On failure it dies or, if the value of
<code>FAIL_OK</code> is true, returns false.
Can be used to make sure that the UTF-8 flag is off,
e.g. when you want to make sure that the <a href="file://C|\msysgit\mingw\html/pod/perlvar.html#item_substr"><code>substr()</code></a> or <a href="file://C|\msysgit\mingw\html/pod/perlfunc.html#item_length"><code>length()</code></a> function
works with the usually faster byte algorithm.</p>
<p><strong>Note that this function does not handle arbitrary encodings.</strong>
Therefore <em>Encode.pm</em> is recommended for general use.</p>
<p><strong>Not</strong> affected by the encoding pragma.</p>
<p><strong>NOTE:</strong> this function is experimental and may change
or be removed without notice.</p>
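<p>The success and failure cases of <code>utf8::downgrade</code> might look like
this. It is a sketch assuming an ASCII-based platform; the variable names are
illustrative only.</p>
<pre>
    use strict;
    use warnings;

    my $latin = "caf\xE9";
    utf8::upgrade($latin);                        # flag on, every character still fits in one octet
    my $ok = utf8::downgrade($latin) ? 1 : 0;     # succeeds and turns the flag off again

    my $wide = "\x{263A}";                        # code point above 0xFF, no native octet form
    my $soft = utf8::downgrade($wide, 1) ? 1 : 0;            # FAIL_OK true: returns false
    my $died = eval { utf8::downgrade($wide); 1 } ? 0 : 1;   # FAIL_OK omitted: dies

    print "latin ok=$ok, wide with FAIL_OK=$soft, wide died=$died\n";
</pre>
<p>This should print <code>latin ok=1, wide with FAIL_OK=0, wide died=1</code>.
Wrapping the call in <code>eval</code> is one way to handle the failure when
<code>FAIL_OK</code> is not passed.</p>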
<li><strong><a name="item_encode">utf8::encode($string)</a></strong>
<p>Converts in-place the character sequence to the corresponding octet sequence
in <em>UTF-X</em>. The UTF-8 flag is turned off. Returns nothing.</p>
<p><strong>Note that this function does not handle arbitrary encodings.</strong>
Therefore <em>Encode.pm</em> is recommended for general use.</p>
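<p>As a rough illustration of <code>utf8::encode</code>, assuming an ASCII-based
platform (where <em>UTF-X</em> is UTF-8) and Perl 5.8.1 or later; the variable
names are illustrative only.</p>
<pre>
    use strict;
    use warnings;

    my $string = "\x{263A}";       # one character
    utf8::encode($string);         # in place: now the octets of its UTF-8 encoding

    # Show each octet in hexadecimal.
    my $hex = join ' ', map { sprintf '%02X', ord } split //, $string;
    printf "length %d, octets %s, flag %d\n",
        length($string), $hex, utf8::is_utf8($string) ? 1 : 0;
</pre>
<p>This should print <code>length 3, octets E2 98 BA, flag 0</code>: after the
call, <code>length()</code> counts octets rather than characters.</p>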
<li><strong><a name="item_decode">utf8::decode($string)</a></strong>
<p>Attempts to convert in-place the octet sequence in <em>UTF-X</em>
to the corresponding character sequence. The UTF-8 flag is turned on
only if the source string contains multiple-byte <em>UTF-X</em> characters.
If <em>$string</em> is invalid as <em>UTF-X</em>, returns false; otherwise returns true.</p>
<p><strong>Note that this function does not handle arbitrary encodings.</strong>
Therefore <em>Encode.pm</em> is recommended for general use.</p>
<p><strong>NOTE:</strong> this function is experimental and may change
or be removed without notice.</p>
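<p>The three cases described for <code>utf8::decode</code> can be sketched like
this, assuming an ASCII-based platform and Perl 5.8.1 or later. The variable
names are illustrative only.</p>
<pre>
    use strict;
    use warnings;

    my $octets = "\xE2\x98\xBA";                    # UTF-8 encoding of U+263A
    my $ok = utf8::decode($octets) ? 1 : 0;         # decoded in place
    printf "ok=%d length=%d first=U+%04X flag=%d\n",
        $ok, length($octets), ord($octets), utf8::is_utf8($octets) ? 1 : 0;

    my $ascii = "plain";                            # no multiple-byte characters
    utf8::decode($ascii);
    printf "ascii flag=%d\n", utf8::is_utf8($ascii) ? 1 : 0;

    my $bad = "\xE2\x98";                           # truncated sequence, not valid UTF-8
    my $bad_ok = utf8::decode($bad) ? 1 : 0;
    printf "bad ok=%d length=%d\n", $bad_ok, length($bad);
</pre>
<p>This should print <code>ok=1 length=1 first=U+263A flag=1</code>, then
<code>ascii flag=0</code>, then <code>bad ok=0 length=2</code>. Checking the
return value is the only way to notice the invalid input, since the string is
left as it was.</p>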
<li><strong><a name="item_is_utf8">$flag = utf8::is_utf8(STRING)</a></strong>
<p>(Since Perl 5.8.1) Tests whether STRING is held internally as UTF-8, that is,
whether its UTF-8 flag is on. Functionally
the same as Encode::is_utf8().</p>
<li><strong><a name="item_valid">$flag = utf8::valid(STRING)</a></strong>
<p>[INTERNAL] Tests whether STRING is in a consistent state regarding
UTF-8. Returns true if STRING is well-formed UTF-8 and has the UTF-8 flag
on <strong>or</strong> if STRING is held as bytes (both these states are 'consistent').
The main reason for this routine is to allow Perl's test suite to check
that operations have left strings in a consistent state. You most
probably want to use utf8::is_utf8() instead.</p>
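<p>To make the difference between the two tests concrete, here is a small
sketch, assuming an ASCII-based platform and Perl 5.8.1 or later; the variable
names are illustrative only.</p>
<pre>
    use strict;
    use warnings;

    my $bytes = "caf\xE9";        # held as native octets, UTF-8 flag off
    my $chars = "caf\x{263A}";    # contains a code point above 0xFF, so the flag is on

    for my $pair (['bytes', $bytes], ['chars', $chars]) {
        my ($label, $string) = @$pair;
        printf "%s: is_utf8=%d valid=%d\n", $label,
            utf8::is_utf8($string) ? 1 : 0,
            utf8::valid($string)   ? 1 : 0;
    }
</pre>
<p>This should print <code>bytes: is_utf8=0 valid=1</code> and
<code>chars: is_utf8=1 valid=1</code>. Both strings are in a consistent state,
but only the second one has the flag on.</p>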
<p><code>utf8::encode</code> is like <code>utf8::upgrade</code>, but the UTF8 flag is
cleared. See <a href="file://C|\msysgit\mingw\html/pod/perlunicode.html">the perlunicode manpage</a> for more on the UTF8 flag and the C API
functions <code>sv_utf8_upgrade</code>, <code>sv_utf8_downgrade</code>, <code>sv_utf8_encode</code>,
and <code>sv_utf8_decode</code>, which are wrapped by the Perl functions
<code>utf8::upgrade</code>, <code>utf8::downgrade</code>, <code>utf8::encode</code> and
<code>utf8::decode</code>. Note that in the Perl 5.8.0 and 5.8.1 implementation
the functions utf8::is_utf8, utf8::valid, utf8::encode, utf8::decode,
utf8::upgrade, and utf8::downgrade are always available, without a
<code>require utf8</code> statement; this may change in future releases.</p>
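<p>The contrast between <code>utf8::upgrade</code> and <code>utf8::encode</code>
can be seen by applying both to the same input. This is a sketch assuming an
ASCII-based platform and Perl 5.8.1 or later; the variable names are
illustrative only.</p>
<pre>
    use strict;
    use warnings;

    my $upgraded = "caf\xE9";
    my $encoded  = "caf\xE9";

    utf8::upgrade($upgraded);     # same four characters, UTF8 flag on
    utf8::encode($encoded);       # five octets (c a f C3 A9), UTF8 flag off

    printf "upgrade: length %d flag %d\n",
        length($upgraded), utf8::is_utf8($upgraded) ? 1 : 0;
    printf "encode:  length %d flag %d\n",
        length($encoded), utf8::is_utf8($encoded) ? 1 : 0;
</pre>
<p>This should print <code>upgrade: length 4 flag 1</code> and
<code>encode:  length 5 flag 0</code>. The upgraded string still compares equal
to the original, whereas the encoded one is a different, longer string of
octets.</p>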
<h1><a name="bugs">BUGS</a></h1>
<p>One can have Unicode in identifier names, but not in package/class or
subroutine names. While some limited functionality towards this does
exist as of Perl 5.8.0, that is more accidental than designed; use of
Unicode for the said purposes is unsupported.</p>
<p>One reason this is unfinished is its (currently) inherent
lack of portability: since both package names and subroutine names may need
to be mapped to file and directory names, the Unicode capability of
the filesystem becomes important, and unfortunately there are no
portable answers.</p>
<h1><a name="see_also">SEE ALSO</a></h1>
<p><a href="file://C|\msysgit\mingw\html/pod/perluniintro.html">the perluniintro manpage</a>,
<a href="file://C|\msysgit\mingw\html/lib/encoding.html">the encoding manpage</a>,
<a href="file://C|\msysgit\mingw\html/pod/perlrun.html">the perlrun manpage</a>,
<a href="file://C|\msysgit\mingw\html/lib/bytes.html">the bytes manpage</a>,
<a href="file://C|\msysgit\mingw\html/pod/perlunicode.html">the perlunicode manpage</a></p>