<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>encoding - allows you to write your script in non-ascii or non-utf8</title>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<link rev="made" href="mailto:" />
</head>
<body style="background-color: white">
<table border="0" width="100%" cellspacing="0" cellpadding="3">
<tr><td class="block" style="background-color: #cccccc" valign="middle">
<big><strong><span class="block">encoding - allows you to write your script in non-ascii or non-utf8</span></strong></big>
</td></tr>
</table>
<p><a name="__index__"></a></p>
<ul>
<li><a href="#name">NAME</a></li>
<li><a href="#synopsis">SYNOPSIS</a></li>
<li><a href="#abstract">ABSTRACT</a>
<ul>
<li><a href="#literal_conversions">Literal Conversions</a></li>
<li><a href="#perlio_layers_for_std_in_out_">PerlIO layers for <code>STD(IN|OUT)</code></a></li>
<li><a href="#implicit_upgrading_for_byte_strings">Implicit upgrading for byte strings</a></li>
</ul>
</li>
<li><a href="#features_that_require_5_8_1">FEATURES THAT REQUIRE 5.8.1</a></li>
<li><a href="#usage">USAGE</a></li>
<li><a href="#the_filter_option">The Filter Option</a>
<ul>
<li><a href="#filterrelated_changes_at_encode_version_1_87">Filter-related changes at Encode version 1.87</a></li>
</ul>
</li>
<li><a href="#caveats">CAVEATS</a>
<ul>
<li><a href="#not_scoped">NOT SCOPED</a></li>
<li><a href="#do_not_mix_multiple_encodings">DO NOT MIX MULTIPLE ENCODINGS</a></li>
<li><a href="#tr____with_ranges">tr/// with ranges</a>
<ul>
<li><a href="#workaround_to_tr____">workaround to tr///;</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#example__greekperl">EXAMPLE - Greekperl</a></li>
<li><a href="#known_problems">KNOWN PROBLEMS</a>
<ul>
<li><a href="#the_logic_of__locale">The Logic of :locale</a></li>
</ul>
</li>
<li><a href="#history">HISTORY</a></li>
<li><a href="#see_also">SEE ALSO</a></li>
</ul>
<h1><a name="name">NAME</a></h1>
<p>encoding - allows you to write your script in non-ascii or non-utf8</p>
<h1><a name="synopsis">SYNOPSIS</a></h1>
<pre>
  use encoding "greek";  # Perl like Greek to you?
  use encoding "euc-jp"; # Jperl!</pre>
<pre>
  # or you can even do this if your shell supports your native encoding</pre>
<pre>
  perl -Mencoding=latin2 -e '...' # Feeling centrally European?
  perl -Mencoding=euc-kr -e '...' # Or Korean?</pre>
<pre>
  # A simple euc-cn =&gt; utf-8 converter
  use encoding "euc-cn", STDOUT =&gt; "utf8";  while(&lt;&gt;){print};</pre>
<pre>
  # "no encoding;" supported (but not scoped!)</pre>
<pre>
  # an alternate way, Filter
  use encoding "euc-jp", Filter=&gt;1;
  # now you can use kanji identifiers -- in euc-jp!</pre>
<pre>
  # note that unless you have complete control over every environment
  # in which the application will ever run, you should NOT use the
  # pragma's ability to let you write your script in any recognized
  # encoding, because changing locale settings will wreck the script;
  # you can of course still use the other features of the pragma.
  use encoding ':locale';</pre>
<h1><a name="abstract">ABSTRACT</a></h1>
<p>Let's start with a bit of history: Perl 5.6.0 introduced Unicode
support. You could apply <a href="file://C|\msysgit\mingw\html/pod/perlvar.html#item_substr"><code>substr()</code></a> and regexes even to complex CJK
characters -- so long as the script was written in UTF-8. But back
then, text editors that supported UTF-8 were still rare, and many users
instead chose to write scripts in legacy encodings, giving up a whole
new feature of Perl 5.6.</p>
<p>Rewind to the future: starting from perl 5.8.0 with the <strong>encoding</strong>
pragma, you can write your script in any encoding you like (so long
as the <code>Encode</code> module supports it) and still enjoy Unicode support.
This pragma achieves that by doing the following:</p>
<ul>
<li>
<p>Internally converts all literals (<a href="file://C|\msysgit\mingw\html/pod/perlfunc.html#item_q_"><code>q//,qq//,qr//,qw///, qx//</code></a>) from
the encoding specified to utf8. In Perl 5.8.1 and later, literals in
<a href="#item_tr_"><code>tr///</code></a> and the <code>DATA</code> pseudo-filehandle are also converted.</p>
</li>
<li>
<p>Changes the PerlIO layers of <code>STDIN</code> and <code>STDOUT</code> to the encoding
specified.</p>
</li>
</ul>
<h2><a name="literal_conversions">Literal Conversions</a></h2>
<p>You can write code in EUC-JP as follows:</p>
<pre>
  my $Rakuda = "\xF1\xD1\xF1\xCC"; # Camel in Kanji
               #&lt;-char-&gt;&lt;-char-&gt;   # 4 octets
  s/\bCamel\b/$Rakuda/;</pre>
<p>And with <code>use encoding "euc-jp"</code> in effect, it is the same thing as
the code in UTF-8:</p>
<pre>
  my $Rakuda = "\x{99F1}\x{99DD}"; # two Unicode Characters
  s/\bCamel\b/$Rakuda/;</pre>
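<p>The conversion applied to such a literal can be approximated by hand with
the <code>Encode</code> module. The following is only a sketch of the idea, not
the pragma's implementation; <code>decode_literal</code> is a hypothetical helper
name introduced for illustration.</p>

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode octets written in a legacy encoding into a
# Perl character string, roughly what the pragma does to a literal.
sub decode_literal {
    my ($encoding, $octets) = @_;
    return decode($encoding, $octets);
}

# The four EUC-JP octets from the example above.
my $rakuda = decode_literal('euc-jp', "\xF1\xD1\xF1\xCC");

# Report how many characters resulted and their code points.
printf "%d characters: %s\n", length($rakuda),
    join ' ', map { sprintf 'U+%04X', ord } split //, $rakuda;
```

<p>Unlike the pragma, this explicit form affects only the strings you pass to
it, which makes it easier to reason about in larger programs.</p>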
<h2><a name="perlio_layers_for_std_in_out_">PerlIO layers for <code>STD(IN|OUT)</code></a></h2>
<p>The <strong>encoding</strong> pragma also sets the filehandle layers of
STDIN and STDOUT to the specified encoding. Therefore,</p>
<pre>
  use encoding "euc-jp";
  my $message = "Camel is the symbol of perl.\n";
  my $Rakuda = "\xF1\xD1\xF1\xCC"; # Camel in Kanji
  $message =~ s/\bCamel\b/$Rakuda/;
  print $message;</pre>
<p>will print ``\xF1\xD1\xF1\xCC is the symbol of perl.\n'',
not ``\x{99F1}\x{99DD} is the symbol of perl.\n''.</p>
<p>You can override this by giving extra arguments; see below.</p>
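<p>The effect of such an output layer can be observed without the pragma by
writing through an <code>:encoding(...)</code> layer into an in-memory
filehandle. This is a minimal sketch; <code>octets_through_layer</code> is a
hypothetical helper, and for a real STDOUT the corresponding call would be
<code>binmode(STDOUT, ':encoding(euc-jp)')</code>.</p>

```perl
use strict;
use warnings;

# Hypothetical helper: push a character string through an
# :encoding(...) layer and return the octets that come out.
sub octets_through_layer {
    my ($encoding, $text) = @_;
    my $buffer = '';
    open(my $out, ">:encoding($encoding)", \$buffer)
        or die "cannot open in-memory handle: $!";
    print {$out} $text;
    close($out) or die "cannot close in-memory handle: $!";
    return $buffer;
}

# Two Unicode characters go in; EUC-JP octets come out.
my $octets = octets_through_layer('euc-jp', "\x{99F1}\x{99DD}");
print join(' ', map { sprintf '%02X', ord } split //, $octets), "\n";
```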
<h2><a name="implicit_upgrading_for_byte_strings">Implicit upgrading for byte strings</a></h2>
<p>By default, if strings operating under byte semantics and strings
with Unicode character data are concatenated, the new string will
be created by decoding the byte strings as <em>ISO 8859-1 (Latin-1)</em>.</p>
<p>The <strong>encoding</strong> pragma changes this to use the specified encoding
instead. For example:</p>
<pre>
  use encoding 'utf8';
  my $string = chr(20000); # a Unicode string
  utf8::encode($string);   # now it's a UTF-8 encoded byte string
  # concatenate with another Unicode string
  print length($string . chr(20000));</pre>
<p>This will print <code>2</code>, because <code>$string</code> is upgraded as UTF-8. Without
<code>use encoding 'utf8';</code>, it will print <code>4</code> instead, since <code>$string</code>
is three octets when interpreted as Latin-1.</p>
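<p>The two lengths can be reproduced without the pragma by comparing the
default upgrade with an explicit decode. This is an illustrative sketch
only; both subroutine names are hypothetical.</p>

```perl
use strict;
use warnings;

# Default behaviour: the three UTF-8 octets are each treated as one
# Latin-1 character when concatenated with a Unicode string.
sub length_with_default_upgrade {
    my $octets = chr(20000);
    utf8::encode($octets);               # now three octets
    return length($octets . chr(20000));
}

# Explicit decode: turn the octets back into one character first,
# which mirrors what "use encoding 'utf8'" arranges implicitly.
sub length_with_explicit_decode {
    my $octets = chr(20000);
    utf8::encode($octets);
    utf8::decode($octets) or die "not valid UTF-8";
    return length($octets . chr(20000));
}

print length_with_default_upgrade(), "\n";   # 4
print length_with_explicit_decode(), "\n";   # 2
```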
<h1><a name="features_that_require_5_8_1">FEATURES THAT REQUIRE 5.8.1</a></h1>
<p>Some of the features offered by this pragma require perl 5.8.1. Most
of these were implemented by Inaba Hiroto. Any other features and changes
are good for 5.8.0.</p>
<dl>
<dt><strong><a name="item__22non_2deuc_22_doublebyte_encodings">``NON-EUC'' doublebyte encodings</a></strong></dt>
<dd>
<p>Because perl needs to parse the script before applying this pragma,
encodings such as Shift_JIS and Big-5, which may contain '\' (BACKSLASH;
\x5c) in the second byte, fail because the second byte may
accidentally escape the quoting character that follows. Perl 5.8.1
or later fixes this problem.</p>
</dd>
<dt><strong><a name="item_tr_">tr//</a></strong></dt>
<dd>
<p><a href="#item_tr_"><code>tr//</code></a> was overlooked by Perl 5 porters when they released perl 5.8.0.
See the section below for details.</p>
</dd>
<dt><strong><a name="item_data_pseudo_2dfilehandle">DATA pseudo-filehandle</a></strong></dt>
<dd>
<p>Another feature that was overlooked was <code>DATA</code>.</p>
</dd>
</dl>
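<p>The backslash problem described for the non-EUC doublebyte encodings can be
seen by inspecting the octets of an encoded character. The sketch below
assumes the <code>shiftjis</code> encoding shipped with <code>Encode</code>;
<code>ends_in_backslash_octet</code> is a hypothetical helper.</p>

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: true if the Shift_JIS form of a character ends
# in 0x5C, the octet that ASCII uses for the backslash.
sub ends_in_backslash_octet {
    my ($char) = @_;
    my $octets = encode('shiftjis', $char);
    return substr($octets, -1) eq "\x5C" ? 1 : 0;
}

# U+8868 is a common kanji whose Shift_JIS form is 0x95 0x5C.
printf "U+8868: %d\n", ends_in_backslash_octet("\x{8868}");

# U+3042 (HIRAGANA LETTER A) is 0x82 0xA0, so it is unaffected.
printf "U+3042: %d\n", ends_in_backslash_octet("\x{3042}");
```

<p>A character of the first kind placed just before a closing quote is what
confused the parser in perl 5.8.0.</p>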
<h1><a name="usage">USAGE</a></h1>
<dl>
<dt><strong><a name="item_use_encoding__5bencname_5d__3b">use encoding [<em>ENCNAME</em>] ;</a></strong></dt>
<dd>
<p>Sets the script encoding to <em>ENCNAME</em>. Unless ${^UNICODE}
exists and is non-zero, the PerlIO layers of STDIN and STDOUT are set to
``:encoding(<em>ENCNAME</em>)''.</p>
<p>Note that STDERR WILL NOT be changed.</p>
<p>Also note that non-STD file handles remain unaffected. Use <code>use
open</code> or <a href="file://C|\msysgit\mingw\html/pod/perlfunc.html#item_binmode"><code>binmode</code></a> to change the layers of those.</p>
<p>If no encoding is specified, the environment variable <em>PERL_ENCODING</em>
is consulted. If no encoding can be found, the error <code>Unknown encoding
'<em>ENCNAME</em>'</code> will be thrown.</p>
</dd>
<dt><strong><a name="item_use_encoding_encname__5b_stdin__3d_3e_encname_in__">use encoding <em>ENCNAME</em> [ STDIN =&gt; <em>ENCNAME_IN</em> ...] ;</a></strong></dt>
<dd>
<p>You can also individually set the encodings of STDIN and STDOUT via the
<code>STDIN =&gt; ENCNAME</code> form. In this case, you cannot omit the
first <em>ENCNAME</em>. <code>STDIN =&gt; undef</code> turns the IO transcoding
off.</p>
<p>When ${^UNICODE} exists and is non-zero, these options will be completely
ignored. ${^UNICODE} is a variable introduced in perl 5.8.1. See
<a href="file://C|\msysgit\mingw\html/pod/perlrun.html">the perlrun manpage</a>, <a href="file://C|\msysgit\mingw\html/pod/perlvar.html#item____unicode_">${^UNICODE} in the perlvar manpage</a> and <a href="file://C|\msysgit\mingw\html/pod/perlrun.html#c">-C in the perlrun manpage</a> for
details (perl 5.8.1 and later).</p>
</dd>
<dt><strong><a name="item_use_encoding_encname_filter_3d_3e1_3b">use encoding <em>ENCNAME</em> Filter=&gt;1;</a></strong></dt>
<dd>
<p>This turns the encoding pragma into a source filter. While the
default approach just decodes interpolated literals (in <code>qq()</code> and
<code>qr()</code>), this will apply a source filter to the entire source code. See
<a href="#the_filter_option">The Filter Option</a> below for details.</p>
</dd>
<dt><strong><a name="item_no_encoding_3b">no encoding;</a></strong></dt>
<dd>
<p>Unsets the script encoding. The layers of STDIN and STDOUT are
reset to ``:raw'' (the default unprocessed raw stream of bytes).</p>
</dd>
</dl>
<h1><a name="the_filter_option">The Filter Option</a></h1>
<p>The magic of <code>use encoding</code> is not applied to the names of
identifiers. In order to make <code>${"\x{4eba}"}++</code> ($human++, where human
is a single Han ideograph) work, you still need to write your script
in UTF-8 -- or use a source filter. That's what 'Filter=&gt;1' does.</p>
<p>What does this mean? Your source code behaves as if it were written in
UTF-8 with 'use utf8' in effect. So even if your editor only supports
Shift_JIS, for example, you can still try the examples in Chapter 15 of
<code>Programming Perl, 3rd Ed.</code>. For instance, you can use UTF-8
identifiers.</p>
<p>This option is significantly slower, and (as of this writing) non-ASCII
identifiers are not very stable WITHOUT this option and with the
source code written in UTF-8.</p>
<h2><a name="filterrelated_changes_at_encode_version_1_87">Filter-related changes at Encode version 1.87</a></h2>
<ul>
<li>
<p>The Filter option now sets STDIN and STDOUT like the non-filter options.
<code>STDIN=&gt;ENCODING</code> and <code>STDOUT=&gt;ENCODING</code> also work as in the
non-filter version.</p>
</li>
<li>
<p><code>use utf8</code> is implicitly declared, so you no longer have to <code>use
utf8</code> to <code>${"\x{4eba}"}++</code>.</p>
</li>
</ul>
<h1><a name="caveats">CAVEATS</a></h1>
<h2><a name="not_scoped">NOT SCOPED</a></h2>
<p>The pragma is per script, not lexically scoped per block. Only the last
<code>use encoding</code> or <code>no encoding</code> matters, and it affects
<strong>the whole script</strong>. However, the <strong>no encoding</strong> pragma is supported and
<strong>use encoding</strong> can appear as many times as you want in a given script.
Using this pragma more than once is discouraged.</p>
<p>For the same reason, using this pragma inside modules is also
discouraged (though not as strongly as in the case above).</p>
<p>If you still have to write a module with this pragma, be very careful
about the load order. See the code below:</p>
<pre>
  # called module
  package Module_IN_BAR;
  use encoding "bar";
  # stuff in "bar" encoding here
  1;</pre>
<pre>
  # caller script
  use encoding "foo";
  use Module_IN_BAR;
  # surprise! use encoding "bar" is in effect.</pre>
<p>The best way to avoid this oddity is to use this pragma RIGHT AFTER
other modules are loaded, i.e.</p>
<pre>
  use Module_IN_BAR;
  use encoding "foo";</pre>
<h2><a name="do_not_mix_multiple_encodings">DO NOT MIX MULTIPLE ENCODINGS</a></h2>
<p>Notice that only literals (string or regular expression) having only
legacy code points are affected: if you mix data like
<code>\xDF\x{100}</code>, the data is assumed to be in (Latin 1 and) Unicode,
not in your native encoding. In other words, this will match in ``greek'':</p>
<pre>
  "\xDF" =~ /\x{3af}/</pre>
<p>but this will not</p>
<pre>
  "\xDF\x{100}" =~ /\x{3af}\x{100}/</pre>
<p>since the <code>\xDF</code> (ISO 8859-7 GREEK SMALL LETTER IOTA WITH TONOS) on
the left will <strong>not</strong> be upgraded to <code>\x{3af}</code> (Unicode GREEK SMALL
LETTER IOTA WITH TONOS) because of the <code>\x{100}</code> on the left. You
should not be mixing your legacy data and Unicode in the same string.</p>
<p>This pragma also affects the encoding of the 0x80..0xFF code point range:
normally characters in that range are left as eight-bit bytes (unless
they are combined with characters with code points 0x100 or larger,
in which case all characters need to become UTF-8 encoded), but if
the <code>encoding</code> pragma is present, even the 0x80..0xFF range always
gets UTF-8 encoded.</p>
<p>After all, the best thing about this pragma is that you don't have to
resort to \x{....} just to spell your name in a native encoding.
So feel free to put your strings in your encoding in quotes and
regexes.</p>
<h2><a name="tr____with_ranges">tr/// with ranges</a></h2>
<p>The <strong>encoding</strong> pragma works by decoding string literals in
<a href="file://C|\msysgit\mingw\html/pod/perlfunc.html#item_q_"><code>q//,qq//,qr//,qw///, qx//</code></a> and so forth. In perl 5.8.0, this
does not apply to <a href="#item_tr_"><code>tr///</code></a>. Therefore,</p>
<pre>
  use encoding 'euc-jp';
  $kana =~ tr/\xA4\xA1-\xA4\xF3/\xA5\xA1-\xA5\xF3/;
  #           -------- -------- -------- --------</pre>
<p>does not work like</p>
<pre>
  $kana =~ tr/\x{3041}-\x{3093}/\x{30a1}-\x{30f3}/;</pre>
<dl>
<dt><strong><a name="item_legend_of_characters_above">Legend of characters above</a></strong></dt>
<dd>
<pre>
  utf8     euc-jp   charnames::viacode()
  -----------------------------------------
  \x{3041} \xA4\xA1 HIRAGANA LETTER SMALL A
  \x{3093} \xA4\xF3 HIRAGANA LETTER N
  \x{30a1} \xA5\xA1 KATAKANA LETTER SMALL A
  \x{30f3} \xA5\xF3 KATAKANA LETTER N</pre>
</dd>
</dl>
<p>This counterintuitive behavior has been fixed in perl 5.8.1.</p>
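<p>Independently of the pragma, the character-range form can be applied to
EUC-JP data by decoding it explicitly first. This is a sketch under that
assumption; <code>hiragana_to_katakana</code> is a hypothetical helper built on
<code>Encode</code>.</p>

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: convert hiragana to katakana in EUC-JP octets by
# decoding first, so tr/// sees character ranges, not octet ranges.
sub hiragana_to_katakana {
    my ($octets) = @_;
    my $kana = decode('euc-jp', $octets);
    $kana =~ tr/\x{3041}-\x{3093}/\x{30a1}-\x{30f3}/;
    return encode('euc-jp', $kana);
}

# The two end points of the hiragana range from the legend above.
my $katakana = hiragana_to_katakana("\xA4\xA1\xA4\xF3");
print join(' ', map { sprintf '%02X', ord } split //, $katakana), "\n";
```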
<h3><a name="workaround_to_tr____">workaround to tr///;</a></h3>
<p>In perl 5.8.0, you can work around this as follows:</p>
<pre>
  use encoding 'euc-jp';
  eval qq{ \$kana =~ tr/\xA4\xA1-\xA4\xF3/\xA5\xA1-\xA5\xF3/ };</pre>
<p>Note that the <a href="#item_tr_"><code>tr//</code></a> expression is surrounded by <code>qq{}</code>. The idea
is the same as the classic idiom that makes <a href="#item_tr_"><code>tr///</code></a> 'interpolate'.</p>
<pre>
  tr/$from/$to/;            # wrong!
  eval qq{ tr/$from/$to/ }; # workaround.</pre>
<p>Nevertheless, under the <strong>encoding</strong> pragma even <a href="file://C|\msysgit\mingw\html/pod/perlfunc.html#item_q_"><code>q//</code></a> is affected, so
<a href="#item_tr_"><code>tr///</code></a> not being decoded was obviously against the will of Perl5
Porters, and it has been fixed in Perl 5.8.1 and later.</p>
<h1><a name="example__greekperl">EXAMPLE - Greekperl</a></h1>
<pre>
  use encoding "iso 8859-7";</pre>
<pre>
  # \xDF in ISO 8859-7 (Greek) is \x{3af} in Unicode.</pre>
<pre>
  $a = "\xDF";
  $b = "\x{100}";</pre>
<pre>
  printf "%#x\n", ord($a); # will print 0x3af, not 0xdf</pre>
<pre>
  $c = $a . $b;</pre>
<pre>
  # $c will be "\x{3af}\x{100}", not "\x{df}\x{100}".</pre>
<pre>
  # chr() is affected, and ...</pre>
<pre>
  print "mega\n"  if ord(chr(0xdf)) == 0x3af;</pre>
<pre>
  # ... ord() is affected by the encoding pragma ...</pre>
<pre>
  print "tera\n" if ord(pack("C", 0xdf)) == 0x3af;</pre>
<pre>
  # ... as are eq and cmp ...</pre>
<pre>
  print "peta\n" if "\x{3af}" eq pack("C", 0xdf);
  print "exa\n"  if ("\x{3af}" cmp pack("C", 0xdf)) == 0;</pre>
<pre>
  # ... but pack/unpack C are not affected, in case you still
  # want to go back to your native encoding</pre>
<pre>
  print "zetta\n" if unpack("C", (pack("C", 0xdf))) == 0xdf;</pre>
<h1><a name="known_problems">KNOWN PROBLEMS</a></h1>
<dl>
<dt><strong><a name="item_literals_in_regex_that_are_longer_than_127_bytes">literals in regex that are longer than 127 bytes</a></strong></dt>
<dd>
<p>For native multibyte encodings (either fixed or variable length),
the current implementation of regular expressions may introduce
recoding errors for regular expression literals longer than 127 bytes.</p>
</dd>
<dt><strong><a name="item_ebcdic">EBCDIC</a></strong></dt>
<dd>
<p>The encoding pragma is not supported on EBCDIC platforms.
(Porters who are willing and able to remove this limitation are
welcome.)</p>
</dd>
<dt><strong><a name="item_format">format</a></strong></dt>
<dd>
<p>This pragma doesn't work well with format because PerlIO does not
get along very well with it. When a format contains non-ascii
characters, it prints funny or gets ``wide character warnings''.
To understand it, try the code below.</p>
<pre>
  # Save this one in utf8
  # replace *non-ascii* with a non-ascii string
  my $camel;
  format STDOUT =
  *non-ascii*@&gt;&gt;&gt;&gt;&gt;&gt;&gt;
  $camel
  .
  $camel = "*non-ascii*";
  binmode(STDOUT=&gt;':encoding(utf8)'); # bang!
  write;              # funny
  print $camel, "\n"; # fine</pre>
<p>Without binmode, write() happens to work, but then
<a href="file://C|\msysgit\mingw\html/pod/perlfunc.html#item_print"><code>print()</code></a>
fails instead of write().</p>
<p>At any rate, the very use of format is questionable when it comes to
unicode characters, since you have to consider such things as character
width (e.g. double-width for ideographs) and direction (e.g. BIDI for
Arabic and Hebrew).</p>
</dd>
</dl>
<h2><a name="the_logic_of__locale">The Logic of :locale</a></h2>
<p>The logic of <code>:locale</code> is as follows:</p>
<ol>
<li>
<p>If the platform supports the <code>langinfo(CODESET)</code> interface, the codeset
returned is used as the default encoding for the open pragma.</p>
</li>
<li>
<p>If 1. didn't work but we are under the locale pragma, the environment
variables LC_ALL and LANG (in that order) are matched for encodings
(the part after <code>.</code>, if any), and if any is found, that is used
as the default encoding for the open pragma.</p>
</li>
<li>
<p>If 1. and 2. didn't work, the environment variables LC_ALL and LANG
(in that order) are matched for anything looking like UTF-8, and if
any is found, <code>:utf8</code> is used as the default encoding for the open
pragma.</p>
</li>
</ol>
<p>If your locale environment variables (LC_ALL, LC_CTYPE, LANG)
contain the strings 'UTF-8' or 'UTF8' (case-insensitive matching),
the default encoding of your STDIN, STDOUT, and STDERR, and of
<strong>any subsequent file open</strong>, is UTF-8.</p>
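<p>Steps 2 and 3 of this logic can be sketched as two small lookups over the
environment. The code below is an illustration, not the pragma's own
implementation: both subroutine names are hypothetical, the environment is
passed in as a hash reference so the example is repeatable, and the
"under the locale pragma" condition of step 2 is ignored.</p>

```perl
use strict;
use warnings;

# Hypothetical helper for step 2: return the codeset part (the text
# after ".") of the first of LC_ALL and LANG that has one.
sub codeset_from_env {
    my ($env) = @_;
    for my $name (qw(LC_ALL LANG)) {
        my $value = $env->{$name};
        next unless defined $value;
        return $1 if $value =~ /\.([^.\@]+)/;
    }
    return undef;
}

# Hypothetical helper for step 3: does LC_ALL or LANG contain anything
# that looks like UTF-8?
sub looks_like_utf8 {
    my ($env) = @_;
    for my $name (qw(LC_ALL LANG)) {
        my $value = $env->{$name};
        return 1 if defined $value && $value =~ /utf-?8/i;
    }
    return 0;
}

print codeset_from_env({ LC_ALL => 'ja_JP.eucJP', LANG => 'C' }), "\n";
print looks_like_utf8({ LANG => 'en_US.UTF8' }), "\n";
```

<p>Where it is available, step 1 corresponds to calling
<code>langinfo(CODESET)</code> from the <code>I18N::Langinfo</code> module.</p>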
<h1><a name="history">HISTORY</a></h1>
<p>This pragma first appeared in Perl 5.8.0. For features that require
5.8.1 or later, see above.</p>
<p>The <code>:locale</code> subpragma was implemented in version 2.01, or Perl 5.8.6.</p>
<h1><a name="see_also">SEE ALSO</a></h1>
<p><a href="file://C|\msysgit\mingw\html/pod/perlunicode.html">the perlunicode manpage</a>,
<a href="file://C|\msysgit\mingw\html/lib/Encode.html">the Encode manpage</a>,
<a href="file://C|\msysgit\mingw\html/lib/open.html">the open manpage</a>,
<a href="file://C|\msysgit\mingw\html/lib/Filter/Util/Call.html">the Filter::Util::Call manpage</a></p>
<p>Ch. 15 of <code>Programming Perl (3rd Edition)</code>
by Larry Wall, Tom Christiansen, Jon Orwant;
O'Reilly &amp; Associates; ISBN 0-596-00027-8</p>