mingw/html/lib/utf8.html

   1 <?xml version="1.0" ?>
   2 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
   3 <html xmlns="http://www.w3.org/1999/xhtml">
   4 <head>
   5 <title>utf8 - Perl pragma to enable/disable UTF-8 in source code</title>
   6 <meta http-equiv="content-type" content="text/html; charset=utf-8" />
   7 <link rev="made" href="mailto:" />
   8 </head>
   9
  10 <body style="background-color: white">
  11 <table border="0" width="100%" cellspacing="0" cellpadding="3">
  12 <tr><td class="block" style="background-color: #cccccc" valign="middle">
  13 <big><strong><span class="block">&nbsp;utf8 - Perl pragma to enable/disable UTF-8 in source code</span></strong></big>
  14 </td></tr>
  15 </table>
  16
  17 <p><a name="__index__"></a></p>
  18 <!-- INDEX BEGIN -->
  19
  20 <ul>
  21
  22         <li><a href="#name">NAME</a></li>
  23         <li><a href="#synopsis">SYNOPSIS</a></li>
  24         <li><a href="#description">DESCRIPTION</a></li>
  25         <ul>
  26
  27                 <li><a href="#utility_functions">Utility functions</a></li>
  28         </ul>
  29
  30         <li><a href="#bugs">BUGS</a></li>
  31         <li><a href="#see_also">SEE ALSO</a></li>
  32 </ul>
  33 <!-- INDEX END -->
  34
  35 <hr />
  36 <p>
  37 </p>
  38 <h1><a name="name">NAME</a></h1>
  39 <p>utf8 - Perl pragma to enable/disable UTF-8 (or UTF-EBCDIC) in source code</p>
  40 <p>
  41 </p>
  42 <hr />
  43 <h1><a name="synopsis">SYNOPSIS</a></h1>
  44 <pre>
  45     use utf8;
  46     no utf8;</pre>
  47 <pre>
  48     # Convert a Perl scalar to/from UTF-8.
  49     $num_octets = utf8::upgrade($string);
  50     $success    = utf8::downgrade($string[, FAIL_OK]);</pre>
  51 <pre>
  52     # Change the native bytes of a Perl scalar to/from UTF-8 bytes.
  53     utf8::encode($string);
  54     utf8::decode($string);</pre>
  55 <pre>
  56     $flag = utf8::is_utf8(STRING); # since Perl 5.8.1
  57     $flag = utf8::valid(STRING);</pre>
  58 <p>
  59 </p>
  60 <hr />
  61 <h1><a name="description">DESCRIPTION</a></h1>
  62 <p>The <code>use utf8</code> pragma tells the Perl parser to allow UTF-8 in the
  63 program text in the current lexical scope (allow UTF-EBCDIC on EBCDIC based
  64 platforms).  The <code>no utf8</code> pragma tells Perl to switch back to treating
  65 the source text as literal bytes in the current lexical scope.</p>
  66 <p>This pragma is primarily a compatibility device.  Perl versions
  67 earlier than 5.6 allowed arbitrary bytes in source code, whereas
  68 in the future we would like to standardize on the UTF-8 encoding for
  69 source text.</p>
  70 <p><strong>Do not use this pragma for anything other than telling Perl that your
  71 script is written in UTF-8.</strong> The utility functions described below are
  72 useful for their own purposes, but they are not really part of the
  73 &quot;pragmatic&quot; effect.</p>
  74 <p>Until UTF-8 becomes the default format for source text, either this
  75 pragma or the <a href="file://C|\msysgit\mingw\html/lib/encoding.html">encoding</a> pragma should be used to recognize UTF-8
  76 in the source.  When UTF-8 becomes the standard source format, this
  77 pragma will effectively become a no-op.  For convenience, in what
  78 follows the term <em>UTF-X</em> is used to refer to UTF-8 on ASCII and ISO
  79 Latin based platforms and UTF-EBCDIC on EBCDIC based platforms.</p>
  80 <p>See also the effects of the <code>-C</code> switch and its cousin, the
  81 <code>$ENV{PERL_UNICODE}</code> environment variable, in <a href="file://C|\msysgit\mingw\html/pod/perlrun.html">the perlrun manpage</a>.</p>
  82 <p>Enabling the <code>utf8</code> pragma has the following effect:</p>
  83 <ul>
  84 <li>
  85 <p>Bytes in the source text that have their high bit set will be treated
  86 as being part of a literal UTF-8 character.  This includes most
  87 literals such as identifier names, string constants, and constant
  88 regular expression patterns.</p>
  89 <p>On EBCDIC platforms characters in the Latin 1 character set are
  90 treated as being part of a literal UTF-EBCDIC character.</p>
  91 </li>
  92 </ul>
  93 <p>Note that if you have bytes with the eighth bit on in your script
  94 (for example embedded Latin-1 in your string literals), <code>use utf8</code>
  95 will be unhappy since the bytes are most probably not well-formed
  96 UTF-8.  If you want to have such bytes and use utf8, you can disable
  97 utf8 until the end of the block (or file, if at top level) with <code>no utf8;</code>.</p>
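<p>The lexical scoping described above can be illustrated with a minimal sketch.
It assumes the script file itself is saved as UTF-8; the variable names
<code>$chars</code> and <code>$bytes</code> are chosen only for this illustration.</p>

```perl
use strict;
use warnings;

my ($chars, $bytes);
{
    use utf8;                  # literals in this block are read as UTF-8 text
    $chars = length("é");      # counted as characters
}
{
    no utf8;                   # literals in this block stay as raw source bytes
    $bytes = length("é");      # counted as the octets stored in the file
}
print "with use utf8: $chars, with no utf8: $bytes\n";
```

<p>In a UTF-8 encoded file the same literal is one character inside the
<code>use utf8</code> block and two octets inside the <code>no utf8</code> block.</p>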
  98 <p>If you want to automatically upgrade your 8-bit legacy bytes to UTF-8,
  99 use the <a href="file://C|\msysgit\mingw\html/lib/encoding.html">encoding</a> pragma instead of this pragma.  For example, if
 100 you want to implicitly upgrade your ISO 8859-1 (Latin-1) bytes to UTF-8
 101 as used in e.g. <a href="file://C|\msysgit\mingw\html/pod/perlfunc.html#item_chr"><code>chr()</code></a> and <code>\x{...}</code>, try this:</p>
 102 <pre>
 103     use encoding &quot;latin-1&quot;;
 104     my $c = chr(0xc4);
 105     my $x = &quot;\x{c5}&quot;;</pre>
 106 <p>In case you are wondering: yes, <code>use encoding 'utf8';</code> works much
 107 the same as <code>use utf8;</code>.</p>
 108 <p>
 109 </p>
 110 <h2><a name="utility_functions">Utility functions</a></h2>
 111 <p>The following functions are defined in the <code>utf8::</code> package by the
 112 Perl core.  You do not need to say <code>use utf8</code> to use these, and in fact
 113 you should not say it unless you really want to have UTF-8 source code.</p>
 114 <ul>
 115 <li><strong><a name="item_upgrade">$num_octets = utf8::upgrade($string)</a></strong>
 116
 117 <p>Converts in-place the octet sequence in the native encoding
 118 (Latin-1 or EBCDIC) to the equivalent character sequence in <em>UTF-X</em>.
 119 Calling this on a <em>$string</em> that is already encoded as characters does no harm.
 120 Returns the number of octets necessary to represent the string as <em>UTF-X</em>.
 121 Can be used to make sure that the UTF-8 flag is on,
 122 so that <code>\w</code> or <a href="file://C|\msysgit\mingw\html/pod/perlfunc.html#item_lc"><code>lc()</code></a> work as Unicode on strings
 123 containing characters in the range 0x80-0xFF (on ASCII and
 124 derivatives).</p>
 125 <p><strong>Note that this function does not handle arbitrary encodings.</strong>
 126 Therefore <em>Encode.pm</em> is recommended for general purposes.</p>
 127 <p>Affected by the encoding pragma.</p>
 128 </li>
 129 <li><strong><a name="item_downgrade">$success = utf8::downgrade($string[, FAIL_OK])</a></strong>
 130
 131 <p>Converts in-place the character sequence in <em>UTF-X</em>
 132 to the equivalent octet sequence in the native encoding (Latin-1 or EBCDIC).
 133 Calling this on a <em>$string</em> that is already encoded as octets does no harm.
 134 Returns true on success. On failure dies or, if the value of
 135 <code>FAIL_OK</code> is true, returns false.
 136 Can be used to make sure that the UTF-8 flag is off,
 137 e.g. when you want to make sure that the <a href="file://C|\msysgit\mingw\html/pod/perlvar.html#item_substr"><code>substr()</code></a> or <a href="file://C|\msysgit\mingw\html/pod/perlfunc.html#item_length"><code>length()</code></a> function
 138 works with the usually faster byte algorithm.</p>
 139 <p><strong>Note that this function does not handle arbitrary encodings.</strong>
 140 Therefore <em>Encode.pm</em> is recommended for general purposes.</p>
 141 <p><strong>Not</strong> affected by the encoding pragma.</p>
 142 <p><strong>NOTE:</strong> this function is experimental and may change
 143 or be removed without notice.</p>
 144 </li>
 145 <li><strong><a name="item_encode">utf8::encode($string)</a></strong>
 146
 147 <p>Converts in-place the character sequence to the corresponding octet sequence
 148 in <em>UTF-X</em>.  The UTF-8 flag is turned off.  Returns nothing.</p>
 149 <p><strong>Note that this function does not handle arbitrary encodings.</strong>
 150 Therefore <em>Encode.pm</em> is recommended for general purposes.</p>
 151 </li>
 152 <li><strong><a name="item_decode">utf8::decode($string)</a></strong>
 153
 154 <p>Attempts to convert in-place the octet sequence in <em>UTF-X</em>
 155 to the corresponding character sequence.  The UTF-8 flag is turned on
 156 only if the source string contains multiple-byte <em>UTF-X</em> characters.
 157 If <em>$string</em> is invalid as <em>UTF-X</em>, returns false; otherwise returns true.</p>
 158 <p><strong>Note that this function does not handle arbitrary encodings.</strong>
 159 Therefore <em>Encode.pm</em> is recommended for general purposes.</p>
 160 <p><strong>NOTE:</strong> this function is experimental and may change
 161 or be removed without notice.</p>
 162 </li>
 163 <li><strong><a name="item_is_utf8">$flag = utf8::is_utf8(STRING)</a></strong>
 164
 165 <p>(Since Perl 5.8.1)  Test whether STRING is in UTF-8.  Functionally
 166 the same as Encode::is_utf8().</p>
 167 </li>
 168 <li><strong><a name="item_valid">$flag = utf8::valid(STRING)</a></strong>
 169
 170 <p>[INTERNAL] Test whether STRING is in a consistent state regarding
 171 UTF-8.  Returns true if the string is well-formed UTF-8 and has the UTF-8 flag
 172 on <strong>or</strong> if the string is held as bytes (both these states are 'consistent').
 173 The main reason for this routine is to allow Perl's test suite to check
 174 that operations have left strings in a consistent state.  You most
 175 probably want to use utf8::is_utf8() instead.</p>
 176 </li>
 177 </ul>
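<p>As a rough illustration of <code>utf8::upgrade</code>, <code>utf8::downgrade</code>
and <code>utf8::is_utf8</code> as described in the list above, the following sketch
assumes an ASCII-based platform with no <code>encoding</code> pragma in effect.
All variable names are chosen only for this example.</p>

```perl
use strict;
use warnings;

# "\xE9" is one Latin-1 character, initially stored as a single octet.
my $s = "\xE9";
my $flag_before = utf8::is_utf8($s) ? 1 : 0;

# upgrade changes only the internal storage; the string is still one character.
my $octets     = utf8::upgrade($s);
my $flag_after = utf8::is_utf8($s) ? 1 : 0;
my $length     = length($s);

# downgrade reverses that, because the character fits in one native octet.
my $down_ok   = utf8::downgrade($s) ? 1 : 0;
my $flag_down = utf8::is_utf8($s) ? 1 : 0;

# A character above 0xFF cannot be downgraded; a true FAIL_OK avoids the die.
my $wide    = "\x{263A}";
my $wide_ok = utf8::downgrade($wide, 1) ? 1 : 0;

print "flag before=$flag_before after=$flag_after octets=$octets length=$length\n";
print "downgrade ok=$down_ok flag=$flag_down wide ok=$wide_ok\n";
```

<p>The point of the sketch is that neither function changes which characters the
string contains; only the internal representation and the UTF-8 flag differ.</p>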
 178 <p><code>utf8::encode</code> is like <code>utf8::upgrade</code>, but the UTF-8 flag is
 179 cleared.  See <a href="file://C|\msysgit\mingw\html/pod/perlunicode.html">the perlunicode manpage</a> for more on the UTF-8 flag and the C API
 180 functions <code>sv_utf8_upgrade</code>, <code>sv_utf8_downgrade</code>, <code>sv_utf8_encode</code>,
 181 and <code>sv_utf8_decode</code>, which are wrapped by the Perl functions
 182 <code>utf8::upgrade</code>, <code>utf8::downgrade</code>, <code>utf8::encode</code> and
 183 <code>utf8::decode</code>.  Note that in the Perl 5.8.0 and 5.8.1 implementation
 184 the functions utf8::is_utf8, utf8::valid, utf8::encode, utf8::decode,
 185 utf8::upgrade, and utf8::downgrade are always available, without a
 186 <code>require utf8</code> statement; this may change in future releases.</p>
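<p>The difference between <code>utf8::upgrade</code> and <code>utf8::encode</code>,
and the round trip through <code>utf8::decode</code>, can be sketched as follows.
This assumes an ASCII-based platform; the variable names are illustrative only.</p>

```perl
use strict;
use warnings;

# Start from the same one-character string twice.
my $up = "\xE9";
my $en = "\xE9";

utf8::upgrade($up);    # same character, different internal storage
utf8::encode($en);     # the string now holds the UTF-X octets themselves

my $up_len  = length($up);
my $en_len  = length($en);
my $en_flag = utf8::is_utf8($en) ? 1 : 0;
my $en_hex  = join " ", map { sprintf "%02X", ord } split //, $en;

# decode reads the octets back as characters and reports success.
my $dec_ok = utf8::decode($en) ? 1 : 0;
my $same   = ($en eq $up) ? 1 : 0;

# Octets that are not well-formed UTF-X make decode return false.
my $bad    = "\xFF";
my $bad_ok = utf8::decode($bad) ? 1 : 0;

print "upgrade length=$up_len encode length=$en_len octets=$en_hex\n";
print "decode ok=$dec_ok round trip same=$same bad ok=$bad_ok\n";
```

<p>After <code>utf8::upgrade</code> the string still compares equal to the original,
whereas after <code>utf8::encode</code> it is a different, longer string of octets
until it is decoded again.</p>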
 187 <p>
 188 </p>
 189 <hr />
 190 <h1><a name="bugs">BUGS</a></h1>
 191 <p>One can have Unicode in identifier names, but not in package/class or
 192 subroutine names.  While some limited functionality towards this does
 193 exist as of Perl 5.8.0, that is more accidental than designed; use of
 194 Unicode for the said purposes is unsupported.</p>
 195 <p>One reason for this unfinished state is its (currently) inherent
 196 unportability: since both package names and subroutine names may need
 197 to be mapped to file and directory names, the Unicode capability of
 198 the filesystem becomes important, and unfortunately there are no
 199 portable answers.</p>
 200 <p>
 201 </p>
 202 <hr />
 203 <h1><a name="see_also">SEE ALSO</a></h1>
 204 <p><a href="file://C|\msysgit\mingw\html/pod/perluniintro.html">the perluniintro manpage</a>, <a href="file://C|\msysgit\mingw\html/lib/encoding.html">the encoding manpage</a>, <a href="file://C|\msysgit\mingw\html/pod/perlrun.html">the perlrun manpage</a>, <a href="file://C|\msysgit\mingw\html/lib/bytes.html">the bytes manpage</a>, <a href="file://C|\msysgit\mingw\html/pod/perlunicode.html">the perlunicode manpage</a></p>
 210
 211 </body>
 212
 213 </html>