<?xml version="1.0" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Encode::Guess -- Guesses encoding from data</title>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<link rev="made" href="mailto:" />
</head>
<body style="background-color: white">
<table border="0" width="100%" cellspacing="0" cellpadding="3">
<tr><td class="block" style="background-color: #cccccc" valign="middle">
<big><strong><span class="block">&nbsp;Encode::Guess -- Guesses encoding from data</span></strong></big>
</td></tr>
</table>
<p><a name="__index__"></a></p>
<!-- INDEX BEGIN -->
<ul>
<li><a href="#name">NAME</a></li>
<li><a href="#synopsis">SYNOPSIS</a></li>
<li><a href="#abstract">ABSTRACT</a></li>
<li><a href="#description">DESCRIPTION</a></li>
<li><a href="#caveats">CAVEATS</a></li>
<li><a href="#to_do">TO DO</a></li>
<li><a href="#see_also">SEE ALSO</a></li>
</ul>
<!-- INDEX END -->
<hr />
<p>
</p>
<h1><a name="name">NAME</a></h1>
<p>Encode::Guess -- Guesses encoding from data</p>
<p>
</p>
<hr />
<h1><a name="synopsis">SYNOPSIS</a></h1>
<pre>
# if you are sure $data won't contain anything bogus</pre>
<pre>
use Encode;
use Encode::Guess qw/euc-jp shiftjis 7bit-jis/;
my $utf8 = decode(&quot;Guess&quot;, $data);
my $data = encode(&quot;Guess&quot;, $utf8); # this doesn't work!</pre>
<pre>
# more elaborate way
use Encode::Guess;
my $enc = guess_encoding($data, qw/euc-jp shiftjis 7bit-jis/);
ref($enc) or die &quot;Can't guess: $enc&quot;; # trap error this way
$utf8 = $enc-&gt;decode($data);
# or
$utf8 = decode($enc-&gt;name, $data);</pre>
<p>
</p>
<hr />
<h1><a name="abstract">ABSTRACT</a></h1>
<p>Encode::Guess lets you guess which encoding a given piece of data is
in, or at least tries to.</p>
<p>
</p>
<hr />
<h1><a name="description">DESCRIPTION</a></h1>
<p>By default, it checks only ascii, utf8 and UTF-16/32 with BOM.</p>
<pre>
use Encode::Guess; # ascii/utf8/BOMed UTF</pre>
<p>To use it more practically, you have to give the names of the encodings to
check (called <em>suspects</em> below). Suspect names can be either
canonical names or aliases.</p>
<p>CAVEAT: Unlike UTF-(16|32), BOM in utf8 is NOT AUTOMATICALLY STRIPPED.</p>
<pre>
# tries all major Japanese Encodings as well
use Encode::Guess qw/euc-jp shiftjis 7bit-jis/;</pre>
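<p>The BOM caveat above means that a string decoded from utf8 may still begin
with U+FEFF. A minimal sketch of removing it by hand, assuming made-up sample
bytes in <code>$data</code> and that the guess succeeds:</p>

```perl
use strict;
use warnings;
use Encode qw(encode);
use Encode::Guess;

# Sample input (an assumption for illustration): a UTF-8 BOM (EF BB BF)
# followed by UTF-8 encoded text containing one non-ASCII character.
my $data = "\xEF\xBB\xBF" . encode('UTF-8', "caf\x{e9}");

my $decoder = guess_encoding($data);
die "Can't guess: $decoder" unless ref $decoder;

my $text = $decoder->decode($data);

# The BOM survives decoding as the character U+FEFF, so remove it explicitly.
$text =~ s/\A\x{FEFF}//;
```

<p>The substitution is harmless when no BOM is present, so it can be applied
unconditionally after decoding.</p>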
<p>If the <code>$Encode::Guess::NoUTFAutoGuess</code> variable is set to a true
value, no heuristics will be applied to UTF8/16/32, and the result
will be limited to the suspects and <code>ascii</code>.</p>
<dl>
<dt><strong><a name="item_set_suspects">Encode::Guess-&gt;set_suspects</a></strong></dt>
<dd>
<p>You can also change the internal suspects list via the <a href="#item_set_suspects"><code>set_suspects</code></a>
method.</p>
</dd>
<dd>
<pre>
use Encode::Guess;
Encode::Guess-&gt;set_suspects(qw/euc-jp shiftjis 7bit-jis/);</pre>
</dd>
<dt><strong><a name="item_add_suspects">Encode::Guess-&gt;add_suspects</a></strong></dt>
<dd>
<p>Or you can use the <a href="#item_add_suspects"><code>add_suspects</code></a> method. The difference is that
<a href="#item_set_suspects"><code>set_suspects</code></a> flushes the current suspects list while
<a href="#item_add_suspects"><code>add_suspects</code></a> adds to it.</p>
</dd>
<dd>
<pre>
use Encode::Guess;
Encode::Guess-&gt;add_suspects(qw/euc-jp shiftjis 7bit-jis/);
Encode::Guess-&gt;add_suspects(qw/euc-kr euc-cn big5-eten/);
# now the suspects are euc-jp, shiftjis, 7bit-jis, AND
# euc-kr, euc-cn, and big5-eten</pre>
</dd>
<dt><strong><a name="item_decode">Encode::decode(&quot;Guess&quot; ...)</a></strong></dt>
<dd>
<p>Once you are content with the suspects list, you can decode like this:</p>
</dd>
<dd>
<pre>
my $utf8 = Encode::decode(&quot;Guess&quot;, $data);</pre>
</dd>
<dt><strong><a name="item_guess">Encode::Guess-&gt;<code>guess($data)</code></a></strong></dt>
<dd>
<p>But <code>Encode::decode(&quot;Guess&quot;, ...)</code> will croak if:</p>
<ul>
<li>
<p>Two or more suspects remain</p>
</li>
<li>
<p>No suspects are left</p>
</li>
</ul>
<p>So you should try this instead:</p>
<pre>
my $decoder = Encode::Guess-&gt;guess($data);</pre>
<p>On success, $decoder is an object that is documented in
<a href="file://C|\msysgit\mingw\html/lib/Encode/Encoding.html">the Encode::Encoding manpage</a>. So you can now do this:</p>
<pre>
my $utf8 = $decoder-&gt;decode($data);</pre>
<p>On failure, $decoder contains an error message instead, so the whole thing
looks like this:</p>
<pre>
my $decoder = Encode::Guess-&gt;guess($data);
die $decoder unless ref($decoder);
my $utf8 = $decoder-&gt;decode($data);</pre>
</dd>
<dt><strong><a name="item_guess_encoding">guess_encoding($data [, <em>list of suspects</em>])</a></strong></dt>
<dd>
<p>You can also try the <a href="#item_guess_encoding"><code>guess_encoding</code></a> function, which is exported by
default. It takes the $data to check and, optionally, a list of
suspects. The optional suspect list is <em>not reflected</em> in
the internal suspects list.</p>
</dd>
<dd>
<pre>
my $decoder = guess_encoding($data, qw/euc-jp euc-kr euc-cn/);
die $decoder unless ref($decoder);
my $utf8 = $decoder-&gt;decode($data);
# check only ascii, utf8 and UTF-16/32 with BOM
$decoder = guess_encoding($data);</pre>
</dd>
</dl>
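<p>Because a failed guess returns an error string rather than an object, one
possible pattern is to fall back to a fixed encoding instead of dying. The
sketch below assumes a hypothetical helper named <code>decode_or_fallback</code>
(not part of Encode::Guess) and made-up sample bytes; whether a silent fallback
is acceptable depends on your data.</p>

```perl
use strict;
use warnings;
use Encode qw(find_encoding);
use Encode::Guess;

# Hypothetical helper: guess the encoding of $octets among the default
# suspects plus @suspects; if the guess fails, use $fallback instead.
sub decode_or_fallback {
    my ($octets, $fallback, @suspects) = @_;
    my $decoder = guess_encoding($octets, @suspects);
    unless (ref $decoder) {
        # $decoder holds an error message here, not an object
        $decoder = find_encoding($fallback)
            or die "Unknown fallback encoding: $fallback";
    }
    return $decoder->decode($octets);
}

# Sample bytes (an assumption): Latin-1 text, which is neither ascii nor
# valid utf8, so the default guess fails and the fallback is used.
my $text = decode_or_fallback("caf\xE9 au lait", 'iso-8859-1');
```

<p>The helper leaves the internal suspects list untouched, because it only
passes suspects to <code>guess_encoding</code> per call.</p>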
<p>
</p>
<hr />
<h1><a name="caveats">CAVEATS</a></h1>
<ul>
<li>
<p>Because of the algorithm used, the ISO-8859 series and other single-byte
encodings do not work well unless a single ISO-8859 encoding is the only
suspect (besides ascii and utf8).</p>
<pre>
use Encode::Guess;
# perhaps ok
my $decoder = guess_encoding($data, 'latin1');
# definitely NOT ok
$decoder = guess_encoding($data, qw/latin1 greek/);</pre>
<p>The reason is that Encode::Guess guesses the encoding by trial and error.
It first splits $data into lines and tries to decode each line with every
suspect. It keeps going until all but one encoding has been eliminated
from the suspects list. The ISO-8859 series is just too successful in most
cases, because it fills almost all code points in \x00-\xff.</p>
</li>
<li>
<p>Do not mix national standard encodings and the corresponding vendor
encodings.</p>
<pre>
# a very bad idea
my $decoder
    = guess_encoding($data, qw/shiftjis MacJapanese cp932/);</pre>
<p>The reason is that a vendor encoding is usually a superset of the national
standard, so the guess becomes too ambiguous in most cases.</p>
</li>
<li>
<p>On the other hand, mixing various national standard encodings
automagically works unless $data is too short to allow for guessing.</p>
<pre>
# This is ok if $data is long enough
my $decoder =
    guess_encoding($data, qw/euc-cn
                             euc-jp shiftjis 7bit-jis
                             euc-kr
                             big5-eten/);</pre>
</li>
<li>
<p>DO NOT PUT TOO MANY SUSPECTS! Do not try something like this!</p>
<pre>
my $decoder = guess_encoding($data,
                             Encode-&gt;encodings(&quot;:all&quot;));</pre>
</li>
</ul>
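<p>The trial-and-error elimination described in the first caveat can be
sketched as follows. This is a simplified illustration, not the module's actual
code: <code>surviving_suspects</code> is a hypothetical helper, the sample bytes
are made up, and the real module applies additional shortcuts such as BOM
checks.</p>

```perl
use strict;
use warnings;
use Encode qw(find_encoding encode);

# Hypothetical helper: return the names of the suspects that can decode
# every line of $octets without leaving undecoded bytes behind.
sub surviving_suspects {
    my ($octets, @suspects) = @_;
    my %try;
    for my $name (@suspects) {
        $try{$name} = find_encoding($name) or die "Unknown encoding: $name";
    }
    for my $line (split /\r\n?|\n/, $octets) {
        for my $name (keys %try) {
            my $scratch = $line;
            # FB_QUIET stops at the first malformed byte and leaves the
            # undecoded remainder in $scratch.
            $try{$name}->decode($scratch, Encode::FB_QUIET());
            delete $try{$name} if length $scratch;
        }
        last if keys(%try) <= 1;
    }
    return sort keys %try;
}

# EUC-JP bytes are neither ascii nor valid utf8, so only euc-jp survives.
my @japanese = surviving_suspects(
    encode('euc-jp', "\x{65e5}\x{672c}\x{8a9e}\n"), qw/ascii utf8 euc-jp/);

# These bytes are valid in both single-byte encodings, so neither suspect
# can be eliminated and the guess stays ambiguous.
my @latin = surviving_suspects("\xE1\xE2\xE3", qw/iso-8859-1 iso-8859-7/);
```

<p>The second call shows why two ISO-8859 suspects rarely work together:
almost any byte sequence decodes cleanly in both.</p>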
<p>It is, after all, just a guess. You should always be explicit when it
comes to encodings. But in some environments, especially Japanese ones,
guessing the encoding is a must. Use this module with care.</p>
<p>
</p>
<hr />
<h1><a name="to_do">TO DO</a></h1>
<p>Encode::Guess does not work on EBCDIC platforms.</p>
<p>
</p>
<hr />
<h1><a name="see_also">SEE ALSO</a></h1>
<p><a href="file://C|\msysgit\mingw\html/lib/Encode.html">the Encode manpage</a>, <a href="file://C|\msysgit\mingw\html/lib/Encode/Encoding.html">the Encode::Encoding manpage</a></p>
</body>
</html>