<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Encode::Guess -- Guesses encoding from data</title>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<link rev="made" href="mailto:" />
</head>
<body style="background-color: white">
<table border="0" width="100%" cellspacing="0" cellpadding="3">
<tr><td class="block" style="background-color: #cccccc" valign="middle">
<big><strong><span class="block"> Encode::Guess -- Guesses encoding from data</span></strong></big>
</td></tr>
</table>
<p><a name="__index__"></a></p>
<ul>
<li><a href="#name">NAME</a></li>
<li><a href="#synopsis">SYNOPSIS</a></li>
<li><a href="#abstract">ABSTRACT</a></li>
<li><a href="#description">DESCRIPTION</a></li>
<li><a href="#caveats">CAVEATS</a></li>
<li><a href="#to_do">TO DO</a></li>
<li><a href="#see_also">SEE ALSO</a></li>
</ul>
<h1><a name="name">NAME</a></h1>
<p>Encode::Guess -- Guesses encoding from data</p>
<h1><a name="synopsis">SYNOPSIS</a></h1>
<pre>
  # if you are sure $data won't contain anything bogus
  use Encode;
  use Encode::Guess qw/euc-jp shiftjis 7bit-jis/;
  my $utf8 = decode("Guess", $data);
  my $data = encode("Guess", $utf8);   # this doesn't work! "Guess" is for decoding only
</pre>
<pre>
  # more elaborate way
  use Encode::Guess;
  my $enc = guess_encoding($data, qw/euc-jp shiftjis 7bit-jis/);
  ref($enc) or die "Can't guess: $enc"; # trap error this way
  $utf8 = $enc-&gt;decode($data);
  # or
  $utf8 = decode($enc-&gt;name, $data);
</pre>
<h1><a name="abstract">ABSTRACT</a></h1>
<p>Encode::Guess lets you guess which encoding a given piece of data is
encoded in, or at least tries to.</p>
<h1><a name="description">DESCRIPTION</a></h1>
<p>By default, it checks only ascii, utf8 and UTF-16/32 with BOM.</p>
<pre>
  use Encode::Guess; # ascii/utf8/BOMed UTF
</pre>
<p>To make it more useful in practice, give it the names of the encodings to
check (called <em>suspects</em> below). Suspect names can be either
canonical names or aliases.</p>
<p>CAVEAT: Unlike with UTF-(16|32), a BOM in utf8 is NOT AUTOMATICALLY STRIPPED.</p>
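<p>As a minimal sketch of working around this caveat, one option is to remove a leading U+FEFF from the decoded string yourself. The sample bytes in <code>$data</code> are invented for illustration, and the substitution is just one way to drop the BOM, not something the module does for you.</p>

```perl
use strict;
use warnings;
use Encode::Guess;

# Hypothetical sample: the UTF-8 BOM (EF BB BF) followed by ASCII text.
my $data = "\xEF\xBB\xBFhello";

my $decoder = guess_encoding($data);
die $decoder unless ref($decoder);      # a plain string means guessing failed

my $text = $decoder->decode($data);     # the decoded text still begins with U+FEFF
$text =~ s/\A\x{FEFF}//;                # strip the BOM by hand
```

<p>The substitution is anchored with <code>\A</code> so that only a BOM at the very start of the text is removed.</p>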
<pre>
  # tries all major Japanese Encodings as well
  use Encode::Guess qw/euc-jp shiftjis 7bit-jis/;
</pre>
<p>If the <code>$Encode::Guess::NoUTFAutoGuess</code> variable is set to a true
value, no heuristics will be applied to UTF8/16/32, and the result
will be limited to the suspects and <code>ascii</code>.</p>
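<p>A small sketch of toggling this variable follows. The sample data (BOM-less UTF-16LE bytes) is made up for illustration, and <code>local</code> is used here only so that the setting does not leak into the rest of the program.</p>

```perl
use strict;
use warnings;
use Encode::Guess;

# Hypothetical sample: "hi" as UTF-16LE without a BOM.
my $data = "h\x00i\x00";

# Default behaviour: the UTF-16/32 heuristics are allowed to run.
my $with_heuristics = guess_encoding($data);

# With the flag set, only the suspects (and ascii) are considered.
my $without_heuristics = do {
    local $Encode::Guess::NoUTFAutoGuess = 1;
    guess_encoding($data);
};

# Each result is either an encoding object or an error string.
for my $result ($with_heuristics, $without_heuristics) {
    print ref($result) ? $result->name : "failed: $result", "\n";
}
```

<p>Comparing the two printed names shows how much of the answer comes from the UTF heuristics rather than from the suspects list.</p>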
<dl>
<dt><strong><a name="item_set_suspects">Encode::Guess-&gt;set_suspects</a></strong></dt>
<dd>
<p>You can also change the internal suspects list via the
<a href="#item_set_suspects"><code>set_suspects</code></a> method.</p>
<pre>
  use Encode::Guess;
  Encode::Guess-&gt;set_suspects(qw/euc-jp shiftjis 7bit-jis/);
</pre>
</dd>
<dt><strong><a name="item_add_suspects">Encode::Guess-&gt;add_suspects</a></strong></dt>
<dd>
<p>Alternatively, use the <a href="#item_add_suspects"><code>add_suspects</code></a> method. The difference is that
<a href="#item_set_suspects"><code>set_suspects</code></a> replaces the current suspects list, while
<a href="#item_add_suspects"><code>add_suspects</code></a> appends to it.</p>
<pre>
  use Encode::Guess;
  Encode::Guess-&gt;add_suspects(qw/euc-jp shiftjis 7bit-jis/);
  # now the suspects are euc-jp, shiftjis, 7bit-jis, AND
  # euc-kr, euc-cn, and big5-eten
  Encode::Guess-&gt;add_suspects(qw/euc-kr euc-cn big5-eten/);
</pre>
</dd>
<dt><strong><a name="item_decode">Encode::decode("Guess" ...)</a></strong></dt>
<dd>
<p>Once you are happy with the suspects list, you can write:</p>
<pre>
  my $utf8 = Encode::decode("Guess", $data);
</pre>
</dd>
<dt><strong><a name="item_guess">Encode::Guess-&gt;<code>guess($data)</code></a></strong></dt>
<dd>
<p>But <code>decode("Guess", ...)</code> will croak if:</p>
<ul>
<li><p>Two or more suspects remain</p></li>
<li><p>No suspects left</p></li>
</ul>
<p>So you should use this instead:</p>
<pre>
  my $decoder = Encode::Guess-&gt;guess($data);
</pre>
<p>On success, $decoder is an object that is documented in
<a href="file://C|\msysgit\mingw\html/lib/Encode/Encoding.html">the Encode::Encoding manpage</a>. So you can now do this:</p>
<pre>
  my $utf8 = $decoder-&gt;decode($data);
</pre>
<p>On failure, $decoder contains an error message instead, so the whole thing
looks like this:</p>
<pre>
  my $decoder = Encode::Guess-&gt;guess($data);
  die $decoder unless ref($decoder);
  my $utf8 = $decoder-&gt;decode($data);
</pre>
</dd>
<dt><strong><a name="item_guess_encoding">guess_encoding($data [, <em>list of suspects</em>])</a></strong></dt>
<dd>
<p>You can also use the <a href="#item_guess_encoding"><code>guess_encoding</code></a> function, which is exported by
default. It takes the $data to check and, optionally, a list of
suspects. The optional suspect list is <em>not reflected</em> in
the internal suspects list.</p>
<pre>
  my $decoder = guess_encoding($data, qw/euc-jp euc-kr euc-cn/);
  die $decoder unless ref($decoder);
  my $utf8 = $decoder-&gt;decode($data);
  # check only ascii and utf8
  $decoder = guess_encoding($data);
</pre>
</dd>
</dl>
<h1><a name="caveats">CAVEATS</a></h1>
<ul>
<li>
<p>Because of the algorithm used, the ISO-8859 series and other single-byte
encodings do not work well unless a single ISO-8859 encoding is the only
suspect (besides ascii and utf8).</p>
<pre>
  # perhaps ok: latin1 is the only single-byte suspect
  my $decoder = guess_encoding($data, 'latin1');
  # definitely NOT ok: two ISO-8859 suspects
  $decoder = guess_encoding($data, qw/latin1 greek/);
</pre>
<p>The reason is that Encode::Guess guesses the encoding by trial and error.
It first splits $data into lines and tries to decode each line with every
suspect. It keeps going until all but one encoding have been eliminated
from the suspects list. The ISO-8859 series is simply too successful in most
cases, because it assigns a character to almost every byte value in \x00-\xff.</p>
</li>
<li>
<p>Do not mix national standard encodings and the corresponding vendor
encodings.</p>
<pre>
  # a very bad idea
  my $decoder = guess_encoding($data, qw/shiftjis MacJapanese cp932/);
</pre>
<p>The reason is that a vendor encoding is usually a superset of the national
standard, so the guess becomes too ambiguous in most cases.</p>
</li>
<li>
<p>On the other hand, mixing various national standard encodings
automagically works unless $data is too short to allow for guessing.</p>
<pre>
  # This is ok if $data is long enough
  guess_encoding($data, qw/euc-cn euc-jp shiftjis 7bit-jis/);
</pre>
</li>
<li>
<p>DO NOT USE TOO MANY SUSPECTS! Do not try something like this:</p>
<pre>
  my $decoder = guess_encoding($data,
                               Encode-&gt;encodings(":all"));
</pre>
</li>
</ul>
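<p>The trial-and-error elimination described in the first caveat above can be sketched roughly as follows. This is a simplified illustration built on <code>find_encoding</code> and <code>FB_QUIET</code> from Encode, not the module's actual implementation; <code>surviving_suspects</code> is a hypothetical helper name and the sample bytes are invented.</p>

```perl
use strict;
use warnings;
use Encode qw(:fallbacks find_encoding);

# Hypothetical helper (not part of Encode::Guess): returns the names of the
# suspects that are still able to decode the lines examined so far.
sub surviving_suspects {
    my ($octets, @suspects) = @_;

    my %try;
    for my $name (@suspects) {
        $try{$name} = find_encoding($name) or die "Unknown encoding: $name";
    }

    for my $line (split /\r\n?|\n/, $octets) {
        for my $name (keys %try) {
            my $scratch = $line;                      # decode() consumes its input
            $try{$name}->decode($scratch, FB_QUIET);  # stops quietly at the first bad byte
            delete $try{$name} if length $scratch;    # leftover bytes: suspect eliminated
        }
        last if keys(%try) <= 1;                      # nothing left to eliminate
    }
    return sort keys %try;
}

# The byte \xE9 is not valid ascii, so only latin1 survives here.
my @one = surviving_suspects("caf\xE9\n", qw/ascii iso-8859-1/);

# Both single-byte encodings accept \xE9, so neither is eliminated.
my @two = surviving_suspects("caf\xE9\n", qw/iso-8859-1 iso-8859-7/);

print "@one\n";
print "@two\n";
```

<p>The second call illustrates why two ISO-8859 suspects are a problem: every byte in the sample is valid in both encodings, so the loop has no way to rule either one out.</p>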
<p>It is, after all, just a guess. You should always be explicit when it
comes to encodings. But there are some environments, especially Japanese
ones, where guessing the encoding is a must. Use this module with care.</p>
<h1><a name="to_do">TO DO</a></h1>
<p>Encode::Guess does not work on EBCDIC platforms.</p>
<h1><a name="see_also">SEE ALSO</a></h1>
<p><a href="file://C|\msysgit\mingw\html/lib/Encode.html">the Encode manpage</a>,
<a href="file://C|\msysgit\mingw\html/lib/Encode/Encoding.html">the Encode::Encoding manpage</a></p>