Some Japanese web pages don't have any encoding information on
them. Take サーフシティー・ブルース for example. The <head>
part of the page contains no information at all on what encoding it
might be:

    <!DOCTYPE HTML PUBLIC "HTML 3.2">
    <!-- IBM HomePage Builder for Windows version 2.0 -->
    <!-- Wed Dec 31 18:19:03 1997 -->
    <HTML>
    <HEAD>
    <TITLE></TITLE>
    </HEAD>

and the HTTP headers, saved to a file named headers by

    curl -D headers http://homepage3.nifty.com/yusaku/tantei/02.htm

don't tell you anything either:

    HTTP/1.1 200 OK
    Server: Zeus/4.3
    Date: Tue, 02 Nov 2010 10:12:35 GMT
    Content-Length: 4832
    Accept-Ranges: bytes
    Content-Type: text/html
    Last-Modified: Tue, 10 Feb 2004 08:02:15 GMT

Your web browser probably has a built-in way to guess the encoding, but what if you are looking at the page some other way, such as downloading it with a script? Encode::Guess to the rescue.
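Before falling back to guessing, a script can confirm for itself that neither the Content-Type header nor the markup declares a charset. The following is only a minimal sketch: the helper name declared_charset is hypothetical, and the regular expression is deliberately loose rather than a full header or HTML parser.

```perl
use warnings;
use strict;

# Hypothetical helper: return the charset declared in a Content-Type
# header value or in the HTML itself (e.g. a <meta> tag), lowercased,
# or an empty return (undef in scalar context) if neither declares one.
sub declared_charset
{
    my ($content_type, $html) = @_;
    for my $source ($content_type, $html) {
        next unless defined $source;
        if ($source =~ /charset\s*=\s*["']?([\w.:-]+)/i) {
            return lc $1;
        }
    }
    return;
}

# Neither the header value nor the <HEAD> of the example page declares
# a charset, so the helper finds nothing here.
my $charset = declared_charset ('text/html', '<HEAD> <TITLE></TITLE> </HEAD>');
print defined $charset ? $charset : 'none', "\n";
```

Had the server sent something like "text/html; charset=Shift_JIS", the helper would return "shift_jis" and no guessing would be needed.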
This simple script downloads the file, guesses its encoding with Perl's Encode::Guess module, part of Perl's standard distribution, and prints the decoded text as UTF-8:
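
    #!/home/ben/software/install/bin/perl
    use warnings;
    use strict;
    use Encode 'decode';
    use Encode::Guess;
    use LWP::Simple;
    binmode STDOUT, ":utf8";
    my $page = 'http://homepage3.nifty.com/yusaku/tantei/02.htm';
    my $text = get ($page);
    # get returns undef if the download fails
    die "Could not download '$page'" unless defined $text;
    # Try some Japanese encodings
    my $encoding = guess_encoding ($text, qw/euc-jp shiftjis iso-2022-jp/);
    # guess_encoding returns an object on success and an error message
    # string on failure, so test with ref rather than for truth.
    if (ref $encoding) {
        $text = decode ($encoding, $text);
        print $text;
    }
    else {
        warn "Could not guess encoding of '$page': $encoding";
    }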
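
The guessing step can also be tried without a network connection. guess_encoding returns an Encode::Encoding object on success and a plain error message string on failure, so testing the result with ref tells the two apart. The sketch below is an illustration rather than part of the original script: it encodes the page title into each candidate encoding and asks Encode::Guess to identify the bytes. The helper name guessed_name is hypothetical.

```perl
use warnings;
use strict;
use utf8;
use Encode 'encode';
use Encode::Guess;

binmode STDOUT, ":utf8";

my @suspects = qw/euc-jp shiftjis iso-2022-jp/;

# Hypothetical helper: return the name of the guessed encoding, or an
# empty return (with a warning) if guess_encoding gave back an error
# message string instead of an encoding object.
sub guessed_name
{
    my ($bytes) = @_;
    my $guess = guess_encoding ($bytes, @suspects);
    return $guess->name if ref $guess;
    warn "Guess failed: $guess\n";
    return;
}

# Sample text: the title of the example page.
my $sample = 'サーフシティー・ブルース';

for my $name (@suspects) {
    # Turn the character string into bytes in this encoding.
    my $bytes = encode ($name, $sample);
    my $guess = guessed_name ($bytes);
    print "$name bytes guessed as ", defined $guess ? $guess : 'unknown', "\n";
}
```

Short inputs can be valid in more than one suspect encoding, in which case guess_encoding reports the ambiguity in its error string instead of picking one, so longer text generally gives a more reliable guess.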