Guessing the encoding of a Japanese web page with Encode::Guess

Some Japanese web pages do not declare their encoding anywhere. Take サーフシティー・ブルース for example. The <head> part of the page says nothing at all about what encoding it might be in:

<!DOCTYPE HTML PUBLIC "HTML 3.2">
<!-- IBM HomePage Builder for Windows  version 2.0 -->
<!-- Wed Dec 31 18:19:03 1997 -->

<HTML>
<HEAD>
<TITLE></TITLE>
</HEAD>
and the HTTP headers, saved to a file headers by
curl -D headers http://homepage3.nifty.com/yusaku/tantei/02.htm
don't tell you anything either:
HTTP/1.1 200 OK
Server: Zeus/4.3
Date: Tue, 02 Nov 2010 10:12:35 GMT
Content-Length: 4832
Accept-Ranges: bytes
Content-Type: text/html
Last-Modified: Tue, 10 Feb 2004 08:02:15 GMT
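
To make the same check from a script rather than by eye, something like the following minimal sketch could be used. The helper names declared_charset_from_header and declared_charset_from_html are hypothetical, not part of any module, and the regular expressions are deliberately simple stand-ins for a real HTTP or HTML parser.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the charset parameter out of a Content-Type
# header value. Returns undef if no charset is declared.
sub declared_charset_from_header
{
    my ($content_type) = @_;
    return unless defined $content_type;
    if ($content_type =~ /;\s*charset\s*=\s*"?([^\s;"]+)/i) {
        return $1;
    }
    return;
}

# Hypothetical helper: look for a charset inside a <meta> tag.
# Returns undef if the HTML contains no such declaration.
sub declared_charset_from_html
{
    my ($html) = @_;
    return unless defined $html;
    if ($html =~ /<meta\b[^>]*\bcharset\s*=\s*["']?([A-Za-z0-9._:-]+)/i) {
        return $1;
    }
    return;
}

# The Content-Type value from the headers shown above.
my $charset = declared_charset_from_header ('text/html');
print defined ($charset) ? "Declared: $charset\n" : "No charset declared\n";
```

For the headers and <head> shown above, both helpers return undef, which is the case where guessing becomes necessary.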

Your web browser probably has a built-in way to guess the encoding, but what if you are getting the page some other way, such as downloading it with a script? Encode::Guess to the rescue.

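This simple script downloads the page, guesses its encoding using Perl's Encode::Guess module, part of Perl's standard distribution, decodes the bytes into Perl characters, and prints them as UTF-8. The download itself relies on the libwww-perl distribution from CPAN: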

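#!/usr/bin/env perl
use warnings;
use strict;
use Encode::Guess;
use LWP::UserAgent;
binmode STDOUT, ":encoding(UTF-8)";
my $page = 'http://homepage3.nifty.com/yusaku/tantei/02.htm';
my $response = LWP::UserAgent->new->get ($page);
die "Could not download '$page': " . $response->status_line
    unless $response->is_success;
# Use content rather than decoded_content, so that the raw bytes reach
# guess_encoding untouched.
my $bytes = $response->content;
# Try some Japanese encodings as well as the default suspects.
my $encoding = guess_encoding ($bytes, qw/euc-jp shiftjis iso-2022-jp/);
# guess_encoding returns an encoding object on success and an error
# message, which is a true value, on failure, so test with ref.
if (ref $encoding) {
    print $encoding->decode ($bytes);
}
else {
    warn "Could not guess encoding of '$page': $encoding";
}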
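
The guessing step can also be tried without a network connection, on bytes made up locally. The following is a minimal sketch under that assumption: the sample text (the three characters 日本語) and the helper name decode_guessed are illustrative and do not come from the script above. It relies on guess_encoding returning an encoding object on success and a plain error string on failure.

```perl
use strict;
use warnings;
use Encode 'encode';
use Encode::Guess;

# Hypothetical helper: guess the encoding of $bytes among the default
# suspects plus @suspects, and return the decoded text together with
# the name of the encoding. Dies with guess_encoding's message on failure.
sub decode_guessed
{
    my ($bytes, @suspects) = @_;
    my $encoding = guess_encoding ($bytes, @suspects);
    # A failed guess is an error string, not an object.
    die "Guess failed: $encoding\n" unless ref $encoding;
    return ($encoding->decode ($bytes), $encoding->name);
}

binmode STDOUT, ":encoding(UTF-8)";
# The characters U+65E5 U+672C U+8A9E, written as escapes so that the
# script itself stays pure ASCII, then encoded as Shift-JIS bytes.
my $sample = encode ('shiftjis', "\x{65E5}\x{672C}\x{8A9E}");
my ($text, $name) = decode_guessed ($sample, qw/euc-jp shiftjis iso-2022-jp/);
print "Guessed $name: $text\n";
```

Very short or ambiguous input can be valid in more than one of the suspect encodings, in which case guess_encoding reports the candidates in its error string rather than picking one, so the failure branch is worth keeping.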


Copyright © Ben Bullock 2009-2012. All rights reserved. For comments, questions, and corrections, please email Ben Bullock (ben.bullock@lemoda.net).