Tutorial on Perl and Unicode
This is a tutorial on using Unicode in Perl.
Two types of strings
Strings in Perl can be one of two types. They can be either strings of bytes, or strings of Unicode characters.
In the following program, $x is a string of bytes.
#!/usr/bin/perl
my $x = '☺';
print length $x, " $x\n";
This prints out
3 ☺
The length of $x is 3, because the character ☺ (U+263A) is stored as the three bytes of its UTF-8 encoding.
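To see where the 3 comes from, the bytes of the string can be listed one by one. The following is a minimal sketch; the variable name $hex is only illustrative, and it assumes that the program file is saved in UTF-8.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# No "use utf8;" here, so the literal below is a string of bytes.
# Assumption: this source file is saved in UTF-8.
my $x = '☺';

# Format each byte of the string as two hexadecimal digits.
my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $x;

print length($x), " bytes: $hex\n";
```

Under that assumption, this should print "3 bytes: E2 98 BA", which is the UTF-8 encoding of U+263A.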
In the following program, $x is a string of Unicode characters.
#!/usr/bin/perl
use utf8;
my $x = '☺';
print length $x, " $x\n";
The line use utf8; tells Perl to interpret the program's source text as Unicode characters encoded in UTF-8.
This prints out
Wide character in print at use-utf8.pl line 4.
1 ☺
The length of $x is 1.
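One way to confirm that $x now holds a single character is to look at its code point with ord. This is a small sketch rather than part of the original programs; $code_point is an illustrative name. It prints only ASCII, so no "Wide character" warning is expected.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;

# With "use utf8;", the literal is one Unicode character.
my $x = '☺';

# ord returns the code point of the first (here, the only) character.
my $code_point = sprintf 'U+%04X', ord($x);

print length($x), " character: $code_point\n";
```

This should print "1 character: U+263A".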
Convert between the two types of strings
The Encode module converts between strings of Unicode characters and strings of bytes.
#!/usr/bin/perl
use Encode qw/encode_utf8 decode_utf8/;
my $bytes = '☺';
my $unicode = decode_utf8 ($bytes);
if ($unicode eq $bytes) {
    print "Same.\n";
}
else {
    print "Different.\n";
}
my $bytes_again = encode_utf8 ($unicode);
if ($bytes_again eq $bytes) {
    print "Same.\n";
}
else {
    print "Different.\n";
}
This prints out
Different.
Same.
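The two strings compare as different because one holds three bytes and the other holds one character. The sketch below makes that visible by printing the lengths. It uses the encode and decode functions of the Encode module with an explicit encoding name instead of encode_utf8 and decode_utf8; that substitution is a choice made for this example, not something the program above requires.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw/encode decode/;

# The three UTF-8 bytes of U+263A, written as escapes so that the
# example does not depend on how this file is saved.
my $bytes = "\xE2\x98\xBA";

my $unicode = decode('UTF-8', $bytes);        # bytes -> characters
my $bytes_again = encode('UTF-8', $unicode);  # characters -> bytes

printf "bytes: %d, characters: %d, bytes again: %d\n",
    length($bytes), length($unicode), length($bytes_again);
```

This should print "bytes: 3, characters: 1, bytes again: 3".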
Print Unicode characters
To print Unicode characters, set a UTF-8 output layer on STDOUT with binmode:
#!/usr/bin/perl
use utf8;
binmode STDOUT, ":utf8";
my $x = '☺';
print "$x\n";
"Wide character in print" appeared when the second program was run because Perl did not know what to do with the Unicode character.
The line binmode STDOUT, ":utf8"; tells Perl to encode Unicode characters into UTF-8 bytes before printing them.
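The effect of an output layer can be observed by writing to an in-memory filehandle and then counting the bytes that arrived. This is only a sketch: the names $buffer and $out are illustrative, and it uses the ":encoding(UTF-8)" layer, a stricter alternative to the ":utf8" layer shown above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;

my $x = '☺';

# Write to a scalar instead of STDOUT so the output bytes can be inspected.
my $buffer = '';
open my $out, '>:encoding(UTF-8)', \$buffer
    or die "Cannot open in-memory file: $!";
print {$out} $x;
close $out or die "Cannot close in-memory file: $!";

printf "%d character written as %d bytes\n", length($x), length($buffer);
```

This should print "1 character written as 3 bytes", and no "Wide character" warning should appear, because the layer tells Perl how to encode the character.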
Read a file as Unicode characters
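Perl can transform encodings when reading and writing files. For example, the "kanjidic" file is in the EUC-JP encoding. To read it into Perl and treat each kanji character as a Unicode character, open it with an encoding layer:
#!/usr/bin/perl
# The layer decodes the file's EUC-JP bytes into Unicode characters.
open my $fh, "<:encoding(euc-jp)", 'kanjidic'
    or die "Cannot open kanjidic: $!";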
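For readers who do not have kanjidic at hand, the same idea can be tried on a temporary file. The sketch below writes one line of EUC-JP text and reads it back through the encoding layer. The temporary file and its contents are stand-ins invented for this example, not the real kanjidic data.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Encode qw/encode/;
use File::Temp qw/tempfile/;

# Create a one-line EUC-JP file as a stand-in for "kanjidic".
my ($tmp, $file_name) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} encode('euc-jp', "亜 sample\n");
close $tmp or die "Cannot close $file_name: $!";

# Read it back, decoding EUC-JP bytes into Unicode characters.
open my $fh, '<:encoding(euc-jp)', $file_name
    or die "Cannot open $file_name: $!";
my $line = <$fh>;
close $fh;

my $first = substr $line, 0, 1;
printf "First character is U+%04X\n", ord($first);
```

This should print "First character is U+4E9C", the code point of 亜, which shows that the two bytes on disk were read as one character.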
Why use Unicode?
For processing non-English-language text documents. For example, in regexes, a dot (.) can match a kanji character.
#!/usr/bin/perl
open my $fh, "<:encoding(euc-jp)", 'kanjidic'
    or die "Cannot open kanjidic: $!";
binmode STDOUT, ':utf8';
while (<$fh>) {
    # The dot captures the first character of the line, whatever it is.
    if (/^(.)/) {
        print "Your kanji is '$1'.\n";
    }
}
close $fh;
This prints out
Your kanji is '#'.
Your kanji is '亜'.
Your kanji is '唖'.
Your kanji is '娃'.
Your kanji is '阿'.
Your kanji is '哀'.
Your kanji is '愛'.
Your kanji is '挨'.
It is also possible to use special character classes which match certain types of Unicode characters:
#!/usr/bin/perl
open my $fh, "<:encoding(euc-jp)", 'kanjidic'
    or die "Cannot open kanjidic: $!";
binmode STDOUT, ':utf8';
while (<$fh>) {
    # Capture the first character only if it is a CJK ideograph.
    if (/^(\p{InCJKUnifiedIdeographs})/) {
        print "Your kanji is '$1'.\n";
    }
}
close $fh;
The \p{InCJKUnifiedIdeographs} class matches only characters in the CJK Unified Ideographs block, so the # is not matched. It prints out
Your kanji is '亜'.
Your kanji is '唖'.
Your kanji is '娃'.
Your kanji is '阿'.
Your kanji is '哀'.
Your kanji is '愛'.
Your kanji is '挨'.
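The character class can also be tried without the kanjidic file. The sketch below tests four sample characters chosen for this example (a punctuation mark, a Latin letter, a kanji, and a hiragana) and prints their code points, so no output layer is needed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;

my @kanji;
for my $char ('#', 'a', '亜', 'あ') {
    # True only for characters in the CJK Unified Ideographs block.
    my $matches = $char =~ /^\p{InCJKUnifiedIdeographs}$/;
    push @kanji, $char if $matches;
    printf "U+%04X %s\n", ord($char), $matches ? 'matches' : 'does not match';
}
```

Only the third line of output, "U+4E9C matches", should report a match. The hiragana あ (U+3042) is Japanese text but lies outside the ideograph block, so it does not match.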
This kind of matching is possible with byte strings, but it is much more complicated.
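A short sketch shows where the complication comes from. In a byte string, the dot matches a single byte, which is only part of a multi-byte character. The variable names here are illustrative, and the byte string is produced with the Encode module to keep the example independent of any file.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Encode qw/encode/;

my $chars = '亜';                      # one Unicode character
my $bytes = encode('UTF-8', $chars);   # the same text as three bytes

# The same regex captures different things from the two strings.
my ($char_match) = $chars =~ /^(.)/;
my ($byte_match) = $bytes =~ /^(.)/;

printf "Character string: dot matched U+%04X\n", ord($char_match);
printf "Byte string: dot matched byte 0x%02X\n", ord($byte_match);
```

This should print "Character string: dot matched U+4E9C" and then "Byte string: dot matched byte 0xE4". The second match is just the first of the three bytes that encode 亜, so a byte-string regex would have to describe the byte patterns of the encoding itself.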
Notes
The kanjidic file
The kanjidic file is a list of data about the Chinese characters ("kanji") used in Japanese. Kanjidic is a product of J.W. Breen's Electronic Dictionary Research and Development Group. It may be downloaded from The Monash Nihongo ftp Archive.
Web links
- UTF-8 and Unicode FAQ for Unix/Linux
  This FAQ by Markus Kuhn details Unicode and UTF-8 encoding issues.
- perluniintro - Perl Unicode introduction
  This introduces how to use Unicode in Perl. It is written for people who already know a lot about Perl.
- perlunicode - Unicode support in Perl
  This documents Unicode support in Perl. It is a very technical reference.