Tutorial on Perl and Unicode

This is a tutorial on using Unicode in Perl.

Two types of strings
Convert the two types of strings
Print Unicode characters
Read a file as Unicode characters
Why use Unicode?
Notes

Two types of strings

Strings in Perl can be one of two types. They can be either strings of bytes, or strings of Unicode characters.

In the following program, $x is a string of bytes.

#!/usr/bin/perl
my $x = '☺';
print length $x, " $x\n";

This prints out

3 ☺

The length of $x is 3.

In the following program, $x is a string of Unicode characters.

#!/usr/bin/perl
use utf8;
my $x = '☺';
print length $x, " $x\n";

The statement use utf8; tells Perl that the program's own source text is encoded in UTF-8, so the string literal '☺' is read as one Unicode character rather than as three bytes.

This prints out

Wide character in print at use-utf8.pl line 4.
1 ☺

The length of $x is 1.
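
The difference between the two lengths can be made visible by dumping the contents of each string. The following is a minimal sketch rather than part of the original programs: hex_dump is a helper name invented here, and the smiley is written with escapes so that the result does not depend on how the source file is saved.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The same smiley written two ways.
my $bytes = "\xE2\x98\xBA";   # three bytes: the UTF-8 encoding of U+263A
my $chars = "\x{263A}";       # one character: U+263A WHITE SMILING FACE

# Hypothetical helper: show each element of a string as a hex number.
sub hex_dump {
    my ($string) = @_;
    return join ' ', map { sprintf '%X', ord ($_) } split //, $string;
}

my $bytes_dump = hex_dump ($bytes);
my $chars_dump = hex_dump ($chars);
print length ($bytes), ": $bytes_dump\n";
print length ($chars), ": $chars_dump\n";
```

The byte string holds the three values E2 98 BA, while the character string holds the single value 263A.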

Convert the two types of strings

The Encode module converts between strings of Unicode characters and strings of bytes.

#!/usr/bin/perl
use Encode qw/encode_utf8 decode_utf8/;
my $bytes = '☺';
my $unicode = decode_utf8 ($bytes);
if ($unicode eq $bytes) {
    print "Same.\n";
}
else {
    print "Different.\n";
}
my $bytes_again = encode_utf8 ($unicode);
if ($bytes_again eq $bytes) {
    print "Same.\n";
}
else {
    print "Different.\n";
}

This prints out

Different.
Same.
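
Encode also provides the more general encode and decode functions, which take the name of an encoding as their first argument. As a rough illustration, assuming the same smiley character as above (written here with escapes), the sketch below decodes UTF-8 bytes into one character and then encodes that character into a different set of bytes.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw/encode decode/;

my $bytes = "\xE2\x98\xBA";                  # UTF-8 bytes for U+263A
my $unicode = decode ('UTF-8', $bytes);      # bytes to characters
my $utf16 = encode ('UTF-16BE', $unicode);   # characters to other bytes

printf "UTF-8 bytes: %d, characters: %d, UTF-16BE bytes: %d\n",
    length ($bytes), length ($unicode), length ($utf16);
```

The character stays the same throughout; only its byte representation changes with the encoding.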

Print Unicode characters

To print Unicode characters, tell Perl which encoding to use on the output filehandle:

#!/usr/bin/perl
use utf8;
binmode STDOUT, ":utf8";
my $x = '☺';
print "$x\n";

"Wide character in print" appeared when the second program was run because no encoding had been set on STDOUT, so Perl did not know how to turn the Unicode character into bytes.

binmode STDOUT, ":utf8"; tells Perl to encode Unicode characters as UTF-8 bytes before printing them.
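
The same kind of layer can be given when a filehandle is opened. The sketch below is only an illustration: it prints to an in-memory file (the $buffer variable is an assumption made for this example) so that the bytes produced by the layer can be inspected, and it uses the layer name :encoding(UTF-8) in place of :utf8.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $x = "\x{263A}";    # one character, U+263A

# Open an in-memory file so the "printed" bytes can be examined.
my $buffer = '';
open my $out, '>:encoding(UTF-8)', \$buffer or die "Cannot open: $!";
print {$out} $x;
close $out or die "Cannot close: $!";

# Show what the output layer actually wrote.
my $written = join ' ', map { sprintf '%02X', ord ($_) } split //, $buffer;
print length ($buffer), " bytes written: $written\n";
```

No "Wide character" warning is expected here, because the filehandle has an encoding layer.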

Read a file as Unicode characters

Perl can transform encodings when reading and writing files. For example, the "kanjidic" file is in the EUC-JP encoding. To read it into Perl and treat each kanji character as a Unicode character:

#!/usr/bin/perl
open my $fh, "<:encoding(euc-jp)", 'kanjidic'
    or die "Cannot open kanjidic: $!";
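
For readers without a copy of kanjidic, the effect of the decoding layer can be tried on a small stand-in file. This is a sketch under stated assumptions: the temporary file and its one-line contents are invented for the example and only imitate the start of a kanjidic entry.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw/encode/;
use File::Temp qw/tempfile/;

# Write one line of EUC-JP encoded text to a temporary file.
my ($tmp, $filename) = tempfile (UNLINK => 1);
binmode $tmp;
print {$tmp} encode ('euc-jp', "\x{4E9C} 3021\n");
close $tmp or die "Cannot close $filename: $!";

# Read it back through the decoding layer.
open my $fh, '<:encoding(euc-jp)', $filename
    or die "Cannot open $filename: $!";
my $line = <$fh>;
close $fh or die "Cannot close $filename: $!";

# The first element of the line is now one Unicode character.
my $first = substr ($line, 0, 1);
binmode STDOUT, ':utf8';
printf "First character: %s (U+%04X)\n", $first, ord ($first);
```

On disk the kanji occupies two bytes, but after decoding it is a single character in $line.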

Why use Unicode?

Unicode strings make it much easier to process text in languages other than English. For example, in regexes, a dot (.) can match a whole kanji character.

#!/usr/bin/perl
open my $fh, "<:encoding(euc-jp)", 'kanjidic'
    or die "Cannot open kanjidic: $!";
binmode STDOUT, ':utf8';
while (<$fh>) {
    if (/^(.)/) {
        print "Your kanji is '$1'.\n";
    }
}
close $fh;

This prints out

Your kanji is '#'.
Your kanji is '亜'.
Your kanji is '唖'.
Your kanji is '娃'.
Your kanji is '阿'.
Your kanji is '哀'.
Your kanji is '愛'.
Your kanji is '挨'.
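
What the dot matches depends on the type of string. The following sketch, which uses the UTF-8 bytes of the first kanji above written as escapes, applies the same regex to a byte string and to the decoded character string.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw/decode/;

my $bytes = "\xE4\xBA\x9C";               # UTF-8 bytes for U+4E9C
my $unicode = decode ('UTF-8', $bytes);   # the single character U+4E9C

# Against the byte string, the dot matches only the first byte.
my ($from_bytes) = $bytes =~ /^(.)/;
# Against the character string, the dot matches the whole kanji.
my ($from_unicode) = $unicode =~ /^(.)/;

printf "bytes: matched %02X\n", ord ($from_bytes);
printf "characters: matched U+%04X\n", ord ($from_unicode);
```

With the byte string, the captured text is only a fragment of the kanji, which is why the file is decoded before matching.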

It is also possible to use special character classes which match certain types of Unicode characters:

#!/usr/bin/perl
open my $fh, "<:encoding(euc-jp)", 'kanjidic'
    or die "Cannot open kanjidic: $!";
binmode STDOUT, ':utf8';
while (<$fh>) {
    if (/^(\p{InCJKUnifiedIdeographs})/) {
        print "Your kanji is '$1'.\n";
    }
}
close $fh;

The \p{InCJKUnifiedIdeographs} class matches only characters in the CJK Unified Ideographs block, so the line beginning with # is skipped. It prints out

Your kanji is '亜'.
Your kanji is '唖'.
Your kanji is '娃'.
Your kanji is '阿'.
Your kanji is '哀'.
Your kanji is '愛'.
Your kanji is '挨'.
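
Other properties can be used in the same way. As an illustration only, the sketch below uses the script properties \p{Han}, \p{Hiragana} and \p{Katakana}, which are different properties from the block used above. The helper name script_of and the sample characters are assumptions made for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: classify one character by its script.
sub script_of {
    my ($char) = @_;
    return 'kanji'    if $char =~ /\A\p{Han}\z/;
    return 'hiragana' if $char =~ /\A\p{Hiragana}\z/;
    return 'katakana' if $char =~ /\A\p{Katakana}\z/;
    return 'other';
}

# U+611B is a kanji, U+3042 is hiragana, U+30A2 is katakana.
for my $char ("\x{611B}", "\x{3042}", "\x{30A2}", '#') {
    printf "U+%04X is %s\n", ord ($char), script_of ($char);
}
```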

This kind of matching is possible with byte strings, but it is much more complicated.
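
To give an idea of the extra work involved, here is a sketch of matching the same block against UTF-8 bytes. It assumes the input is valid UTF-8 and relies on the fact that U+4E00 to U+9FFF is encoded as the byte sequences E4 B8 80 through E9 BF BF. The variable names are invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw/encode/;

# Byte-level pattern for the UTF-8 encodings of U+4E00 to U+9FFF.
my $kanji_bytes =
    qr/(?:\xE4[\xB8-\xBF][\x80-\xBF]|[\xE5-\xE9][\x80-\xBF]{2})/;

# A byte string that starts with the UTF-8 encoding of U+4E9C.
my $line = encode ('UTF-8', "\x{4E9C} 3021");
my ($kanji) = $line =~ /^($kanji_bytes)/;
printf "Matched %d bytes\n", length ($kanji);
```

The pattern has to be worked out by hand from the encoding, and a different one would be needed for each encoding, whereas \p{InCJKUnifiedIdeographs} works on any decoded string.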

Notes

The kanjidic file

The kanjidic file is a list of data about the Chinese characters ("kanji") used in Japanese. Kanjidic is a product of J.W. Breen's Electronic Dictionary Research and Development Group. It may be downloaded from The Monash Nihongo ftp Archive.

Web links

UTF-8 and Unicode FAQ for Unix/Linux
This FAQ by Markus Kuhn covers Unicode and UTF-8 encoding issues in detail.
perluniintro - Perl Unicode introduction
This introduces how to use Unicode in Perl. It is written for people who already know a lot about Perl.
perlunicode - Unicode support in Perl
This documents Unicode support in Perl. It is a very technical reference.
