Tutorial on Perl and Unicode

This is a tutorial on using Unicode in Perl.


Two types of strings

A string in Perl is one of two types: a string of bytes or a string of Unicode characters.

In the following program, $x is a string of bytes.

#!/usr/bin/perl
my $x = '☺';
print length $x, " $x\n";


This prints out

3 ☺

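The length of $x is 3 because, without use utf8;, Perl sees the three UTF-8 bytes which encode ☺ in the program's source file.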
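
To see where the 3 comes from, the bytes can be printed in hexadecimal. This is a minimal sketch rather than one of the tutorial's own programs. It writes the bytes with \x escapes, an assumption made so that the result does not depend on how the file is saved, and $hex is a name chosen for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
# The three bytes which a UTF-8 source file contains for '☺'.
my $x = "\xE2\x98\xBA";
# Format each byte of the string as two hexadecimal digits.
my $hex = join ' ', map { sprintf '%02X', ord } split //, $x;
print length $x, " bytes: $hex\n";
```

This should print "3 bytes: E2 98 BA", the UTF-8 encoding of U+263A.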

In the following program, $x is a string of Unicode characters.

#!/usr/bin/perl
use utf8;
my $x = '☺';
print length $x, " $x\n";


The pragma use utf8; tells Perl that the program's source text is encoded in UTF-8, so string literals such as '☺' are read as Unicode characters rather than as bytes.

This prints out

Wide character in print at use-utf8.pl line 4.
1 ☺

The length of $x is 1 because Perl now reads ☺ as a single Unicode character.
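
The character's code point can be checked with ord. This is a small sketch under the assumption that the escape "\x{263A}" stands in for the literal '☺' read under use utf8;. The variable $code_point is illustrative only.

```perl
#!/usr/bin/perl
use strict;
use warnings;
# "\x{263A}" is the same single character that '☺' gives under "use utf8".
my $x = "\x{263A}";
my $code_point = sprintf 'U+%04X', ord $x;
# Only ASCII text is printed, so no "Wide character" warning is produced.
print length $x, " $code_point\n";
```

This should print "1 U+263A".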

Convert between the two types of strings

The Encode module converts between strings of Unicode characters and strings of bytes.

#!/usr/bin/perl
use Encode qw/encode_utf8 decode_utf8/;
my $bytes = '☺';
my $unicode = decode_utf8 ($bytes);
if ($unicode eq $bytes) {
    print "Same.\n";
}
else {
    print "Different.\n";
}
my $bytes_again = encode_utf8 ($unicode);
if ($bytes_again eq $bytes) {
    print "Same.\n";
}
else {
    print "Different.\n";
}



This prints out

Different.
Same.
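
The first comparison fails because decoding turns three bytes into one character, and encoding turns that character back into the same three bytes. The lengths make this visible. The following is a sketch which assumes the bytes are written as \x escapes instead of a literal '☺', so that it does not depend on the encoding of the source file.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw/encode_utf8 decode_utf8/;
# The UTF-8 bytes for '☺'.
my $bytes = "\xE2\x98\xBA";
# Bytes to characters.
my $unicode = decode_utf8 ($bytes);
# Characters back to bytes.
my $bytes_again = encode_utf8 ($unicode);
printf "%d %d %d\n", length ($bytes), length ($unicode), length ($bytes_again);
```

This should print "3 1 3".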

To print Unicode characters without a warning, set an encoding on STDOUT before printing:

#!/usr/bin/perl
use utf8;
binmode STDOUT, ":utf8";
my $x = '☺';
print "$x\n";


"Wide character in print" appeared when the second program was run because no encoding had been set on STDOUT, so Perl did not know how to turn the Unicode character into bytes for output.

binmode STDOUT, ":utf8"; tells Perl to encode Unicode characters into UTF-8-encoded bytes before printing them.
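
What an output layer does can be inspected by printing to an in-memory file and looking at the bytes which arrive there. This is only a sketch. It uses the :encoding(UTF-8) layer, a stricter relative of :utf8, and the names $buffer, $out and $hex are assumptions made for the illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
# One Unicode character, the same as '☺' under "use utf8".
my $x = "\x{263A}";
my $buffer = '';
# Open an in-memory file which encodes characters to UTF-8 as they are printed.
open my $out, '>:encoding(UTF-8)', \$buffer or die "Cannot open in-memory file: $!";
print {$out} $x;
close $out or die "Cannot close in-memory file: $!";
# $buffer now holds the encoded bytes; show them in hexadecimal.
my $hex = join ' ', map { sprintf '%02X', ord } split //, $buffer;
print length ($buffer), " bytes: $hex\n";
```

This should print "3 bytes: E2 98 BA", that is, one character went in and three bytes came out.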

Read a file as Unicode characters

Perl can convert between encodings when reading and writing files. For example, the "kanjidic" file is in the EUC-JP encoding. To read it into Perl and treat each kanji character as a Unicode character:

#!/usr/bin/perl
open my $fh, "<:encoding(euc-jp)", 'kanjidic' or die "Cannot open kanjidic: $!";

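
The effect of the :encoding(euc-jp) layer can be tried without the kanjidic file. In this sketch an in-memory file holding the two EUC-JP bytes for 亜 stands in for the real file. That substitution, and the names $euc_jp_bytes and $line, are assumptions made for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
# The EUC-JP bytes for the kanji 亜, followed by a newline.
my $euc_jp_bytes = "\xB0\xA1\n";
# Read the bytes through the same layer as is used for kanjidic.
open my $fh, '<:encoding(euc-jp)', \$euc_jp_bytes or die "Cannot open in-memory file: $!";
my $line = <$fh>;
close $fh or die "Cannot close in-memory file: $!";
chomp $line;
# ord looks at the first character of the line.
printf "%d character, U+%04X\n", length ($line), ord ($line);
```

This should print "1 character, U+4E9C": the two bytes were read as the single Unicode character 亜.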

Why use Unicode?

Strings of Unicode characters make it much easier to process text which is not in English. For example, in regexes, a dot (.) can match a whole kanji character rather than a single byte.

#!/usr/bin/perl
open my $fh, "<:encoding(euc-jp)", 'kanjidic' or die "Cannot open kanjidic: $!";
binmode STDOUT, ':utf8';
while (<$fh>) {
    if (/^(.)/) {
        print "Your kanji is '$1'.\n";
    }
}
close $fh;


This prints out

Your kanji is '#'.
Your kanji is '亜'.
Your kanji is '唖'.
Your kanji is '娃'.
Your kanji is '阿'.
Your kanji is '哀'.
Your kanji is '愛'.
Your kanji is '挨'.

It is also possible to use Unicode properties, which match particular kinds of Unicode characters:

#!/usr/bin/perl
open my $fh, "<:encoding(euc-jp)", 'kanjidic' or die "Cannot open kanjidic: $!";
binmode STDOUT, ':utf8';
while (<$fh>) {
    if (/^(\p{InCJKUnifiedIdeographs})/) {
        print "Your kanji is '$1'.\n";
    }
}
close $fh;


\p{InCJKUnifiedIdeographs} matches only characters in the CJK Unified Ideographs block, so the # at the start of the first line no longer matches. It prints out

Your kanji is '亜'.
Your kanji is '唖'.
Your kanji is '娃'.
Your kanji is '阿'.
Your kanji is '哀'.
Your kanji is '愛'.
Your kanji is '挨'.

This kind of matching is possible with byte strings, but it is much more complicated.
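
The difference can be seen by running the same patterns against the UTF-8 bytes of 亜 and against the decoded character. This is a sketch, not one of the tutorial's programs. The variable names are illustrative, and the bytes are written as \x escapes on the assumption that this keeps the result independent of the source file's encoding.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw/decode_utf8/;
# The UTF-8 bytes for the kanji 亜 (U+4E9C).
my $bytes = "\xE4\xBA\x9C";
my $chars = decode_utf8 ($bytes);
# In the byte string a dot matches one byte; in the character string, one kanji.
my ($from_bytes) = $bytes =~ /^(.)/;
my ($from_chars) = $chars =~ /^(.)/;
# The property only matches when the string holds Unicode characters.
my $bytes_match = $bytes =~ /\p{InCJKUnifiedIdeographs}/ ? 'yes' : 'no';
my $chars_match = $chars =~ /\p{InCJKUnifiedIdeographs}/ ? 'yes' : 'no';
printf "bytes: dot matched 0x%X, kanji match: %s\n", ord ($from_bytes), $bytes_match;
printf "chars: dot matched 0x%X, kanji match: %s\n", ord ($from_chars), $chars_match;
```

This should print "bytes: dot matched 0xE4, kanji match: no" and then "chars: dot matched 0x4E9C, kanji match: yes". To get the same result from byte strings, the pattern would have to describe the byte sequences of the encoding itself.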


Notes

The kanjidic file

The kanjidic file is a list of data about the Chinese characters ("kanji") used in Japanese. Kanjidic is a product of J.W. Breen's Electronic Dictionary Research and Development Group. It may be downloaded from The Monash Nihongo ftp Archive.




Copyright © Ben Bullock 2009-2023. All rights reserved. For comments, questions, and corrections, please email Ben Bullock (benkasminbullock@gmail.com) or use the discussion group at Google Groups.