Tutorial on Perl and Unicode

This is a tutorial on using Unicode in Perl.

Two types of strings

A string in Perl is one of two types: either a string of bytes, or a string of Unicode characters.

In the following program, $x is a string of bytes.

#!/usr/bin/perl
my $x = '☺';
print length $x, " $x\n";


This prints out

3 ☺

The length of $x is 3, because Perl counts bytes here, and ☺ is stored as three bytes in the UTF-8 format.
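
To see where the 3 comes from, the bytes of the string can be listed one by one. The following is a minimal sketch, assuming the program file is saved in the UTF-8 format; the variable name $hex is only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# There is no "use utf8" here, so the literal is read as raw bytes.
# This assumes the file itself is saved in the UTF-8 format.
my $x = '☺';

# unpack 'C*' returns the numeric value of each byte in the string.
my $hex = join ' ', map { sprintf('%02X', $_) } unpack 'C*', $x;

print length($x), " bytes: $hex\n";
```

This should print "3 bytes: E2 98 BA", which is the UTF-8 encoding of ☺.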

In the following program, $x is a string of Unicode characters.

#!/usr/bin/perl
use utf8;
my $x = '☺';
print length $x, " $x\n";


The line use utf8; tells Perl that the text of the program is in the UTF-8 format, so the string literal is read as Unicode characters rather than as bytes.

This prints out

Wide character in print at use-utf8.pl line 4.
1 ☺

The length of $x is 1, because Perl now counts characters rather than bytes.
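
The single character can be inspected with ord, which returns its Unicode code point. This is a small sketch under the same assumption that the file is saved as UTF-8; the name $code_point is illustrative. It prints only ASCII text, so it does not produce the "Wide character" warning.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;    # the literal below is decoded from UTF-8 into characters

my $x = '☺';

# ord returns the code point of the first character of the string.
my $code_point = sprintf('U+%04X', ord($x));

print length($x), " character: $code_point\n";
```

This should print "1 character: U+263A".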

How to convert the two types of strings

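The Encode module converts between strings of Unicode characters and strings of bytes. Its decode_utf8 function converts UTF-8 bytes into characters, and encode_utf8 converts characters back into UTF-8 bytes.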

#!/usr/bin/perl
use Encode qw/encode_utf8 decode_utf8/;
my $bytes = '☺';
my $unicode = decode_utf8 ($bytes);
if ($unicode eq $bytes) {
    print "Same.\n";
}
else {
    print "Different.\n";
}
my $bytes_again = encode_utf8 ($unicode);
if ($bytes_again eq $bytes) {
    print "Same.\n";
}
else {
    print "Different.\n";
}



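This prints out

Different.
Same.

The first comparison fails because $unicode holds one character, while $bytes holds three bytes. Encoding $unicode again gives back the original bytes, so the second comparison succeeds.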
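
The effect of the conversion on the string lengths can be checked directly. The sketch below uses the more general encode and decode functions of Encode, with the encoding name given explicitly, rather than the encode_utf8 and decode_utf8 shortcuts. The bytes are written as escapes so that the result does not depend on how the file is saved. The variable names are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw/encode decode/;

my $bytes      = "\xE2\x98\xBA";            # the UTF-8 bytes of U+263A
my $chars      = decode('UTF-8', $bytes);   # bytes to characters
my $round_trip = encode('UTF-8', $chars);   # characters back to bytes

printf "%d bytes, %d character, %d bytes\n",
    length($bytes), length($chars), length($round_trip);
```

This should print "3 bytes, 1 character, 3 bytes".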

How to print Unicode characters

Perl needs to be told which encoding to use when it prints Unicode characters. In the following program, binmode tells it to write to STDOUT in the UTF-8 format:

#!/usr/bin/perl
use utf8;
binmode STDOUT, ":utf8";
my $x = '☺';
print "$x\n";


The message "Wide character in print" appeared when the second program was run because Perl had not been told how to encode the Unicode character for output. When that happens, Perl prints the character in the UTF-8 format and also prints a warning message.
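
What an output layer does can be made visible by printing to an in-memory filehandle instead of STDOUT and then looking at the bytes that were written. This is only a sketch: the in-memory filehandle and the names $buffer and $out are assumptions made for the illustration, and :encoding(UTF-8) is used here as the stricter relative of the :utf8 layer.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $x = "\x{263A}";    # one character, the same as '☺' under "use utf8"

# Open an in-memory filehandle with an encoding layer. It stands in
# for STDOUT so that the bytes which get written can be inspected.
my $buffer = '';
open my $out, '>:encoding(UTF-8)', \$buffer
    or die "Cannot open in-memory file: $!";
print {$out} $x;       # no "Wide character" warning: the layer encodes it
close $out or die "Cannot close in-memory file: $!";

my $hex = join ' ', map { sprintf('%02X', $_) } unpack 'C*', $buffer;
print "Wrote ", length($buffer), " bytes: $hex\n";
```

This should print "Wrote 3 bytes: E2 98 BA": the layer turned the one character into three bytes on its way out.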

How to read a file as Unicode characters

Perl can transform encodings when reading and writing files. For example, the "kanjidic" file is in the EUC-JP encoding. To read it into Perl and treat each kanji character as a Unicode character, give the encoding in the call to open:

#!/usr/bin/perl
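# The encoding layer decodes EUC-JP bytes into characters on reading.
open my $fh, "<:encoding(euc-jp)", 'kanjidic'
    or die "Cannot open kanjidic: $!";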
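
The effect of the layer can be tried without a copy of kanjidic. The following sketch writes two EUC-JP bytes to a temporary file, which is a hypothetical stand-in for the real file, and reads them back through the decoding layer. The bytes B0 A1 are the EUC-JP encoding of the kanji 亜 (U+4E9C).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw/tempfile/;

# A temporary file standing in for kanjidic (an assumption for this
# example). It holds the EUC-JP bytes B0 A1 followed by a newline.
my ($tmp, $filename) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "\xB0\xA1\n";
close $tmp or die "Cannot close $filename: $!";

# Reopen the file with a decoding layer, checking for failure.
open my $fh, '<:encoding(euc-jp)', $filename
    or die "Cannot open $filename: $!";
my $line = <$fh>;
close $fh or die "Cannot close $filename: $!";
chomp $line;

printf "Read %d character: U+%04X\n", length($line), ord($line);
```

This should print "Read 1 character: U+4E9C": the two bytes in the file arrive in the program as one character.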


What advantages do Unicode strings have?

The advantage of using Unicode strings rather than byte strings is in processing text which is not in English. For example, in regular expressions, a dot (.) matches one whole kanji character, rather than one byte of it.

#!/usr/bin/perl
open my $fh, "<:encoding(euc-jp)", 'kanjidic'
    or die "Cannot open kanjidic: $!";
binmode STDOUT, ':utf8';
while (<$fh>) {
    if (/^(.)/) {
        print "Your kanji is '$1'.\n";
    }
}
close $fh;


This prints out
Your kanji is '#'.
Your kanji is '亜'.
Your kanji is '唖'.
Your kanji is '娃'.
Your kanji is '阿'.
Your kanji is '哀'.
Your kanji is '愛'.
Your kanji is '挨'.

It is also possible to use Unicode property classes, which match particular kinds of Unicode characters:

#!/usr/bin/perl
open my $fh, "<:encoding(euc-jp)", 'kanjidic'
    or die "Cannot open kanjidic: $!";
binmode STDOUT, ':utf8';
while (<$fh>) {
    if (/^(\p{InCJKUnifiedIdeographs})/) {
        print "Your kanji is '$1'.\n";
    }
}
close $fh;


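The \p{InCJKUnifiedIdeographs} class matches only characters in the CJK Unified Ideographs block of Unicode, so the line which begins with # is no longer matched. This prints out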
Your kanji is '亜'.
Your kanji is '唖'.
Your kanji is '娃'.
Your kanji is '阿'.
Your kanji is '哀'.
Your kanji is '愛'.
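Your kanji is '挨'.

This kind of matching is possible with byte strings, but it is much more complicated, because each kanji takes up more than one byte and the regular expression has to describe the byte sequences instead of the characters.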
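
Property classes can also be tried on strings held in the program. The sketch below uses \p{Han}, the Unicode script property for Chinese characters, in place of the block property used above, and some made-up sample lines rather than real kanjidic data. It prints code points instead of the kanji themselves so that no output layer is needed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample lines: one comment line and two lines which
# begin with a kanji (U+4E9C and U+611B).
my @lines = ('# comment', "\x{4E9C} first", "\x{611B} second");

my @kanji;
for my $line (@lines) {
    # \p{Han} matches one character of the Han script, so the
    # line which begins with '#' is skipped.
    if ($line =~ /^(\p{Han})/) {
        push @kanji, sprintf('U+%04X', ord($1));
    }
}
print "Matched: @kanji\n";
```

This should print "Matched: U+4E9C U+611B".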


Copyright © Ben Bullock 2009-2012. All rights reserved. For comments, questions, and corrections, please email Ben Bullock (ben.bullock@lemoda.net).