This is a tutorial on using Unicode in Perl.
Strings in Perl can be one of two types. They can be either strings of bytes, or strings of Unicode characters.
In the following program, $x is a string of bytes.
    #!/usr/bin/perl
    my $x = '☺';
    print length $x, " $x\n";
This prints out
    3 ☺
The length of $x is 3.
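To see where the 3 comes from, the bytes of the string can be listed one by one. The following is a minimal sketch. The helper name hex_bytes is made up for this illustration, and the string is written with \x escapes (the UTF-8 bytes of ☺, which is U+263A) so that the result does not depend on how the program file is saved.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the bytes of a byte string in hexadecimal.
sub hex_bytes {
    my ($string) = @_;
    return join ' ', map { sprintf '%02X', ord $_ } split //, $string;
}

# The three UTF-8 bytes which encode the character U+263A.
my $x = "\xE2\x98\xBA";
print length ($x), ": ", hex_bytes ($x), "\n";
```

This should print "3: E2 98 BA", one entry for each byte that length counted.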
In the following program, $x is a string of Unicode characters.
    #!/usr/bin/perl
    use utf8;
    my $x = '☺';
    print length $x, " $x\n";
The line use utf8; tells Perl that the program's source text is in the UTF-8 format, so the string literal is read as Unicode characters.
This prints out
    Wide character in print at use-utf8.pl line 4.
    1 ☺
The length of $x is 1.
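With use utf8; in effect, the string holds a single character, and ord gives its Unicode code point. The following is a small sketch of that. The helper name code_points is hypothetical, and the program file is assumed to be saved as UTF-8.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;    # the source text of this program is UTF-8

# Hypothetical helper: list the code points of a string of characters.
sub code_points {
    my ($string) = @_;
    return join ' ', map { sprintf 'U+%04X', ord $_ } split //, $string;
}

my $x = '☺';
# Only ASCII text is printed here, so no "Wide character" warning occurs.
print length ($x), ": ", code_points ($x), "\n";
```

If the file is saved as UTF-8, this should print "1: U+263A".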
The Encode module converts between strings of Unicode characters and strings of bytes.
    #!/usr/bin/perl
    use Encode qw/encode_utf8 decode_utf8/;
    my $bytes = '☺';
    my $unicode = decode_utf8 ($bytes);
    if ($unicode eq $bytes) {
        print "Same.\n";
    }
    else {
        print "Different.\n";
    }
    my $bytes_again = encode_utf8 ($unicode);
    if ($bytes_again eq $bytes) {
        print "Same.\n";
    }
    else {
        print "Different.\n";
    }
This prints out
    Different.
    Same.
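The difference between the two kinds of string also shows up in their lengths. The following sketch runs a byte string through decode_utf8 and encode_utf8 and reports the length at each stage. The helper name round_trip_lengths is made up for this example, and the input is written with \x escapes so that it is a byte string however the file is saved.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw/encode_utf8 decode_utf8/;

# Hypothetical helper: return the lengths of a byte string, of the
# character string decoded from it, and of the bytes encoded again.
sub round_trip_lengths {
    my ($bytes) = @_;
    my $unicode = decode_utf8 ($bytes);
    my $bytes_again = encode_utf8 ($unicode);
    return (length ($bytes), length ($unicode), length ($bytes_again));
}

# The UTF-8 bytes of the character U+263A.
my @lengths = round_trip_lengths ("\xE2\x98\xBA");
print "@lengths\n";
```

This should print "3 1 3": three bytes, one character, then three bytes again.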
It is necessary to tell Perl how to deal with Unicode characters when printing them:
    #!/usr/bin/perl
    use utf8;
    binmode STDOUT, ":utf8";
    my $x = '☺';
    print "$x\n";
The message "Wide character in print" appeared when the second program was run because Perl had not been told how to encode the Unicode character for output. When that happens, Perl prints the character in the UTF-8 format and prints a warning message.
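What the output layer does is turn each character into bytes on its way out. The following sketch makes that visible by printing into an in-memory file instead of STDOUT and then looking at what was written. It uses the :encoding(UTF-8) layer, a stricter relative of :utf8, and the helper name encoded_output is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: print a string of characters through an encoding
# layer into an in-memory file, and return the bytes which were written.
sub encoded_output {
    my ($text) = @_;
    my $buffer = '';
    open my $out, '>:encoding(UTF-8)', \$buffer
        or die "Cannot open in-memory file: $!";
    print {$out} $text;
    close $out or die "Cannot close in-memory file: $!";
    return $buffer;
}

# One character, U+263A, goes in.
my $bytes = encoded_output ("\x{263A}");
print length ($bytes), "\n";
```

One character goes in and three bytes come out, with no warning, because the layer says how the character is to be encoded.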
Perl can transform encodings when reading and writing files. For example, the "kanjidic" file is in the EUC-JP encoding. To read it into Perl and treat each kanji character as a Unicode character, open it with an encoding layer:
    #!/usr/bin/perl
    open my $fh, "<:encoding(euc-jp)", 'kanjidic'
        or die "Cannot open kanjidic: $!";
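The effect of the encoding layer can be tried without the kanjidic file by reading from an in-memory file. In the following sketch, the two bytes 0xB0 0xA1 are the EUC-JP encoding of the kanji 亜 (U+4E9C). The helper name first_code_point is made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read the first line of some EUC-JP encoded data
# and return the code point of the first character of that line.
sub first_code_point {
    my ($euc_jp_bytes) = @_;
    open my $fh, '<:encoding(euc-jp)', \$euc_jp_bytes
        or die "Cannot open in-memory file: $!";
    my $line = <$fh>;
    close $fh;
    return sprintf 'U+%04X', ord $line;
}

# 0xB0 0xA1 is the EUC-JP encoding of the kanji U+4E9C.
print first_code_point ("\xB0\xA1\n"), "\n";
```

This should print "U+4E9C": the two bytes of the file have been read in as one Unicode character.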
The advantage of using Unicode strings rather than byte strings shows up when processing non-English text. For example, in a regular expression, a dot (.) can match a whole kanji character rather than a single byte of one.
    #!/usr/bin/perl
    open my $fh, "<:encoding(euc-jp)", 'kanjidic'
        or die "Cannot open kanjidic: $!";
    binmode STDOUT, ':utf8';
    while (<$fh>) {
        if (/^(.)/) {
            print "Your kanji is '$1'.\n";
        }
    }
    close $fh;
This prints out
    Your kanji is '#'.
    Your kanji is '亜'.
    Your kanji is '唖'.
    Your kanji is '娃'.
    Your kanji is '阿'.
    Your kanji is '哀'.
    Your kanji is '愛'.
    Your kanji is '挨'.
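For comparison, the same dot behaves differently on a byte string. The following sketch counts how many times the dot matches in the EUC-JP bytes of one kanji and in the decoded character. The helper name count_dot_matches is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw/decode/;

# Hypothetical helper: count how many times the dot matches in a string.
sub count_dot_matches {
    my ($string) = @_;
    my $count = () = $string =~ /./g;
    return $count;
}

my $bytes = "\xB0\xA1";                   # the kanji U+4E9C as EUC-JP bytes
my $chars = decode ('euc-jp', $bytes);    # the same kanji as one character
print count_dot_matches ($bytes), " ", count_dot_matches ($chars), "\n";
```

This should print "2 1": in the byte string the dot matches each byte of the kanji separately, and in the Unicode string it matches the kanji once.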
It is also possible to use special character classes that match certain types of Unicode characters:
    #!/usr/bin/perl
    open my $fh, "<:encoding(euc-jp)", 'kanjidic'
        or die "Cannot open kanjidic: $!";
    binmode STDOUT, ':utf8';
    while (<$fh>) {
        if (/^(\p{InCJKUnifiedIdeographs})/) {
            print "Your kanji is '$1'.\n";
        }
    }
    close $fh;
The \p{InCJKUnifiedIdeographs} matches only characters in the CJK Unified Ideographs block, so the # is not matched. It prints out
    Your kanji is '亜'.
    Your kanji is '唖'.
    Your kanji is '娃'.
    Your kanji is '阿'.
    Your kanji is '哀'.
    Your kanji is '愛'.
    Your kanji is '挨'.
This kind of matching is possible with byte strings, but it is much more complicated.
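The character class can also be tried on a few strings directly, without the kanjidic file. The following is a minimal sketch. The helper name starts_with_kanji is made up for this example, and the test strings are written with \x{...} escapes so that they are Unicode strings however the file is saved.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return 1 if the line begins with a character in
# the CJK Unified Ideographs block, and 0 otherwise.
sub starts_with_kanji {
    my ($line) = @_;
    return $line =~ /^\p{InCJKUnifiedIdeographs}/ ? 1 : 0;
}

# A comment line, a line starting with a kanji (U+4E9C), and a line
# starting with a hiragana character (U+3042).
for my $line ("# comment", "\x{4E9C} a", "\x{3042} a") {
    print starts_with_kanji ($line), "\n";
}
```

Only the second line should give 1. The hiragana character is a Japanese character too, but it lies outside the CJK Unified Ideographs block, so it is rejected along with the #.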