Using Unicode in a Perl CGI script

Output Unicode

To use Unicode in a Perl CGI (Common Gateway Interface) program, the most convenient encoding for the data is UTF-8. The Content-Type header which the CGI program prints should then take the form

Content-Type: text/html; charset=UTF-8

With the CGI module from CPAN, this header may be obtained by using the option -charset when printing the header:

print header (-charset => 'UTF-8');

Alternatively, add the following to the program's HTML output:

<meta http-equiv="Content-Type" content="text/html;charset=utf-8">

This tells the browser to treat the page as though it had been sent with a Content-Type header as shown above, so the encoding is declared in the HTML itself rather than via the actual header which the CGI program prints.

To print out Unicode characters which Perl holds as character strings (rather than as already-encoded bytes), add a UTF-8 output layer to STDOUT:

binmode STDOUT, ":utf8";

Without this, Perl will print warnings of the form "Wide character in print", which usually go to the error log file.
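
As a rough illustration of what the output layer does, the following sketch encodes a string by hand with encode_utf8 from the core Encode module. The variable names are made up for the example. The layer performs the same conversion from characters to bytes automatically on every print.

```perl
use strict;
use warnings;
use Encode 'encode_utf8';

# One Unicode character, HIRAGANA LETTER A (U+3042).
my $char = "\x{3042}";

# The same character as a string of UTF-8 bytes.
my $bytes = encode_utf8 ($char);

# prints: 1 character, 3 bytes
printf "%d character, %d bytes\n", length ($char), length ($bytes);
```

Printing $char to a filehandle with no encoding layer is what triggers the "Wide character" warning; printing $bytes does not.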

If the source code of the CGI program itself contains Unicode characters, add use utf8; to the program. This tells Perl that the program text is encoded in UTF-8 and may contain non-ASCII characters.
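
Putting the pieces of this section together, a minimal sketch of a complete script might look like the following. It writes the header by hand rather than through the CGI module, and page_body is a hypothetical helper name invented for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;    # this source file is itself saved as UTF-8

# Hypothetical helper: build the HTML page as a character string.
sub page_body {
    my ($greeting) = @_;
    return <<"EOF";
<!DOCTYPE html>
<html>
<head><title>Unicode test</title></head>
<body><p>$greeting</p></body>
</html>
EOF
}

# Encode characters as UTF-8 bytes on the way out.
binmode STDOUT, ":utf8";

# The header, followed by a blank line, then the page itself.
print "Content-Type: text/html; charset=UTF-8\n\n";
my $page = page_body ("こんにちは, żółw, ☺");
print $page;
```

The file must actually be saved as UTF-8 for use utf8; to have the intended effect.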

Input Unicode

If you have a CGI script using the GET method, where the input comes from the query string itself (as found in $ENV{QUERY_STRING}), and the input contains Unicode in the form of percent-encoded UTF-8 bytes like "input=%E1%B3%99", you can decode it using

use URI::Escape;
use Encode 'decode_utf8';
my $query_string = $ENV{QUERY_STRING};
$query_string = uri_unescape ($query_string);
$query_string = decode_utf8 ($query_string);

In practice it is necessary to split the query string into its parameters before unescaping it. The percent-encoded parts of the query string may themselves contain equals signs or ampersands (as %3D or %26), and once these are unescaped it is impossible to tell them apart from the delimiters which separate the form parameters.
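
One way to do the splitting first is sketched below. The names parse_query and unescape_bytes are hypothetical helpers invented for this example, and the unescaping is done with a small regular expression so that the sketch needs only core modules; uri_unescape from URI::Escape would do the same job. Repeated parameter names are not handled: the last value wins.

```perl
use strict;
use warnings;
use Encode 'decode_utf8';

# Hypothetical helper: turn "+" and %XX escapes back into raw bytes.
sub unescape_bytes {
    my ($text) = @_;
    $text =~ tr/+/ /;    # form encoding uses "+" for a space
    $text =~ s/%([0-9A-Fa-f]{2})/chr (hex ($1))/ge;
    return $text;
}

# Hypothetical helper: return a hash of decoded parameters.
sub parse_query {
    my ($query_string) = @_;
    my %params;
    # Split on the raw delimiters while they are still unambiguous.
    for my $pair (split /[&;]/, $query_string) {
        my ($key, $value) = split /=/, $pair, 2;
        next unless defined $key && length $key;
        $value = '' unless defined $value;
        # Only now unescape, then decode the UTF-8 bytes to characters.
        $key = decode_utf8 (unescape_bytes ($key));
        $params{$key} = decode_utf8 (unescape_bytes ($value));
    }
    return %params;
}

my %params = parse_query ($ENV{QUERY_STRING} // '');
```

With this ordering, a value such as "a%26b%3Dc" comes out as the single string "a&b=c" instead of being split into further parameters.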

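Using the CGI module, which does the parsing and unescaping itself, a single parameter can be read and then decoded like this:

use CGI;
use Encode 'decode_utf8';
my $cgi = CGI->new ();
# "param" returns the unescaped value as UTF-8 bytes.
my $value = $cgi->param ('input');
# Convert the bytes into Perl characters.
$value = decode_utf8 ($value);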

This can be simplified using the -utf8 option to CGI:

use CGI '-utf8';
my $cgi = CGI->new ();
# With -utf8, "param" returns an already-decoded character string.
my $value = $cgi->param ('input');

(According to CGI's documentation, the -utf8 option may cause problems with POST requests containing binary file uploads.)
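
As a sketch of how the input and output halves fit together, the following script echoes the "input" parameter back to the browser. It assumes the CGI module is installed, and reply_page is a hypothetical helper name invented for this example. The value is passed through escapeHTML before being printed, since it comes from the user.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use CGI '-utf8';

# Hypothetical helper: build the header and page for one request.
sub reply_page {
    my ($cgi) = @_;
    # Already decoded to characters because of -utf8.
    my $value = $cgi->param ('input');
    $value = '' unless defined $value;
    # Escape HTML special characters before echoing user input.
    my $safe = $cgi->escapeHTML ($value);
    return $cgi->header (-charset => 'UTF-8')
        . "<!DOCTYPE html>\n<html><body><p>You sent: $safe</p></body></html>\n";
}

# Encode characters as UTF-8 bytes on the way out.
binmode STDOUT, ":utf8";
my $reply = reply_page (CGI->new ());
print $reply;
```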


Copyright © Ben Bullock 2009-2023. All rights reserved. For comments, questions, and corrections, please email Ben Bullock (benkasminbullock@gmail.com) or use the discussion group at Google Groups.