Switching Win32::OLE to UTF-8

By default, Win32::OLE passes strings in the native code page of the computer. This means that every text string has to be encoded and decoded as it travels between Perl and Microsoft Word, Excel, and other OLE programs. The alternative is for the Perl programmer to give up Perl's built-in Unicode and work in the local "code page" instead.

Microsoft's Japanese encoding, Code Page 932, a variant of Shift-JIS, causes many problems. For example, the double-byte space, Unicode 0x3000, is encoded as the two bytes 0x81 0x40, and 0x40 is the ASCII @ mark. An @ hidden in the second byte like this causes hard-to-find errors, since the character looks like an ordinary space in a text editor. Because of this, I never use the "code pages" in a Perl program, but insist on using Perl's native Unicode encoding. However, this makes it necessary to convert every single piece of text to and from the native code page when interacting with Microsoft Office via OLE.
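The hidden @ can be checked with a few lines of Perl. This is only a minimal sketch, assuming the core Encode module and its "cp932" encoding name. It does not touch Win32::OLE, and the variable names are my own.

```perl
use strict;
use warnings;
use Encode 'encode';

# U+3000 IDEOGRAPHIC SPACE, the double-byte space of Japanese text.
my $space = "\x{3000}";
# Encode the character into Code Page 932 bytes.
my $bytes = encode ('cp932', $space);
# Format each byte as two hexadecimal digits.
my $hex = join ' ', map { sprintf '%02X', ord ($_) } split //, $bytes;
print "$hex\n";
# Look for an ASCII "@" (0x40) among the encoded bytes.
my $has_at = index ($bytes, '@') >= 0 ? 'yes' : 'no';
print "Contains \@: $has_at\n";
```

The first line printed should be "81 40", where the trailing 40 is the @ mark.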

The following example program shows how to switch Win32::OLE from using the local code page of the computer to UTF-8. This has many advantages. For example, Win32::OLE automatically converts strings of bytes into strings of Unicode characters.

Note: This example program is meant to be run from a Cygwin terminal.

#! perl
use warnings;
use strict;
# Get the constant.
use Win32::OLE 'CP_UTF8';
# Set the code page of Win32::OLE.
$Win32::OLE::CP = CP_UTF8;
# This fixes the output for Cygwin on a Japanese environment.  Change
# "CP932" to your code page.
binmode STDOUT, ':encoding(CP932)';
# The test Microsoft Word file.
my $filename='C:\\Users\\ben\\Desktop\\test.doc';
# Start Microsoft Word. The 'Quit' makes Word stop when execution has
# finished.
my $word = Win32::OLE->new ('Word.Application', 'Quit') or
    die "Could not start Word: ". Win32::OLE->LastError ()."\n";
# Make sure it is visible (appears on the desktop).
$word->{Visible} = 1;
# Open the document. Stop here if it cannot be opened, since the
# calls below need a valid document object.
my $document = $word->Documents->Open ($filename);
if (!$document) {
    die "Could not open document: ". Win32::OLE->LastError ()."\n";
}
# Get the document's text and print it. The text is processed into
# Perl's internal representation of Unicode automatically. That does
# not happen if you do not use CP_UTF8.
my $text = $document->Range->Text;
# Change vertical whitespace into Unix newlines for printing to Cygwin.
$text =~ s/\v/\n/g;
print $text;

To test this, make a Microsoft Word file of your choice containing some Unicode characters, and point the above script at it. Switching the code page like this also lets you use file names in Perl's native encoding, so even if your file name contains non-ASCII characters, you don't have to pass it through Encode::encode or Encode::decode.
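For comparison, here is a rough sketch of the manual conversion that would otherwise be needed with the default code page, assuming a CP932 machine. The file name is hypothetical, and no OLE call is made: the encoded bytes simply stand in for what would be sent to, and returned from, an OLE program.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical file name whose stem is three katakana characters,
# written with escapes so that this file itself stays pure ASCII.
my $filename = "C:\\Users\\ben\\Desktop\\\x{30C6}\x{30B9}\x{30C8}.doc";
# Without CP_UTF8, the name has to become CP932 bytes before it is
# handed to OLE.
my $native_name = encode ('cp932', $filename);
# Bytes coming back from OLE have to be decoded into Perl characters.
# Here the encoded name stands in for such returned bytes.
my $round_trip = decode ('cp932', $native_name);
print length ($filename), " characters became ",
    length ($native_name), " bytes\n";
print "Round trip OK\n" if $round_trip eq $filename;
```

With $Win32::OLE::CP set to CP_UTF8, neither conversion step should be necessary.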