Converting songs to romaji - example of Lingua::JA::Moji and Convert::Moji

This is an example script demonstrating the use of the Perl modules Lingua::JA::Moji and Convert::Moji, together with MeCab (Yet Another Part-of-Speech and Morphological Analyzer) and the Encode module, to create the romanized Japanese versions of the song translations on this web site.

The script below is the one I actually use. Table::Readable reads a list of data like

j: マンジューシャカ
r: manjushaka
e: White flower of heaven
and turns it into an array of hash references,
({j => 'マンジューシャカ',
  r => 'manjushaka',
  e => 'White flower of heaven'},
);
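
For readers who do not have Table::Readable to hand, the following is a minimal sketch of the same idea in plain Perl. It is not how Table::Readable is implemented, and it only handles single-line "key: value" pairs with entries separated by blank lines. The function name parse_entries and the second sample entry are made up for this illustration.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical helper, not part of Table::Readable: split the text
# into blank-line-separated entries and turn each "key: value" line
# into a hash field.
sub parse_entries
{
    my ($text) = @_;
    my @entries;
    for my $block (split /\n\s*\n/, $text) {
        my %entry;
        for my $line (split /\n/, $block) {
            if ($line =~ /^(\w+):\s*(.*?)\s*$/) {
                $entry{$1} = $2;
            }
        }
        # Skip blocks that contained no "key: value" lines.
        push @entries, \%entry if %entry;
    }
    return @entries;
}

# The second entry is invented to show the blank-line separator.
my $text = <<'EOF';
j: マンジューシャカ
r: manjushaka
e: White flower of heaven

j: 一人
e: alone
EOF

my @lyrics = parse_entries ($text);
binmode STDOUT, ":encoding(utf8)";
print scalar (@lyrics), " entries; first romaji is $lyrics[0]{r}\n";
```

Each element of @lyrics is a hash reference, so the fields are read as $lyrics[0]{j}, $lyrics[0]{r} and so on, which is how the script below uses the result of read_table.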

I usually run this script as follows:

~/oneoff/make-romaji/make-romaji.pl lyrics-in.txt > lyrics.txt

The output of the script is not perfect. MeCab splits things like -te form endings off verbs, and for some reason it turns 一人 into ichi nin rather than hitori, so the result needs a little work with a text editor.

The script exchanges text with MeCab through temporary files, using the encoding named in $encoding, which has to match the character set of the MeCab dictionary. When I wrote this I used EUC-JP (see Encodings of Japanese - EUC-JP) because it was the path of least resistance for me; the listing below has $encoding set to utf8.
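
To make the parsing step concrete, here is a small sketch of how one line of MeCab output can be picked apart. It assumes the IPA dictionary layout, in which the surface form is followed by a tab and nine comma-separated fields with the katakana pronunciation last. The sample lines are typed in by hand in that layout rather than captured from a MeCab run, and other dictionaries may order the fields differently.

```perl
use strict;
use warnings;
use utf8;

# Sample lines typed by hand in the IPA dictionary layout (an
# assumption; they are not captured from a real MeCab run).
my @mecab_lines = (
    "すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ",
    "も\t助詞,係助詞,*,*,*,*,も,モ,モ",
    "EOS",
);

my %reading;
for my $line (@mecab_lines) {
    # The word, then eight comma-terminated fields, then a final
    # field consisting only of katakana.
    if ($line =~ /^(\w+)\s*(?:[^,]*,){8}([ァ-ンー]+)\s*$/) {
        $reading{$1} = $2;
    }
}

binmode STDOUT, ":encoding(utf8)";
for my $word (sort keys %reading) {
    print "$word => $reading{$word}\n";
}
```

Lines that do not fit the pattern, such as the EOS marker that ends each sentence, are simply skipped, so only words with a katakana reading end up in the table.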

#!/home/ben/software/install/bin/perl

#                  _                                         _ _ 
#  _ __ ___   __ _| | _____       _ __ ___  _ __ ___   __ _ (_|_)
# | '_ ` _ \ / _` | |/ / _ \_____| '__/ _ \| '_ ` _ \ / _` || | |
# | | | | | | (_| |   <  __/_____| | | (_) | | | | | | (_| || | |
# |_| |_| |_|\__,_|_|\_\___|     |_|  \___/|_| |_| |_|\__,_|/ |_|
#                                                         |__/   

# Make a romaji version of a song.


use warnings;
use strict;
use autodie;
use utf8;
use Convert::Moji qw/make_regex/;
use Lingua::JA::Moji qw/kana2romaji/;
use Table::Readable qw/read_table/;
binmode STDOUT, ":encoding(utf8)";
my $input_file = $ARGV[0];
die "Usage: $0 lyrics-in.txt > lyrics.txt\n" unless defined $input_file;
my @lyrics = read_table ($input_file);
my $mecab_input = "mecab-tmp.txt";
my $mecab_output = "mecab-out-tmp.txt";
my $encoding = 'utf8';
open my $mecab_temp, ">:encoding($encoding)", $mecab_input; 
for my $lyric (@lyrics) {
    print $mecab_temp $lyric->{j}, "\n";
}
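# autodie does not check system by default, so test the exit status.
system ("mecab $mecab_input > $mecab_output") == 0
    or die "mecab failed with status $?";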
unlink $mecab_input;
open my $mecab_out_tmp, "<:encoding($encoding)", $mecab_output;
my %kanji2romaji;
while (<$mecab_out_tmp>) {
    if (/^(\w+)\s*(?:[^,]*,){8}([ァ-ンー]+)\s*$/) {
#        print "$1, $2\n";
        $kanji2romaji{$1} = kana2romaji ($2, {style => "hepburn", ve_type => "wapuro"});
    }
    else {
#        print "$_ doesn't match your regex, fool.\n";
    }
}
unlink $mecab_output;
my $regex = make_regex (keys %kanji2romaji);
for my $lyric (@lyrics) {
    my $romaji = $lyric->{j};
    $romaji =~ s/$regex/ $kanji2romaji{$1}/g;
    $romaji =~ s/ッ t([ae])/tt$1/g;
    $romaji =~ s/\s+/ /g;
    $romaji =~ s/^\s+//;
    print "j: $lyric->{j}\n";
    print "r: $romaji\n";
    print "e: $lyric->{e}\n";
    if ($lyric->{other}) {
        print "other: $lyric->{other}\n";
    }
    print "\n";
}
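
The replacement loop depends on trying longer words before shorter ones, so that a word is not broken up by a shorter word it contains. The sketch below shows that idea in plain Perl with a hand-filled lookup table. The helper longest_first_regex is a hypothetical stand-in written for this example, not the implementation of Convert::Moji's make_regex, and the readings are entered by hand rather than produced by MeCab.

```perl
use strict;
use warnings;
use utf8;

# Hand-filled table standing in for %kanji2romaji in the script.
my %word2romaji = (
    '花' => 'hana',
    '花火' => 'hanabi',
    'の' => 'no',
    '夜' => 'yoru',
);

# Hypothetical stand-in for make_regex: quote each key and join the
# keys into one capturing alternation, longest key first.
sub longest_first_regex
{
    my @words = sort { length ($b) <=> length ($a) } @_;
    my $alternatives = join '|', map { quotemeta } @words;
    return qr/($alternatives)/;
}

my $regex = longest_first_regex (keys %word2romaji);
my $line = '花火の夜';
# Replace each known word with a space followed by its romaji.
(my $romaji = $line) =~ s/$regex/ $word2romaji{$1}/g;
# Tidy the spacing in the same way as the script.
$romaji =~ s/\s+/ /g;
$romaji =~ s/^\s+//;
print "$romaji\n"; # prints "hanabi no yoru"
```

If the alternatives were not sorted by length, 花 could match first inside 花火 and leave 火 unconverted, which is why the longest keys go at the front of the alternation.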

Copyright © Ben Bullock 2009-2017. All rights reserved. For comments, questions, and corrections, please email Ben Bullock (benkasminbullock@gmail.com).