Converting songs to romaji - example of Lingua::JA::Moji and Convert::Moji
This is an example script demonstrating the use of the Perl modules Lingua::JA::Moji and Convert::Moji, together with MeCab (Yet Another Part-of-Speech and Morphological Analyzer) and the Encode module, to create the romanized Japanese versions of the song translations on this web site.
This is the actual script. Table::Readable (available on GitHub) reads a list of data like
j: マンジューシャカ
r: manjushaka
e: White flower of heaven
and turns it into an array of hash references,
({j => 'マンジューシャカ', r => 'manjushaka', e => 'White flower of heaven'}, );
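To make the data format concrete, here is a minimal sketch in core Perl of how such a file could be turned into an array of hash references. It is not Table::Readable's implementation: `parse_entries` is a hypothetical helper that handles only simple `key: value` lines in entries separated by blank lines, and the second entry in the sample text is made up for the illustration.

```perl
use strict;
use warnings;
use utf8;

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper, not part of Table::Readable: split the text into
# blank-line-separated entries and read each "key: value" line into a hash.
sub parse_entries
{
    my ($text) = @_;
    my @entries;
    for my $block (split /\n\s*\n/, $text) {
        my %entry;
        for my $line (split /\n/, $block) {
            if ($line =~ /^(\w+):\s*(.*?)\s*$/) {
                $entry{$1} = $2;
            }
        }
        push @entries, \%entry if %entry;
    }
    return @entries;
}

# The first entry is the one from the text above; the second is made up.
my $text = <<'EOF';
j: マンジューシャカ
r: manjushaka
e: White flower of heaven

j: 花
r: hana
e: flower
EOF

my @lyrics = parse_entries ($text);
for my $lyric (@lyrics) {
    print "$lyric->{j} / $lyric->{r} / $lyric->{e}\n";
}
```

In the real script, `read_table` from Table::Readable does this job and is given a file name.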
I usually run this script as follows:
~/oneoff/make-romaji/make-romaji.pl lyrics-in.txt > lyrics.txt
The output of the script is not perfect. MeCab splits things like -te form endings off verbs, so the result needs a little work with a text editor, and for some reason MeCab also turns 一人 into "ichi nin" rather than "hitori", and so on. The script sends input to MeCab and reads its output using the EUC-JP encoding (see Encodings of Japanese - EUC-JP). (There is probably an option in MeCab to use Unicode, but this was the path of least resistance for me.)
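The script picks the surface form and the reading out of each line of MeCab output with a single regular expression. The sketch below applies that same pattern to hand-written sample lines rather than to real MeCab output. It assumes the IPADIC dictionary layout, where each line is the surface form, a tab, and then nine comma-separated fields with the pronunciation in katakana last.

```perl
use strict;
use warnings;
use utf8;

binmode STDOUT, ':encoding(UTF-8)';

# Hand-written sample lines in the assumed IPADIC layout:
# surface<TAB>pos,pos1,pos2,pos3,conj-type,conj-form,base,reading,pronunciation
my @mecab_lines = (
    "花\t名詞,一般,*,*,*,*,花,ハナ,ハナ",
    "EOS",
);

my %surface2kana;
for my $line (@mecab_lines) {
    # Same pattern as in the script: skip eight comma-terminated fields,
    # then capture a final field consisting only of katakana.
    if ($line =~ /^(\w+)\s*(?:[^,]*,){8}([ア-ンー]+)\s*$/) {
        $surface2kana{$1} = $2;
    }
}

for my $surface (sort keys %surface2kana) {
    print "$surface => $surface2kana{$surface}\n";
}
```

Lines that do not fit the pattern, such as the `EOS` marker MeCab prints at the end of each sentence, are skipped without complaint.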
#!/home/ben/software/install/bin/perl

# make-romaji
# Make a romaji version of a song.

use warnings;
use strict;
use autodie;
use utf8;
use Convert::Moji qw/make_regex/;
use Lingua::JA::Moji qw/kana2romaji/;
use Table::Readable qw/read_table/;

binmode STDOUT, "utf8";

my $input_file = $ARGV[0];
my @lyrics = read_table ($input_file);
my $mecab_input = "mecab-tmp.txt";
my $mecab_output = "mecab-out-tmp.txt";
my $encoding = 'utf8';

# Write the Japanese lines to a temporary file and run MeCab over it.
open my $mecab_temp, ">:encoding($encoding)", $mecab_input;
for my $lyric (@lyrics) {
    print $mecab_temp $lyric->{j}, "\n";
}
system ("mecab $mecab_input > $mecab_output");
unlink $mecab_input;

# Build a table from each word MeCab found to the romaji of its reading.
open my $mecab_out_tmp, "<:encoding($encoding)", $mecab_output;
my %kanji2romaji;
while (<$mecab_out_tmp>) {
    if (/^(\w+)\s*(?:[^,]*,){8}([ア-ンー]+)\s*$/) {
        # print "$1, $2\n";
        $kanji2romaji{$1} = kana2romaji ($2, {style => "hepburn", ve_type => "wapuro"});
    }
    else {
        # print "$_ doesn't match your regex, fool.\n";
    }
}
unlink $mecab_output;

# Replace each word in the lyrics with its romaji, longest match first.
my $regex = make_regex (keys %kanji2romaji);
for my $lyric (@lyrics) {
    my $romaji = $lyric->{j};
    # The explicit parentheses make sure $1 is set to the matched word.
    $romaji =~ s/($regex)/ $kanji2romaji{$1}/g;
    $romaji =~ s/ッ t([ae])/tt$1/g;
    $romaji =~ s/\s+/ /g;
    $romaji =~ s/^\s+//;
    print "j: $lyric->{j}\n";
    print "r: $romaji\n";
    print "e: $lyric->{e}\n";
    if ($lyric->{other}) {
        print "other: $lyric->{other}\n";
    }
    print "\n";
}
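The final substitution step can be tried out in isolation with core Perl only. The sketch below is an illustration, not Convert::Moji's code: the `join`/`sort`/`quotemeta` line is a stand-in for `make_regex`, assumed here to build an alternation with the longest keys first, and the small table is made up rather than produced by MeCab.

```perl
use strict;
use warnings;
use utf8;

binmode STDOUT, ':encoding(UTF-8)';

# Made-up table standing in for %kanji2romaji in the script.
my %kanji2romaji = (
    '花'   => 'hana',
    '花火' => 'hanabi',
    'と'   => 'to',
);

# Stand-in for make_regex (an assumption about its behaviour): quote
# each key and put the longest keys first, so 花火 is tried before 花.
my $regex = join '|',
    map { quotemeta ($_) }
    sort { length ($b) <=> length ($a) } keys %kanji2romaji;

my $romaji = '花火と花';
# Put a space before each replacement, then tidy up the whitespace,
# as the script does.
$romaji =~ s/($regex)/ $kanji2romaji{$1}/g;
$romaji =~ s/\s+/ /g;
$romaji =~ s/^\s+//;

print "$romaji\n";
# prints "hanabi to hana"
```

Sorting by length matters here: if 花 were tried before 花火, the first word would come out as "hana" followed by an unconverted 火.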