Example of Text::Fuzzy for fuzzy searches on kana

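This is an example of using Text::Fuzzy with Unicode strings. The program reads the kana readings from Edict, a freely downloadable Japanese-to-English electronic dictionary, converts them to katakana, and then looks for the dictionary reading nearest to each of several made-up katakana strings.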

#!/home/ben/software/install/bin/perl
use warnings;
use strict;
use Lingua::JA::Moji ':all';
use Text::Fuzzy;
use utf8;
binmode STDOUT, ":utf8";
my $infile = '/home/ben/data/edrdg/edict';
open my $in, "<:encoding(EUC-JP)", $infile or die $!;
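# Collect one katakana reading per dictionary entry.
my @kana;
while (<$in>) {
    my $kana;
    # An entry with kanji gives its reading in square brackets.
    if (/\[(\p{InKana}+)\]/) {
        $kana = $1;
    }
    # An entry without kanji begins with the kana itself.
    elsif (/^(\p{InKana}+)/) {
        $kana = $1;
    }
    if ($kana) {
        # Normalize to katakana so that hiragana and katakana
        # spellings compare equal.
        $kana = kana2katakana ($kana);
        push @kana, $kana;
    }
}
close $in or die $!;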
printf "Starting fuzzy searches over %d lines.\n", scalar @kana;
search ('ウオソウコ');
search ('アイウエオカキクケコバビブベボハヒフヘホ');
search ('アルベルトアインシュタイン');
search ('バババブ');
search ('バババブアルベルト');
exit;

sub search
{
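    my ($silly) = @_;
    # Ignore any dictionary entry more than $max edits away.
    my $max = 10;
    my $search = Text::Fuzzy->new ($silly, max => $max);
    # In scalar context, "nearest" returns the index of the closest
    # entry, or the undefined value if nothing is within $max edits.
    my $n = $search->nearest (\@kana);
    if (defined $n) {
        # Pass the strings as arguments so that a "%" in the data
        # cannot be mistaken for a format directive.
        printf "%s nearest is %s (distance %d)\n",
            $silly, $kana[$n], $search->last_distance ();
    }
    else {
        print "Nothing like '$silly' was found within $max edits.\n";
    }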
}
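
The reading-extraction step in the loop can be tried on its own, without the dictionary file. The following is a minimal sketch rather than the script's own method: it assumes Edict lines have the form "KANJI [KANA] /gloss/" or "KANA /gloss/", it uses an explicit character range as a rough stand-in for the InKana class that the script gets from Lingua::JA::Moji, and it uses tr in place of kana2katakana. The function name reading_of and the sample line are made up for illustration.

```perl
use strict;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: return the reading of one Edict-style line in
# katakana, or the undefined value if the line has no kana reading.
sub reading_of
{
    my ($line) = @_;
    # Rough stand-in for \p{InKana}: hiragana, katakana, and the long
    # vowel mark only.
    my $kana_run = qr/[ぁ-ゖァ-ヶー]+/;
    my $kana;
    if ($line =~ /\[($kana_run)\]/) {
        $kana = $1;
    }
    elsif ($line =~ /^($kana_run)/) {
        $kana = $1;
    }
    return unless defined $kana;
    # Stand-in for kana2katakana: map each hiragana character onto
    # the corresponding katakana character.
    $kana =~ tr/ぁ-ゖ/ァ-ヶ/;
    return $kana;
}

# Illustrative sample line, not copied from the dictionary file.
print reading_of ('日本語 [にほんご] /(n) Japanese (language)/'), "\n";
# prints ニホンゴ
```

Converting everything to one script before searching matters because an edit distance compares characters literally, so the hiragana and katakana spellings of the same reading would otherwise count as completely different strings.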


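The output of the program looks something like this. The sample shows only the first three searches, and it evidently comes from a variant of the script that also timed each search and worded the not-found message slightly differently: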

Starting fuzzy searches over 213359 lines.
ウオソウコ nearest is ウソウ (distance 2)
Fuzzy search took 0.173445 seconds.
Nothing like 'アイウエオカキクケコバビブベボハヒフヘホ' was found within the edit distance 10.
Fuzzy search took 0.128962 seconds.
アルベルトアインシュタイン nearest is リヒテンシュタイン (distance 7)
Fuzzy search took 0.199817 seconds.
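
The distances reported in this output can be checked by hand. The following is a minimal pure-Perl sketch of the Levenshtein edit distance, on the assumption that the script uses Text::Fuzzy's default settings, under which each insertion, deletion, or substitution of one character costs one edit. The function name edit_distance is made up for illustration, and this is not how Text::Fuzzy itself is implemented; it is only meant to show that the distance is counted in characters, not bytes.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical helper: Levenshtein distance between two strings,
# counted in characters.
sub edit_distance
{
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;
    # $prev[$j] holds the distance between the characters of $s
    # processed so far and the first $j characters of $t.
    my @prev = (0 .. scalar @t);
    for my $i (1 .. scalar @s) {
        my @cur = ($i);
        for my $j (1 .. scalar @t) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            # Match or substitute.
            my $best = $prev[$j - 1] + $cost;
            # Delete a character of $s.
            $best = $prev[$j] + 1 if $prev[$j] + 1 < $best;
            # Insert a character of $t.
            $best = $cur[$j - 1] + 1 if $cur[$j - 1] + 1 < $best;
            push @cur, $best;
        }
        @prev = @cur;
    }
    return $prev[-1];
}

printf "%d\n", edit_distance ('ウオソウコ', 'ウソウ');
# prints 2
printf "%d\n", edit_distance ('アルベルトアインシュタイン', 'リヒテンシュタイン');
# prints 7
```

In the first case, deleting オ and コ from ウオソウコ leaves ウソウ, which accounts for the distance of 2. In the second, the two strings share the ending ンシュタイン, and the seven remaining characters アルベルトアイ have nothing in common with the three characters リヒテ, so three substitutions and four deletions are needed.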

Copyright © Ben Bullock 2009-2023. All rights reserved. For comments, questions, and corrections, please email Ben Bullock (benkasminbullock@gmail.com) or use the discussion group at Google Groups.