A Hybrid Statistical and Rule-based Approach to Extremely Low-resource Machine Transliteration
Work on machine transliteration has focused primarily on languages with large parallel corpora and on language pairs whose orthographies differ greatly. In contrast, a large proportion of the world’s languages have vastly fewer resources and use Roman-like alphabets, often with a large degree of orthographic overlap with high-resource languages. We propose that machine transliteration between languages with few training examples can be accomplished by a noisy-channel-like statistical model captured in a human-editable format with practical rule-based capabilities built in. This hybrid approach lets users rely on an algorithm to find and apply common transformations in context while giving them rigorous control over the output. Effectiveness is evaluated on the Bible names translation matrix dataset of Wu et al. (2018), which covers 591 languages with an average of 590 names per language pair. Our approach slightly exceeds past results and explores several features aimed at the extremely low-resource language domain.
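To make the hybrid idea concrete, the following toy sketch shows one way a frequency-derived substitution table could be combined with hand-written overrides. It is an illustration under strong simplifying assumptions, not the model described in this paper: it is context-free, it only uses name pairs of equal length instead of performing real alignment, and the function names (`learn_rules`, `transliterate`) and the example name pairs are hypothetical and not drawn from the dataset.

```python
from collections import Counter, defaultdict


def learn_rules(pairs):
    """Learn a character substitution table from (source, target) name pairs.

    Simplifying assumption: only equal-length pairs are used, so characters
    can be matched by position instead of by a proper alignment step.
    """
    counts = defaultdict(Counter)
    for src, tgt in pairs:
        if len(src) != len(tgt):
            continue  # a real system would align these pairs instead
        for s, t in zip(src, tgt):
            counts[s][t] += 1
    # Keep the most frequent target character for each source character.
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}


def transliterate(name, learned, overrides=None):
    """Apply learned rules; hand-edited overrides take precedence."""
    rules = {**learned, **(overrides or {})}
    # Characters without a rule are copied unchanged, which suits language
    # pairs with heavy orthographic overlap.
    return "".join(rules.get(ch, ch) for ch in name)


# Hypothetical toy training pairs, invented for illustration only.
toy_pairs = [("maria", "marja"), ("david", "dawid"), ("levi", "lewi")]
learned = learn_rules(toy_pairs)

print(transliterate("vivian", learned))                        # learned v -> w
print(transliterate("vivian", learned, overrides={"v": "v"}))  # rule edited by hand
```

Because the learned table is a plain mapping, it could be written to a text file, inspected, and corrected by hand, which is the kind of user control the hybrid approach aims for.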