<<Taikichiro Mori Memorial Research Fund>>
Graduate Student Researcher Development Grant Report
Research Project: Japanese-Vietnamese parallel corpus construction and its application
Project Researcher: Vo Ho Bao Khanh
Research Contents
The research project has two main tasks. The first task is building a
Japanese-Vietnamese parallel corpus, which is the first parallel corpus of
these two languages. The second task concerns detecting corresponding compound
words in Japanese-Vietnamese bilingual texts, which can be applied in other
research such as co-reference resolution, machine translation, and bilingual
information retrieval.
A parallel bilingual corpus is a large-scale collection of articles, paragraphs,
or sentences in which each original text in the source language is paired with
its translation in the target language. The original data, which can be
obtained from newspaper articles or journals, will be translated by skilled
translators. The Japanese-Vietnamese corpus can also be collected directly from
dictionary example data. After obtaining the raw parallel bilingual corpus, I
will perform the main task of corpus construction: annotating the corpus with
the necessary part-of-speech tags so that it can be exploited for other
specific tasks.
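As a rough illustration of what a part-of-speech-annotated sentence pair could
look like, the sketch below stores each side as a list of tagged tokens and
serializes it in a simple word/TAG format. The class names, the tag labels
(PRON, NOUN, and so on), the underscore convention for multi-syllable
Vietnamese words, and the example sentence are all assumptions made for
illustration; they are not the annotation scheme used in this project.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Token:
    surface: str  # word form; multi-syllable Vietnamese words joined with "_"
    pos: str      # illustrative tag label, not the project's actual tag set


@dataclass
class SentencePair:
    ja: List[Token]  # Japanese side of the pair
    vi: List[Token]  # Vietnamese side of the pair


def format_side(tokens: List[Token]) -> str:
    """Serialize one side of a pair as space-separated word/TAG items."""
    return " ".join(f"{t.surface}/{t.pos}" for t in tokens)


def parse_side(text: str) -> List[Token]:
    """Inverse of format_side; each item is split on its last '/'."""
    tokens = []
    for item in text.split():
        surface, _, pos = item.rpartition("/")
        tokens.append(Token(surface, pos))
    return tokens


# Hypothetical example pair ("I am a student").
pair = SentencePair(
    ja=parse_side("私/PRON は/PART 学生/NOUN です/AUX"),
    vi=parse_side("Tôi/PRON là/VERB sinh_viên/NOUN"),
)
print(format_side(pair.ja))
print(format_side(pair.vi))
```

Keeping both sides in the same token structure means that later steps, such as
looking for compound nouns on each side, can treat the two languages uniformly.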
The second task of the research is to determine corresponding compound words
in Japanese and Vietnamese texts. It is an application of the bilingual
parallel corpus, which serves as both training and testing data. I decided to
work on compound words because they appear very frequently in both Japanese and
Vietnamese texts, so it is crucial to detect corresponding Japanese-Vietnamese
compound nouns and compound verbs. Most Japanese compound words consist of
Sino-Japanese (Kanji) words, and similarly, Vietnamese compound words largely
consist of Sino-Vietnamese (Han Viet) words. Among these Sino-Japanese and
Sino-Vietnamese words, there are corresponding words with quite similar
structures. Because of this property, detecting corresponding
Japanese-Vietnamese compound words directly is the main processing step. If the
Japanese-Vietnamese dictionary is not sufficient, an indirect approach using
Japanese-English and English-Vietnamese dictionaries is required, because these
dictionaries contain a larger number of words.
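The direct lookup with an indirect fallback through English can be sketched as
follows. This is a minimal illustration, assuming each dictionary is available
as an in-memory mapping from a headword to a list of translations; the function
names and the three sample entries are hypothetical and stand in for the real
dictionaries.

```python
from typing import Dict, List

# Tiny stand-in dictionaries (assumed data, for illustration only).
JA_VI: Dict[str, List[str]] = {"大学": ["đại học"]}
JA_EN: Dict[str, List[str]] = {"図書館": ["library"]}
EN_VI: Dict[str, List[str]] = {"library": ["thư viện"]}


def translate_via_pivot(word: str,
                        ja_en: Dict[str, List[str]],
                        en_vi: Dict[str, List[str]]) -> List[str]:
    """Collect Vietnamese candidates reachable through any English gloss."""
    candidates: List[str] = []
    for english in ja_en.get(word, []):
        for vietnamese in en_vi.get(english, []):
            if vietnamese not in candidates:  # keep order, drop duplicates
                candidates.append(vietnamese)
    return candidates


def translate(word: str,
              ja_vi: Dict[str, List[str]],
              ja_en: Dict[str, List[str]],
              en_vi: Dict[str, List[str]]) -> List[str]:
    """Prefer the direct Japanese-Vietnamese entry; otherwise go via English."""
    direct = ja_vi.get(word)
    if direct:
        return list(direct)
    return translate_via_pivot(word, ja_en, en_vi)


print(translate("大学", JA_VI, JA_EN, EN_VI))    # found directly
print(translate("図書館", JA_VI, JA_EN, EN_VI))  # found through English
```

Pivoting through English tends to produce extra, less precise candidates when
an English gloss is ambiguous, which is one reason to try the direct dictionary
first.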
Research Activity Results
The research has produced several results, in line with the research schedule
set at the beginning of this semester. The main results are:
a. A parallel corpus extracted from the Japanese-Vietnamese dictionary. This
corpus contains about 20,000 sentence pairs in the two languages.
b. A clear view of the approaches to detecting Japanese compound words and
multilingual compound words. In previous semesters, I studied many problems
related to detecting and translating bilingual noun phrases as well as compound
words. I have now determined which method to use to solve my problem.
c. Experiments with several programs for processing the dictionary, the
structure of Japanese compound words, and Vietnamese morphological problems.
This is a very important task because, by knowing these structures, I can
derive the rules for corresponding compound words. A wide range of subsequent
tasks will be based on these rules.
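One possible kind of correspondence rule, given here only as an assumption and
not as the rule set derived in this project, maps each Kanji of a Japanese
compound to its Sino-Vietnamese (Han Viet) reading and then looks for the
resulting syllable sequence in the Vietnamese sentence. The reading table below
is a tiny hypothetical sample, and the function names are illustrative.

```python
import unicodedata
from typing import Dict, Optional

# Hypothetical sample of Kanji -> Han Viet readings (one reading per character).
HAN_VIET: Dict[str, str] = {
    "大": "đại", "学": "học",
    "国": "quốc", "家": "gia",
    "文": "văn", "化": "hóa",
}


def nfc(text: str) -> str:
    """Normalize so composed and decomposed diacritics compare equal."""
    return unicodedata.normalize("NFC", text)


def han_viet_candidate(compound: str, readings: Dict[str, str]) -> Optional[str]:
    """Read the compound character by character; None if any reading is missing."""
    syllables = []
    for ch in compound:
        reading = readings.get(ch)
        if reading is None:
            return None
        syllables.append(nfc(reading))
    return " ".join(syllables)


def find_in_sentence(compound: str, vi_sentence: str,
                     readings: Dict[str, str]) -> Optional[str]:
    """Return the candidate if its syllables occur contiguously in the sentence."""
    candidate = han_viet_candidate(compound, readings)
    if candidate is None:
        return None
    target = candidate.split()
    # Lowercase, split on whitespace, and strip simple punctuation.
    words = [w.strip(".,;:!?") for w in nfc(vi_sentence).lower().split()]
    n = len(target)
    for i in range(len(words) - n + 1):
        if words[i:i + n] == target:
            return candidate
    return None


print(han_viet_candidate("大学", HAN_VIET))
print(find_in_sentence("文化", "Văn hóa Nhật Bản rất phong phú.", HAN_VIET))
```

A rule like this covers only compounds whose Vietnamese counterpart is built
from the same characters in the same order; other compounds would still need
the dictionary-based lookup.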
Conclusion
In general, I have finished the main tasks that I proposed for the Mori Grant
about one year ago. I have finished building a parallel corpus with the
necessary information. I am now working on translating compound words from
Japanese to Vietnamese and finding them in Japanese and Vietnamese texts; this
work is expected to be finished in March.