<<Taikichiro Mori Memorial Research Fund>>
Graduate Student Researcher Development Grant Report
Research Project: Japanese-Vietnamese parallel corpus construction and its application
Project Researcher: Vo Ho Bao Khanh
The research project comprises two main tasks. The first is building a Japanese-Vietnamese parallel corpus, which is the first corpus of this language pair in the world. The second concerns detecting corresponding compound words in Japanese-Vietnamese bilingual texts, a capability that can be applied in other research areas such as co-reference resolution, machine translation, and bilingual information retrieval.
A parallel bilingual corpus is a large-scale collection of articles, paragraphs, or sentences in which each original text in the source language is paired with its translation in the target language. The original data, which can be obtained from newspaper articles or journals, will be translated by skilled translators. The Japanese-Vietnamese corpus can also be collected directly from dictionary example data. After obtaining the raw parallel corpus, I will perform the main task of corpus construction: annotating the corpus with the necessary part-of-speech tags so that it can be exploited for other specific tasks.
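As an illustration of the annotation step, the following minimal sketch shows one possible way to represent a sentence pair annotated with part-of-speech tags and to store it in a simple `word/TAG` text format. The class names, the tag labels (`PRON`, `NOUN`, and so on), the underscore convention for multi-syllable Vietnamese words, and the example sentence are all assumptions made for illustration, not the format actually used in the project.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    surface: str  # word form; multi-syllable Vietnamese words joined by "_"
    pos: str      # part-of-speech tag (the tag set here is hypothetical)


@dataclass
class SentencePair:
    japanese: List[Token]
    vietnamese: List[Token]


def to_line(tokens: List[Token]) -> str:
    """Serialize tokens as space-separated 'surface/POS' items."""
    return " ".join(f"{t.surface}/{t.pos}" for t in tokens)


def parse_line(line: str) -> List[Token]:
    """Parse a 'surface/POS' line back into tokens."""
    tokens = []
    for item in line.split(" "):
        # Split on the last slash so a surface form may itself contain "/".
        surface, pos = item.rsplit("/", 1)
        tokens.append(Token(surface, pos))
    return tokens


# Illustrative example pair ("I am a student"); not taken from the corpus.
pair = SentencePair(
    japanese=[Token("私", "PRON"), Token("は", "PART"),
              Token("学生", "NOUN"), Token("です", "AUX")],
    vietnamese=[Token("Tôi", "PRON"), Token("là", "VERB"),
                Token("sinh_viên", "NOUN")],
)

print(to_line(pair.japanese))
print(to_line(pair.vietnamese))
```

Keeping both sides of a pair in one record makes it straightforward to use the same data later as training and test material for word-level tasks.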
The second task of the research is to determine corresponding compound words in Japanese and Vietnamese texts. It is an application of the parallel corpus, which serves as training and test data. I chose to work on compound words because they occur very frequently in both Japanese and Vietnamese texts, so detecting corresponding Japanese-Vietnamese compound nouns and compound verbs is crucial. Most Japanese compound words consist of Sino-Japanese (Kanji) words, and similarly, Vietnamese compound words largely consist of Sino-Vietnamese (Han Viet) words. Among these Sino-Japanese and Sino-Vietnamese words, there are corresponding words with quite similar structures. Because of this property, direct detection of corresponding Japanese-Vietnamese compound words is the main processing step. When the Japanese-Vietnamese dictionary is insufficient, an indirect approach using Japanese-English and English-Vietnamese dictionaries is required, because these dictionaries contain a larger number of words.
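The direct and indirect dictionary approaches described above can be sketched as follows. This is only a minimal illustration under simplifying assumptions: the dictionaries are plain Python mappings, the function names are hypothetical, and the few entries are toy examples rather than data from the dictionaries used in the project.

```python
from typing import Dict, List

# Toy dictionaries for illustration only.
JA_VI: Dict[str, List[str]] = {"経済": ["kinh tế"], "大学": ["đại học"]}
JA_EN: Dict[str, List[str]] = {"図書館": ["library"]}
EN_VI: Dict[str, List[str]] = {"library": ["thư viện"]}


def lookup_direct(word: str, ja_vi: Dict[str, List[str]]) -> List[str]:
    """Return Vietnamese candidates from the Japanese-Vietnamese dictionary."""
    return list(ja_vi.get(word, []))


def lookup_via_english(word: str,
                       ja_en: Dict[str, List[str]],
                       en_vi: Dict[str, List[str]]) -> List[str]:
    """Return Vietnamese candidates by pivoting through English."""
    candidates: List[str] = []
    for english in ja_en.get(word, []):
        for vietnamese in en_vi.get(english, []):
            if vietnamese not in candidates:  # keep order, drop duplicates
                candidates.append(vietnamese)
    return candidates


def translate_compound(word: str) -> List[str]:
    """Try the direct dictionary first; fall back to the English pivot."""
    direct = lookup_direct(word, JA_VI)
    if direct:
        return direct
    return lookup_via_english(word, JA_EN, EN_VI)


print(translate_compound("経済"))    # found directly
print(translate_compound("図書館"))  # found only through English
```

Pivoting through English generally increases coverage but can introduce wrong candidates when an English word is ambiguous, so in practice the indirect candidates would need to be checked against the parallel text.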
Research Activity Results
The research has produced several results in line with the schedule set at the beginning of this semester. The main results are:
a. A parallel corpus extracted from the Japanese-Vietnamese dictionary. This corpus contains about 20,000 sentence pairs in the two languages.
b. A clear view of the approaches to detecting Japanese compound words and multilingual compound words. In previous semesters, I studied many problems related to detecting and translating bilingual noun phrases and compound words. I have now determined which method to use to solve my problem.
c. Experiments with several programs for processing the dictionary, the structure of Japanese compound words, and Vietnamese morphological problems. This task is very important because, knowing these structures, I can derive the rules governing corresponding compound words. A wide range of subsequent tasks will build on these rules.
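To give a concrete idea of what a structural correspondence rule might look like, the sketch below composes a Sino-Vietnamese (Han Viet) candidate from the characters of a Japanese Kanji compound and checks whether it occurs in a Vietnamese sentence. This is a hypothetical example of one such rule, not the rule set derived in the project; the reading table contains only a few illustrative characters, and the function names are assumptions.

```python
from typing import Dict, Optional

# Tiny illustrative table of Han Viet readings for single Kanji.
HAN_VIET: Dict[str, str] = {
    "経": "kinh", "済": "tế",
    "大": "đại", "学": "học", "生": "sinh",
}


def han_viet_candidate(compound: str,
                       readings: Dict[str, str]) -> Optional[str]:
    """Compose a Vietnamese candidate character by character.

    Returns None if any character has no known reading.
    """
    syllables = []
    for character in compound:
        reading = readings.get(character)
        if reading is None:
            return None
        syllables.append(reading)
    return " ".join(syllables)


def appears_in(candidate: str, vietnamese_sentence: str) -> bool:
    """Crude check: case-insensitive substring match."""
    return candidate.lower() in vietnamese_sentence.lower()


candidate = han_viet_candidate("経済", HAN_VIET)
print(candidate)
print(appears_in(candidate, "Kinh tế Việt Nam đang phát triển"))
```

A character-by-character rule of this kind only works for compounds whose Vietnamese counterpart keeps the same character order, so it would complement dictionary lookup rather than replace it.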
In general, I have finished the main tasks that I proposed for the Mori Grant about one year ago. I have finished building a parallel corpus with the necessary information. I am now working on translating compound words from Japanese to Vietnamese and on finding them in Japanese and Vietnamese texts; this work is expected to be finished in March.