A Vietnamese Natural Language Interface to Database
Dat Tien Nguyen
University of Engineering and Technology, Vietnam National University, Hanoi
Email: datnt88@gmail.com
Tam Duc Hoang
University of Engineering and Technology, Vietnam National University, Hanoi
Email: tamhd1990@gmail.com
Son Bao Pham
University of Engineering and Technology, Vietnam National University, Hanoi
Email: sonpb@vnu.edu.vn
Abstract—In this paper, we present a Vietnamese Natural Language Interface to a survey database for individuals and businesses who want to obtain economic information from economic surveys. We analyze various Vietnamese question types and investigate the stability of our approach using the GATE framework and the R language. Our system uses R specifically to handle statistical question types. Experimental results are promising: our system achieves an accuracy of 78.5% on the economic survey data domain.
I. INTRODUCTION
A computer database is generally accessed through technical commands or queries, which require a certain level of database knowledge from users. A Natural Language Interface (NLI) aims to let any user question the database in natural language. The database can be an SQL database or a spreadsheet. An NLI system therefore needs to take a natural language question and turn it into an appropriate technical query for the database.
With recent advances in Natural Language Processing and Human-Computer Interaction, users are no longer expected to remember the complex structures of machine languages in order to interact with computers. Instead, users expect computers to be intelligent and to interact with humans in a natural way. An NLI has therefore become necessary for facilitating the interaction between users and the database. In this paper, we propose an approach to building a Vietnamese Natural Language Interface to Database. The database of our system was collected from an economic survey of about three thousand companies in Vietnam, which is described in Section IV.
Our system has two components: the Question Analysis module and the Result Computing module. The first component identifies question types and extracts core information from users' questions. For the domain of economic survey data, users are interested in various statistical measures. The second component identifies the particular data that the user requests and computes the target statistics. The statistics are computed by translating the user's question into statements in R, a popular statistical programming language [1], [2].
Our paper is organized as follows. In Section II, we review existing approaches to developing question answering systems. In Section III, we present our system, and we evaluate it in Section IV. Conclusions and future work are presented in Section V.
II. RELATED WORK
Esther Kaufmann, Abraham Bernstein, and Renato Zumstein introduced Querix, a natural language interface for querying ontologies based on clarification dialogs [4]. It extracts the sequence of the main word categories: Noun, Verb, Preposition, Wh-Word and Conjunction. Its core component, the matching center, then performs three steps: identifying the subject-property-object patterns of a query, searching for all matches between the synonyms in the ontology, and composing SPARQL queries from the joined triples. Their system achieved an average recall of 87.11% and an average precision of 86.08%, showing that a simply structured method can be effective.

PRECISE [5] is an NLIDB question answering system that takes a natural language question as input and generates a corresponding SQL query. PRECISE showed high precision (over 80% on a list of hundreds of English questions). However, PRECISE requires all tokens in an input question to be distinct and to appear in its lexicon.
Aqualog [6] is an ontology-based question answering system for English. Aqualog takes a natural language question and an ontology as input, and returns an answer based on a semantic analysis of the question and the corresponding elements in the ontology. The system uses several language processing resources, such as word segmentation, sentence segmentation and part-of-speech tagging, to analyze the semantics and syntax of the user's question. The natural language question is transformed into a Query-Triple of the form (generic term, relation, second term). The system uses JAPE grammars in GATE to identify terms and their relations, and then computes the exact result from the database.
For Vietnamese, K. Nguyen and H. Le [7] introduced an NLIDB (Natural Language Interface to Databases) question answering system that employs semantic grammars. Their system contains two main modules: QTRAN and TGEN. The Query Translator (QTRAN) maps a natural language question to an SQL query, and the Text Generator (TGEN) generates answers. QTRAN parses the question into a syntax tree with the CYK algorithm, using a limited context-free grammar. The syntax tree is then converted into an SQL query using a mapping dictionary that determines the names of attributes. The TGEN module finds the relationships between meta-data and relations in the database tables to generate exact answers. Pattern-based and keyword-based approaches are used in the TGEN module.

2012 IEEE Sixth International Conference on Semantic Computing
Building on previous research for English, V. M. Tra, V. D. Nguyen, O. T. Tran, U. T. T. Pham, and T. Q. Ha [8] proposed a Vietnamese question answering system that combines the Snowball system [9] with semantic relation extraction using search engines. Their experimental results on the travel domain indicate that the method is suitable for Vietnamese question answering: they achieved 89.7% precision and were able to give answers in 91.4% of cases.
III. OUR VIETNAMESE NATURAL LANGUAGE INTERFACE TO DATABASE
Our system is implemented in Java as a web application using a client-server model. The general architecture of our NLI system is shown in Figure 1.
Figure 1: System Architecture
A. Question Analysis Component
In our domain, a typical question contains three parts: the question term, the question type and the question information. The question term indicates the target answer type. The question type captures the statistical measure that the user is after. The question information describes the requested data and is used to retrieve the corresponding fields in the database.
For example, consider the question "Cho biết trung bình doanh số thu được trong năm 2003 là bao nhiêu?". The phrases "Cho biết" (tell me) and "là bao nhiêu" (how much) are Question Terms. The word "trung bình" (average) indicates the statistical type of the question, which is the average value in this case. The phrase "doanh số thu được trong năm 2003" (sales gained in 2003) contains the main content of the question, capturing the core data that the user is looking for. The question above has the same meaning and similar components as the following: "Năm 2003, doanh số trung bình thu được là bao nhiêu?".
The task of the Question Analysis Component is to return the Question Term, Question Type and Question Information for a user's input question. Figure 4 shows how the Question Analysis Component works.
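The decomposition described above can be illustrated with a minimal Python sketch. It is not the JAPE/GATE implementation used in our system: the phrase lists QUESTION_TERMS and QUESTION_TYPES and the function analyze_question are hypothetical, and plain substring matching stands in for the grammar rules.

```python
# Hypothetical lexicons for illustration; the actual system uses JAPE rules in GATE.
QUESTION_TERMS = ["cho biết", "là bao nhiêu", "bao nhiêu", "tìm ra", "như thế nào"]
QUESTION_TYPES = {"trung bình": "average", "tổng": "sum"}


def analyze_question(question):
    """Split a question into question terms, question type and question information."""
    text = question.lower().strip(" ?.")
    terms = []
    # Try longer phrases first so that "là bao nhiêu" wins over "bao nhiêu".
    for phrase in sorted(QUESTION_TERMS, key=len, reverse=True):
        if phrase in text:
            terms.append(phrase)
            text = text.replace(phrase, " ")
    question_type = None
    for phrase, label in QUESTION_TYPES.items():
        if phrase in text:
            question_type = label
            text = text.replace(phrase, " ")
            break
    # Whatever remains is treated as the question information.
    info = " ".join(text.replace(",", " ").split())
    return {"terms": sorted(terms), "type": question_type, "info": info}
```

Under these assumptions, both example questions above yield the type "average", and the remaining words of the second question are a subset of those of the first.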
1) Detecting Question Term: Questions generally contain not only the requested information but also special words or phrases. These special words or phrases, called Question Terms, often appear at the beginning or the end of the question. Examples of Vietnamese question terms are "tìm ra" (list), "bao nhiêu" (how many) and "như thế nào" (how). We built JAPE rules to detect Question Terms and integrated the results into the Vietnamese word segmenter VnTokenizer [10] so that question terms are correctly recognized as words. An example of a JAPE rule is shown in Figure 2.
Figure 2: A JAPE pattern that captures a Question Term
2) Question Type: There are many types of questions in Vietnamese. Common question types include Yes/No, Calculating, Give Reason (Why) and Comparing. In this paper, however, we work with a survey database that contains statistical economic information, and we build a statistical question answering system that specifically meets users' needs for statistical information about the survey data. Our system therefore focuses on questions that compute statistical measures such as average, sum, count, deviation and decline.
The question type is indicative of the answer type. It also guides the R Engine of the second component in computing results from the survey database. We created JAPE rules that use noun phrase annotations and the question-word information identified during preprocessing to identify the question type. Figure 3 shows an example of a JAPE pattern that captures the deviation question type.
Figure 3: A JAPE pattern that captures the deviation question type
B. Result Computing Component
The task of this component is to find the columns in the database that correspond to the Question Information. To do this, the system analyzes the survey questionnaire and ranks the information of each question in the questionnaire. We call this process Relevant Data Retrieval. After determining the required columns in the database, the R Engine uses the columns and the Question Type as parameters to compute the final result.
Figure 4: Question Analysis Component
1) Relevant Data Retrieval: Before a question can be answered, the relevant knowledge sources must be identified. If the answer to a question is not available in the data sources, a correct result cannot be found, no matter how well the other components work.
This component uses the question information to determine the relevant data in the database. This step is very important because it must give the exact parameters (columns) to the R Engine for computing results on the survey database. The meaning of each column in the database is represented by a phrase, so the most straightforward approach is to match the question information of the input question against the column phrase. Exact word matching alone, however, achieves only limited results. To make use of words with similar meanings, one can construct synonym dictionaries using Term Extraction [11], [12] or Term Collocation [13].
For our system, we construct synonym dictionaries for the economic and statistical data domain.
Vietnamese Synonyms
By identifying semantic similarity [14], we automatically build a synonym dictionary for all nouns, verbs and adjectives that appear in the survey questionnaire data. Because users sometimes use abbreviations, we also add abbreviations to the synonym dictionary. All words and abbreviations that have the same meaning are organized into one group, so they are treated as equivalent in the ranking step.
For example, "công ty" (company), "doanh nghiệp" (enterprise), "DN" and "Cty" are synonyms, where "DN" and "Cty" are abbreviations for "doanh nghiệp" (enterprise) and "công ty" (company) respectively.
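One simple way to organize such groups is to map every member of a group to a single representative. The Python sketch below assumes this design; the names SYNONYM_GROUPS, build_canonical_index and canonical are hypothetical, and the group is entered by hand here rather than derived from semantic similarity [14].

```python
# One hand-written group taken from the example above; the real dictionary
# is built automatically from semantic similarity.
SYNONYM_GROUPS = [
    {"công ty", "doanh nghiệp", "dn", "cty"},
]


def build_canonical_index(groups):
    """Map every word of a group to one representative of that group."""
    index = {}
    for group in groups:
        representative = min(group)  # any deterministic choice works
        for word in group:
            index[word] = representative
    return index


def canonical(word, index):
    """Return the group representative, or the lower-cased word if it has no group."""
    key = word.lower()
    return index.get(key, key)
```

With this index, "DN", "Cty" and "doanh nghiệp" all map to the same representative, so they compare as equal in the ranking step.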
Ranking Questioned Information
This step uses the question's segmented words [10] and the synonym dictionary to rank the relevant columns in the questionnaire data (shown in Figure 5).
Figure 5: Ranking questioned information process
In Vietnamese, tokens or phrases can often change position within a sentence while the sentence keeps the same meaning. In this paper, we therefore use the Bag of Words approach to represent the information in the questionnaire and in input questions. A question is represented by its segmented words and their corresponding synonyms. Columns in the database are then ranked by the number of tokens they match from the input question.
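The ranking can be sketched in Python as follows. This is an illustration under stated assumptions, not our Java implementation: the column descriptions in COLUMNS (including the column names Q5 and Q6), the SYNONYMS map, the underscore-joined token format and the function names are all hypothetical.

```python
# Hypothetical column descriptions, already segmented into words.
COLUMNS = {
    "Q5": ["doanh_số", "năm", "2003"],
    "Q6": ["doanh_số", "năm", "2004"],
    "Q11": ["số_lượng", "công_nhân", "công_ty"],
}
# Hypothetical synonym index: word -> group representative.
SYNONYMS = {"doanh_nghiệp": "công_ty", "cty": "công_ty", "dn": "công_ty"}


def normalize(tokens):
    """Turn a token list into a bag (set) of lower-cased group representatives."""
    return {SYNONYMS.get(t.lower(), t.lower()) for t in tokens}


def rank_columns(question_tokens, columns):
    """Rank columns by the number of tokens shared with the question."""
    bag = normalize(question_tokens)
    scores = {name: len(bag & normalize(words)) for name, words in columns.items()}
    # Highest score first; ties are ordered by column name for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

Because both sides are treated as bags of words, reordering the question tokens does not change the ranking, and an abbreviation such as "DN" matches a column described with "công_ty".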
However, users' questions are sometimes ambiguous, which leads to several columns in the database receiving the same rank. In these situations, we present the tied columns to the user as suggestions for related information. We think this is very useful because in many cases users ask one question but expect all related answers. For instance, when the user asks "Cho biết tổng số lao động năm 2003 là bao nhiêu?" (the number of employees in 2003), our system also suggests related information by providing "the number of employees in 2004".
Our system ensures that users always get an answer. Even if the answer is not entirely correct, it is the system's best attempt to find the most relevant one.
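A minimal Python sketch of this behavior is given below, assuming the ranking step has already produced (column, score) pairs sorted by descending score. The function name answer_with_suggestions and the column names used in the usage note are hypothetical.

```python
def answer_with_suggestions(ranked):
    """Pick the top-ranked column and collect the columns tied with it.

    ranked: list of (column, score) pairs sorted by descending score.
    """
    if not ranked:
        return None, []
    best_column, best_score = ranked[0]
    # Columns sharing the top score are offered as related information.
    suggestions = [column for column, score in ranked[1:] if score == best_score]
    return best_column, suggestions
```

For example, given [("Q20", 3), ("Q21", 3), ("Q11", 1)], the sketch answers from Q20 and suggests Q21. A best column is returned whenever the ranking is non-empty, however low its score.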
2) R Engine: The R language provides a wide variety of statistical and graphical tools, including linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and others.
The task of the R Engine is to compute the statistical result for the user's question. The R Engine takes the columns in the survey database and the Question Type as parameters to compute the result. In addition, to improve the interaction between users and the system, the R Engine also calculates results for the other top-ranked columns so that related answers can be presented to users.
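The parameterization by Question Type and columns can be sketched as a dispatch table. The Python code below is a stand-in for the R Engine, not the R statements our system generates: the STATISTICS mapping, the treatment of missing values and the reading of "deviation" as sample standard deviation (as computed by R's sd) are assumptions.

```python
import statistics

# Assumed mapping from question type to a statistic; the real system calls R.
STATISTICS = {
    "average": statistics.mean,
    "sum": sum,
    "count": len,
    "deviation": statistics.stdev,  # sample standard deviation (n - 1 denominator)
}


def compute(question_type, column_values):
    """Apply the statistic selected by the question type to one column."""
    values = [v for v in column_values if v is not None]  # skip missing answers
    return STATISTICS[question_type](values)


def compute_for_ranked(question_type, ranked_columns, table):
    """Compute the same statistic for every top-ranked column."""
    return {name: compute(question_type, table[name]) for name in ranked_columns}
```

Calling compute_for_ranked with the best column and its tied neighbors gives the main answer together with the related answers in one pass.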
IV. EXPERIMENTAL RESULTS
A. Data setup
We use a large economic survey data set to test our NLI system. The survey was conducted on approximately three thousand small and medium-sized enterprises in Vietnam. The questionnaire includes many closed questions about the financial situation of a company, as well as questions on organizational methods, the number of employees, salary status and so on. All participating companies fully completed the questionnaire.
For example:
Question ("Câu hỏi") Q11: "Cho biết số lượng công nhân của công ty?" (How many employees are there in your company?)
Answer ("Trả lời"): "người" (employees)
The answers of each company to the questionnaire form one row in the database, and each question is a column. For the question above, the answers of all companies are stored in the column named Q11. The survey results are stored in a ".csv" file that contains 300 columns and 2975 rows.
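This layout (one row per company, one column per question) can be read into per-column lists as in the Python sketch below. The sample text and the column name Q10 are made up for illustration, and load_columns is a hypothetical helper; only the column name Q11 comes from the questionnaire example above.

```python
import csv
import io

# Tiny made-up sample with the same layout as the survey file:
# one row per company, one column per question.
SAMPLE = "Q10,Q11\n5,120\n7,45\n3,300\n"


def load_columns(text):
    """Read CSV text into a dict mapping each column name to its list of values."""
    reader = csv.DictReader(io.StringIO(text))
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            # Empty cells are kept as None so that later steps can skip them.
            columns[name].append(float(value) if value else None)
    return columns
```

Here load_columns(SAMPLE)["Q11"] holds the answers of all three sample companies to question Q11.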
Experimental setup
We collected 500 real questions from users seeking information from the survey. 300 questions were used in the training phase to build our system. The remaining 200 questions were used to evaluate the correctness of our system.
B. Evaluation and results
Of these 200 questions, 192 are correctly processed by the Question Analysis component, giving 96% accuracy. The 4% of errors are due to the limited coverage of our JAPE grammars in capturing question types and Question Terms.
After testing the first component, we evaluate the overall correctness of the system. The results are shown in Table I.
Table I: Correctness of the system
Number of questions    Percent
Our system gives correct answers to 157 questions. Incorrect results are returned for 35 questions. This is caused by ambiguity in the word segmentation performed by VnTokenizer [10]. For example, consider the question "Doanh số trung bình các doanh nghiệp năm 2004?" (What is the average income of all companies in 2004?). After segmentation, we have the phrase set: doanh-số (sales), trung-bình (average), các-doanh-nghiệp (all enterprises), năm-2004 (year 2004). VnTokenizer treats năm-2004 (year 2004) as a single phrase in Vietnamese, whereas it should be two words, "năm" (year) and "2004". In this and other such cases, the correct answer can be found among the related answers suggested by the system.
Figure 6: Correct answers with suggestions
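The two accuracy figures follow directly from the counts reported above. The short Python check below only restates that arithmetic; the helper name accuracy is hypothetical.

```python
def accuracy(correct, total):
    """Return the percentage of correctly handled questions."""
    return 100.0 * correct / total


TEST_QUESTIONS = 200
question_analysis = accuracy(192, TEST_QUESTIONS)  # Question Analysis component
end_to_end = accuracy(157, TEST_QUESTIONS)         # whole system
# Questions analyzed correctly but answered incorrectly: 192 - 157 = 35.
analyzed_but_wrong = 192 - 157
```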
Interesting Results
Our system is able to answer questions with a general meaning. When a user asks a general question such as "Hãy cho biết tổng doanh số theo năm?" (Calculate the sum of income in each year), our system returns all related results. In other words, it computes the income information for the years 2003 and 2004 stored in the survey database.
V. CONCLUSIONS
Question answering research attempts to deal with a wide range of question types. It requires more sophisticated techniques than traditional information retrieval tasks such as document retrieval.
In this paper, we propose an approach that combines Vietnamese language processing techniques with the R language to develop a complete Natural Language Interface system for individuals and businesses who want to get statistical information from economic surveys.
We introduce a Vietnamese natural language interface to a survey database. Our system consists of two components, namely Question Analysis and Result Computing. Experimental results of the system on a wide range of statistical questions are promising: our system achieves an accuracy of 78.5%. In addition, the system helps users find related information in the survey database by suggesting related results.
In the future, we will improve the accuracy of the system by using Vietnamese grammar in analyzing questions, and we will include additional question types.
REFERENCES
[1] J. Fox and R. Andersen, “Using the R statistical computing environment to teach social statistics courses,” Department of Sociology, McMaster University, 2006.
[2] A. Vance, “Data analysts captivated by R’s power,” New York Times, 2009.
[3] G. G. Hendrix, E. D. Sacerdoti, D. Sagalowicz, and J. Slocum, “Developing a natural language interface to complex data,” ACM Transactions on Database Systems (TODS), 1978.
[4] E. Kaufmann, A. Bernstein, and R. Zumstein, “Querix: A natural language interface to query ontologies based on clarification dialogs,” in Proceedings of the 5th International Semantic Web Conference (ISWC 2006), Athens, GA, 2006.
[5] A. Popescu, O. Etzioni, and H. Kautz, “Towards a theory of natural language interfaces to databases,” in Proceedings of IUI, 2003.
[6] V. Lopez, V. Uren, E. Motta, and M. Pasin, “Aqualog: An ontology-driven question answering system for organizational semantic intranets,” Journal of Web Semantics, 5(2):72-105, Elsevier, 2007.
[7] T. H. Le and K. A. Nguyen, “Natural language interface construction using semantic grammars,” in Proceedings of PRICAI, 2008.
[8] V. M. Tra, V. D. Nguyen, O. T. Tran, U. T. T. Pham, and T. Q. Ha, “An experimental study of Vietnamese question answering system,” International Conference on Asian Language Processing (IALP), 2009.
[9] E. Agichtein, L. Gravano, J. Pavel, V. Sokolova, and A. Voskoboynik, “Snowball: A prototype system for extracting relations from large text collections,” ACM Special Interest Group on Management of Data (SIGMOD), 2001.
[10] D. D. Pham, G. B. Tran, and S. B. Pham, “A hybrid approach to Vietnamese word segmentation using part of speech tags,” International Conference on Knowledge and Systems Engineering (KSE), 2009.
[11] Y. Sasaki, “Question answering as question-biased term extraction: a new approach toward multilingual QA,” in Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL ’05), 2005.
[12] T. R. Lynam, C. L. A. Clarke, and G. V. Cormack, in Proceedings of the First International Conference on Human Language Technology Research (HLT ’01), 2001.
[13] N. J. McCracken, A. R. Diekema, G. Ingersoll, S. C. Harwell, E. E. Allen, O. Yilmazel, and E. D. Liddy, “Modeling reference interviews as a basis for improving automatic QA systems,” in Proceedings of the Interactive Question Answering Workshop at HLT-NAACL (IQA ’06), 2006.
[14] D. T. Nguyen and S. B. Pham, “Finding the semantic similarity in Vietnamese,” International Conference on Asian Language Processing (IALP), 2010.