
A Vietnamese Natural Language Interface to Database

Dat Tien Nguyen
University of Engineering and Technology, Vietnam National University, Hanoi
Email: datnt88@gmail.com

Tam Duc Hoang
University of Engineering and Technology, Vietnam National University, Hanoi
Email: tamhd1990@gmail.com

Son Bao Pham
University of Engineering and Technology, Vietnam National University, Hanoi
Email: sonpb@vnu.edu.vn

Abstract—In this paper, we present a Vietnamese Natural

Language Interface to a survey database for individuals and businesses who want to obtain economic information from economic surveys. We analyze various Vietnamese question types and investigate the stability of our approach using the GATE framework and the R language. Our system uses R specifically to handle statistical question types. Experimental results are promising: our system achieves an accuracy of 78.5% on the economic survey data domain.

I. INTRODUCTION

Generally, a computer database is accessed through technical commands or queries that require a certain level of database knowledge from users. A Natural Language Interface (NLI) aims to allow any user to ask questions of the database in natural language. The database can be an SQL database or a spreadsheet. Thus, an NLI system needs to take a natural language question and turn it into an appropriate technical query for the database.

With recent advances in Natural Language Processing and Human-Computer Interaction, users are no longer expected to remember the complex structures of a machine language to interact with computers. Instead, users expect computers to be intelligent and able to interact with humans in a natural way. NLIs have therefore become a necessary requirement in facilitating the interaction between users and databases. In this paper, we propose an approach to building a Vietnamese Natural Language Interface to Database. The database of our system is collected from an economic survey of about three thousand companies in Vietnam, which is described in more detail in Section IV.

Our system has two components: the Question Analysis module and the Result Computing module. The first component identifies question types and extracts core information from users' questions. For the domain of economic survey data, users are interested in various statistical measures. The second component identifies the particular data that the user requests and computes the target statistics. The statistics are computed by formulating the user's question as statements in R, a popular statistical programming language [1], [2].

Our paper is organized as follows. In Section II, we review some existing approaches that have been proposed for developing a question answering system. In Section III, we present our system, and we evaluate it in Section IV. The conclusion and future work are presented in Section V.

II. RELATED WORK

Esther Kaufmann, Abraham Bernstein, and Renato Zumstein introduced QUERIX (a natural language interface to query ontologies based on clarification dialogs) [4]. It extracts the sequence of the main word categories: Noun, Verb, Preposition, Wh-Word and Conjunction. Then its core component, the matching center, performs three steps: identifying the subject-property-object patterns of a query, searching for all matches between the synonyms in the ontology, and composing SPARQL queries from the joined triples. Their system achieved an average recall of 87.11% and an average precision of 86.08%, which suggests that a method with a simple structure can be effective.

PRECISE [5] is an NLIDB question answering system that takes a natural language question as input and generates a corresponding SQL query. PRECISE showed high precision (over 80% on a list of hundreds of English questions). However, PRECISE requires all tokens in input questions to be distinct and to appear in its lexicon.

Aqualog [6] is an ontology-based question answering system for English. Aqualog takes a natural language question and an ontology as its input, and returns an answer based on the semantic analysis of the question and the corresponding elements in the ontology. The system uses many language processing resources, such as word segmentation, sentence segmentation and part-of-speech tagging, to analyze the semantics and syntax of the user's question. The natural language question is transformed into a Query-Triple with the format (generic term, relation, second term). The system uses JAPE grammars in GATE to identify terms and their relations, then computes the exact result from the database.

In Vietnamese research, K. Nguyen and H. Le [7] introduced an NLIDB (Natural Language Interface to DataBases) question answering system for Vietnamese that employs semantic grammars. Their system contains two main modules: QTRAN and TGEN. The Query Translator (QTRAN) maps a natural language question to an SQL query, and the Text Generator (TGEN) generates answers. QTRAN parses the question into a syntax tree with the CYK algorithm, using a limited context-free grammar. The syntax tree is then converted into an SQL query using a mapping dictionary that determines the names of attributes. The TGEN module finds the relationships between metadata and relations in the database tables to generate exact answers. Pattern-based and keyword-based approaches are used in the TGEN module.

2012 IEEE Sixth International Conference on Semantic Computing

Based on previous research for English, V. M. Tra, V. D. Nguyen, O. T. Tran, U. T. T. Pham, and T. Q. Ha [8] proposed an implementation of a Vietnamese question answering system that combines the SnowBall system [9] with semantic relation extraction using search engines. Their experimental results on the traveling domain indicate that the proposed method is suitable for a Vietnamese question answering system: they achieved 89.7% precision and a 91.4% ability to give answers.

III. OUR VIETNAMESE NATURAL LANGUAGE INTERFACE TO DATABASE

Our system is implemented in Java as a web application using a client-server model. The general architecture of our NLI system is shown in Figure 1.

Figure 1: System Architecture

A. Question Analysis Component

In our domain, a typical question contains three parts: the question term, the type of question and the question information. The question term indicates the target answer type. The question type captures the statistical measure that users are after. The question information is the type of requested data, which is used to retrieve the corresponding fields in the database.

For example, consider the following question: "Cho biết trung bình doanh số thu được trong năm 2003 là bao nhiêu?". The phrases "Cho biết" and "là bao nhiêu" (how much) are Question Terms. The word "trung bình" (average) indicates the statistical type of the question, which is the average value in this case. The phrase "doanh số thu được trong năm 2003" (sales gained in 2003) contains the main content of the question, capturing the core data that users are looking for. This question has the same meaning and similar question components as the following: "Năm 2003, doanh số trung bình thu được là bao nhiêu?".

The task of the Question Analysis Component is to return the Question Term, Type of Question and Question Information from a user's input question. Figure 4 shows how the Question Analysis Component works.

1) Detecting Question Term: Questions generally contain not only the requested information but also special words or phrases. These special words or phrases, called Question Terms, often appear at the beginning or the end of the question. Examples of Vietnamese question terms are "tìm ra" (list), "bao nhiêu" (how many) and "như thế nào" (how). We built JAPE rules to detect Question Terms and integrated the results into the Vietnamese word segmenter VnTokenizer [10] so that those question terms are correctly recognized as words. An example of a JAPE rule is shown in Figure 2.

Figure 2: A JAPE pattern capturing a Question Term

2) Question Type: There are many types of questions in Vietnamese. Some common question types are Yes/No, Calculating, Give Reason (Why) and Comparing. However, in this paper we work with a survey database that contains statistical economic information. We build a statistical question answering system that specifically meets users' needs for statistical information about the survey data. Thus, our system focuses on questions that compute statistical measures such as average, sum, count, deviation and decline.

The question category is indicative of the answer type. It also guides the R Engine of the second component in computing results from the survey database. To identify the question type, we created JAPE rules that use noun phrase annotations and the question-word information identified during preprocessing. Figure 3 shows an example of a JAPE pattern that captures the deviation question type annotation.

Figure 3: A JAPE pattern capturing the deviation question type

B. Result Computing Component

The task of this component is to find the columns in the database that correspond to the Question Information. To do this, the system analyzes the survey questionnaire and ranks the information of each question in the questionnaire. We call this process Relevant Data Retrieval. After determining the required columns in the database, the R Engine uses the columns and the Question Type as parameters to compute the final result.

Figure 4: Question Analysis Component

1) Relevant Data Retrieval: Before a question can be answered, the relevant knowledge sources must be identified. If the answer to a question is not available in the data sources, a correct result will not be found, no matter how well the other components work.

This component uses the question information to determine the relevant data in the database. This part is very important because it needs to give the exact parameters (columns) to the R Engine for computing results on the survey database. The meaning of each column in the database is represented by a phrase, so the most straightforward way is to match the question information of the input question against the column phrase. Obviously, exact word matching alone would achieve only limited results. To make use of words with similar meanings, one can construct synonym dictionaries using Term Extraction [11], [12] or Term Collocation [13].

For our system, we construct synonym dictionaries for the economic and statistical data domain.

Vietnamese Synonyms

By identifying semantic similarity [14], we automatically build a synonym dictionary for all nouns, verbs and adjectives that appear in the survey questionnaire data. Because users sometimes use abbreviations, we also add abbreviations to the synonym dictionary. All words and abbreviations that have the same meaning are organized in one group. Hence, they are treated as equivalent in the ranking step.

For example, "công ty" (company), "doanh nghiệp" (enterprise), "DN" and "Cty" are synonyms, where "DN" and "Cty" are abbreviations for "doanh nghiệp" (enterprise) and "công ty" (company) respectively.
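One way to picture a synonym group is as a set of words that all map to a single representative. The Python sketch below illustrates this with the group from the example above; the helper names `build_lookup` and `canonicalize` are hypothetical, and choosing the alphabetically smallest member as the representative is an arbitrary assumption made here, not something stated in the paper.

```python
# One synonym group taken from the example in the text; a real dictionary
# would contain many such groups built from the questionnaire vocabulary.
SYNONYM_GROUPS = [
    {"công ty", "doanh nghiệp", "dn", "cty"},
]


def build_lookup(groups):
    """Map every word in a group to one representative of that group."""
    lookup = {}
    for group in groups:
        representative = min(group)  # arbitrary but deterministic choice
        for word in group:
            lookup[word] = representative
    return lookup


def canonicalize(tokens, lookup):
    """Replace each token by its group representative, if it has one."""
    return [lookup.get(token.lower(), token.lower()) for token in tokens]


lookup = build_lookup(SYNONYM_GROUPS)
# "DN" and "công ty" end up with the same representative.
print(canonicalize(["DN"], lookup) == canonicalize(["công ty"], lookup))
```

With this representation, two tokens count as a match in a later ranking step whenever their representatives are equal.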

Ranking questioned Information

This step uses the question's segmented words [10] and the synonym dictionary to rank the relevant columns in the questionnaire data (shown in Figure 5).

In Vietnamese, tokens or phrases can change position with each other to form different sentences that still keep the same meaning. In this paper, we use the Bag of Words approach to represent the information in the questionnaire and in input questions. A question is represented by its segmented words and their corresponding synonyms. Columns in the database are then ranked by the number of tokens matched from the input question.

Figure 5: Ranking questioned information process
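The bag-of-words ranking described above can be sketched in a few lines of Python. This is an illustrative approximation under stated assumptions: the column `Q20` and both column phrases are invented for the example (only the name `Q11` appears in the paper), tokens are assumed to be already segmented, and `rank_columns` is a hypothetical function name.

```python
def rank_columns(question_tokens, column_phrases, synonyms=None):
    """Rank columns by how many (synonym-normalized) tokens they share with the question."""
    synonyms = synonyms or {}

    def bag(tokens):
        # A bag of words ignores order, so reordered questions score the same.
        return {synonyms.get(t.lower(), t.lower()) for t in tokens}

    question_bag = bag(question_tokens)
    scores = {
        column: len(question_bag & bag(tokens))
        for column, tokens in column_phrases.items()
    }
    # Highest score first; ties are ordered by column name for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical segmented column phrases.
COLUMNS = {
    "Q11": ["số lượng", "công nhân", "công ty"],
    "Q20": ["doanh số", "năm", "2003"],
}
SYNONYMS = {"doanh nghiệp": "công ty"}

print(rank_columns(["trung bình", "doanh số", "năm", "2003"], COLUMNS, SYNONYMS))
```

Because only set intersection is used, the two phrasings of a question that differ in word order receive identical rankings.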

However, users' questions are sometimes ambiguous, leading to several columns in the database having the same ranking. In these situations, we present this information to the users as suggestions for related information. We think this is very useful because in many cases users ask one question but expect all related answers. For instance, when the user asks "Cho biết tổng số lao động năm 2003 là bao nhiêu?" (the number of employees in 2003), our system also suggests related information by providing "the number of employees in 2004".
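A small Python sketch of how tied rankings could be turned into one answer plus suggestions is given below. The paper does not specify how the primary answer is chosen among tied columns, so picking the first column in name order is purely an assumption; the column names and the function `answer_with_suggestions` are likewise hypothetical.

```python
def answer_with_suggestions(scores):
    """Return (best_column, tied_columns) from a {column: match_count} mapping."""
    if not scores:
        return None, []
    top_score = max(scores.values())
    # All columns sharing the top score; name order is an arbitrary tie-breaker.
    tied = sorted(column for column, score in scores.items() if score == top_score)
    return tied[0], tied[1:]


# Hypothetical match counts for an ambiguous question about employee numbers.
scores = {"employees_2003": 3, "employees_2004": 3, "sales_2003": 1}
best, suggestions = answer_with_suggestions(scores)
print(best, suggestions)
```

Here the tied column is returned as a suggestion rather than discarded, which corresponds to offering "the number of employees in 2004" alongside the 2003 figure.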

Our system ensures that users always get some answer. Even if the answer might not be totally correct, it is the system's best attempt to find the most relevant one.

2) R Engine: The R language provides a wide variety of statistical and graphical tools, including linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and others.

The task of the R Engine is to compute the statistical results for the user's question. The R Engine uses the columns in the survey database and the Question Type as parameters to compute the result. In addition, to improve the interaction between users and the system, the R Engine also calculates the results for the top-ranked columns so that related answers can be presented to users.
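As an illustration of how a Question Type and a column could be turned into an R statement, the Python sketch below builds the statement text from a template. The paper does not show its actual R statements, so the templates, the data frame name `survey`, the reading of "deviation" as standard deviation, and the function `to_r_statement` are all assumptions; the R functions used (`mean`, `sum`, `sd`, `is.na`) are standard base R.

```python
# Hypothetical mapping from question type to an R expression template.
R_TEMPLATES = {
    "average": "mean({df}${col}, na.rm = TRUE)",
    "sum": "sum({df}${col}, na.rm = TRUE)",
    "count": "sum(!is.na({df}${col}))",
    # Assumes "deviation" refers to the standard deviation.
    "deviation": "sd({df}${col}, na.rm = TRUE)",
}


def to_r_statement(question_type, column, data_frame="survey"):
    """Build the text of an R statement for one question type and one column."""
    try:
        template = R_TEMPLATES[question_type]
    except KeyError:
        raise ValueError(f"unsupported question type: {question_type}") from None
    return template.format(df=data_frame, col=column)


print(to_r_statement("average", "Q11"))
```

Generating one statement per top-ranked column in this way would also cover the related answers that the R Engine computes for suggestions.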

IV. EXPERIMENTAL RESULT

A. Data Setup

We use a large economic survey data set to test our NLI system. The survey was conducted on approximately three thousand small and medium-sized enterprises in Vietnam. The questionnaire includes many closed questions about the financial situation of a company, as well as questions on organizational methods, the number of employees, the salary status, etc. All participating companies fully completed the questionnaire.

For example, question Q11: "Cho biết số lượng công nhân của công ty?" (How many employees are there in your company?). Answer: người (employees).

The answers of each company to the questionnaire form a row in the database, and each question is a column. For the question above, the answers of all companies are stored in the column named Q11. The survey results are stored in a ".csv" file that contains 300 columns and 2975 rows.
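The row-per-company, column-per-question layout can be illustrated with a tiny in-memory CSV in Python. The sample values and the `company_id` column are invented for this sketch and are not taken from the survey; only the column name `Q11` comes from the text, and `column_values` is a hypothetical helper.

```python
import csv
import io
import statistics

# Invented sample: each row is one company, each column one question.
SAMPLE_CSV = "company_id,Q11\n1,12\n2,30\n3,18\n"


def column_values(csv_text, column):
    """Read one question column as numbers, skipping empty answers."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [float(row[column]) for row in reader if row[column] != ""]


employees = column_values(SAMPLE_CSV, "Q11")
# Average number of employees across the sample companies.
print(statistics.mean(employees))
```

Answering an "average" question about Q11 then reduces to selecting that column and applying one statistic to it, which is the role the R Engine plays in the system.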

Experimental setup

We collected 500 real questions from users who were seeking information from the survey. 300 questions are used in the training phase to build our system. The remaining 200 questions are used to evaluate the correctness of our system.

B. Evaluation and Results

Out of these 200 questions, 192 are correctly processed by the Question Analysis component, resulting in 96% accuracy. The 4% of errors are due to the lack of coverage of our JAPE grammars in capturing the question types as well as the Question Terms.

After testing the first component, we evaluate the overall correctness of the system. The results are shown in Table I.

Table I: Correctness of system

Number question Percent

Our system gives correct answers to 157 of the 200 test questions (78.5%), while 35 questions return incorrect results. This is because of ambiguity in the word segmentation process of VnTokenizer [10]. For example, consider the question "Doanh số trung bình các doanh nghiệp năm 2004?" (What is the average income of all companies in 2004?). After segmentation, we have the phrase set: "doanh-số" (sales), "trung-bình" (average), "các-doanh-nghiệp" (all enterprises), "năm-2004" (year 2004). VnTokenizer considers "năm-2004" (year 2004) to be a single phrase in Vietnamese. However, it should be two words: "năm" (year) and "2004". In this situation and in other cases, we find the correct answer among the related answers suggested by the system.
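The reported percentages can be checked directly from the question counts. The following is only arithmetic on the figures given in this section; the 17.5% share of incorrect answers is derived here and is not stated in the text.

```latex
\[
\frac{192}{200} = 96\%, \qquad
\frac{157}{200} = 78.5\%, \qquad
\frac{35}{200} = 17.5\%, \qquad
157 + 35 = 192 .
\]
```

The last equality shows that the correct and incorrect answers together account for exactly the 192 questions that passed the Question Analysis component.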

Figure 6: Correct answers with suggestions

Interesting Results

Our system is able to answer questions with a general meaning. When a user asks a general question such as "Hãy cho biết tổng doanh số theo năm?" (Calculate the sum of income in each year), our system returns all related results. In other words, it computes the income information for the years 2003 and 2004 stored in the survey database.

V. CONCLUSIONS

Question answering research attempts to deal with a wide range of question types. It requires more sophisticated techniques than traditional information retrieval tasks such as document retrieval.

In this paper, we propose an approach that combines Vietnamese language processing techniques with the R language to develop a complete Natural Language Interface system for individuals and businesses who want to obtain statistical information from economic surveys.

We introduce a Vietnamese natural language interface to a survey database. Our system consists of two components, namely Question Analysis and Result Computing. Experimental results of the system on a wide range of statistical questions are promising. Specifically, our system achieves an accuracy of 78.5%. In addition, the system can help users find related information in the survey database by suggesting related results.

In the future, we will improve the accuracy of the system by using Vietnamese grammar in analyzing questions, and we will include additional question types.

REFERENCES

[1] J. Fox and R. Andersen, “Using the R statistical computing environment to teach social statistics courses,” Department of Sociology, McMaster University, 2006.

[2] A. Vance, “Data analysts captivated by R’s power,” New York Times, 2009.

[3] G. G. Hendrix, E. D. Sacerdoti, D. Sagalowicz, and J. Slocum, “Developing a natural language interface to complex data,” ACM Transactions on Database Systems (TODS), 1987.

[4] E. Kaufmann, A. Bernstein, and R. Zumstein, “Querix: A natural language interface to query ontologies based on clarification dialogs,” in Proceedings of the 5th International Semantic Web Conference (ISWC 2006), Athens, GA, 2006.

[5] A. Popescu, O. Etzioni, and H. Kautz, “Towards a theory of natural language interfaces to databases,” in Proceedings of IUI, 2003.

[6] V. Lopez, V. Uren, E. Motta, and M. Pasin, “Aqualog: An ontology-driven question answering system for organizational semantic intranets,” Journal of Web Semantics, 5(2):72-105, Elsevier, 2007.

[7] T. H. Le and K. A. Nguyen, “Natural language interface construction using semantic grammars,” in Proceedings of PRICAI, 2008.

[8] V. M. Tra, V. D. Nguyen, O. T. Tran, U. T. T. Pham, and T. Q. Ha, “An experimental study of Vietnamese question answering system,” International Conference on Asian Language Processing (IALP), 2009.

[9] E. Agichtein, L. Gravano, J. Pavel, V. Sokolova, and A. Voskoboynik, “Snowball: A prototype system for extracting relations from large text collections,” ACM’s Special Interest Group on Management Of Data, 2001.

[10] D. D. Pham, G. B. Tran, and S. B. Pham, “A hybrid approach to Vietnamese Word Segmentation using Part of Speech tags,” International Conference on Knowledge and Systems Engineering (KSE), 2009.

[11] Y. Sasaki, “Question answering as question-biased term extraction: a new approach toward multilingual QA,” in Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics (ACL ’05), 2005.

[12] T. R. Lynam, C. L. A. Clarke, and G. V. Cormack, in Proceedings of the First International Conference on Human Language Technology Research (HLT ’01), 2001.

[13] N. J. McCracken, A. R. Diekema, G. Ingersoll, S. C. Harwell, E. E. Allen, O. Yilmazel, and E. D. Liddy, “Modeling reference interviews as a basis for improving automatic QA systems,” in Proceedings of the Interactive Question Answering Workshop at HLT-NAACL (IQA ’06), 2006.

[14] D. T. Nguyen and S. B. Pham, “Finding the semantic similarity in Vietnamese,” International Conference on Asian Language Processing (IALP), 2010.
