Learning to Transform Vietnamese Natural Language Queries into SQL Commands


Thi-Hai-Yen Vuong
University of Engineering & Technology
Vietnam National University
Hanoi, Vietnam
yenvth@vnu.edu.vn

Thi-Thu-Trang Nguyen
University of Engineering & Technology
Vietnam National University
Hanoi, Vietnam
15021317@vnu.edu.vn

Nhu-Thuat Tran
University of Engineering & Technology
Vietnam National University
Hanoi, Vietnam
thuattn@vnu.edu.vn

Le-Minh Nguyen
Information Science School
Japan Advanced Institute of Science & Technology
Ishikawa, Japan
nguyenml@jaist.ac.jp

Xuan-Hieu Phan
University of Engineering & Technology
Vietnam National University
Hanoi, Vietnam
hieupx@vnu.edu.vn

Abstract—In the field of data management, users traditionally manipulate their data using Structured Query Language (SQL). However, this method requires an understanding of relational databases, data schemas, and SQL syntax and semantics. Database manipulation using natural language is therefore much more convenient, since any ordinary user can interact with their data without a background in databases and SQL. This is difficult, however, because transforming natural language commands into SQL queries is a challenging task in natural language processing and understanding. In this paper, we propose a novel two-phase approach to automatically analyzing and converting natural language queries into the corresponding SQL forms. The first phase is component segmentation, which identifies the primary clauses in SQL such as SELECT, FROM, WHERE, and ORDER BY. The second phase is slot filling, which extracts the sub-components of each primary clause, such as the SELECT column(s) and the SELECT aggregation operation. We conducted an empirical evaluation of our method using conditional random fields (CRFs) on a medium-sized corpus of natural language queries in Vietnamese, and achieved promising results with an average accuracy of more than 90%.

Index Terms—Understanding natural language queries, transforming natural language queries into SQL queries

I. INTRODUCTION

Relational databases store a vast amount of data in most current information systems. A common way to access and manage the data in those databases is to use the Structured Query Language (SQL). However, this method requires an understanding of relational databases, data schemas, and SQL syntax and semantics. Database manipulation using natural language is therefore much more convenient, since any ordinary user can interact with their data without a background in databases and SQL. Figure 1 shows an example of the text-to-SQL generation task.

Fig. 1. Example of transforming a natural language query into an SQL query.

Various approaches have recently been proposed to solve this task. In early work, the components of the SQL query are identified manually using rules [11]. Other approaches formalize the text-to-SQL task as a machine translation problem [5], [8], [9]. Semantic parsing has also been applied to generate SQL query structures in the manner of a context-free grammar [14]. Most current approaches only handle very basic queries, such as those in the WikiSQL dataset [16]. However, a large number of queries have a complex structure and include optional components such as joins, GROUP BY clauses, and nested queries. An example, which returns the names of students whose Math scores are greater than the average Math score of all students in Ha Noi city, is as follows:

SELECT Ho_ten FROM Diem_thi WHERE diem_toan > (SELECT AVG(diem_toan)
FROM Diem_thi WHERE cum_thi = 'Ha Noi')

In our work, we categorize SQL queries into three levels based on the complexity of their structure: simple, medium, and complex. At the simple level, SQL queries contain only the primary components. For example, the following query returns the names of students whose Math scores are greater than 9: “SELECT Ho_ten FROM Diem_thi WHERE diem_toan > 9”.

At the medium level, transforming the input query into SQL requires additional knowledge. For example, the query below returns “the names of cities with more than 10 students whose total score is greater than 27”. Transforming this input query requires knowing that the “total score” is the sum of three scores (diem_toan, diem_li, and diem_hoa):

SELECT cum_thi FROM Diem_thi
WHERE (diem_toan + diem_li + diem_hoa) > 27
GROUP BY cum_thi
HAVING COUNT(*) > 10;

At the highest level, complex, the input query involves sophisticated components such as joined tables and nested queries (sub-queries). For example, the following query returns the most frequent Math score in Ha Noi city:

SELECT diem_toan FROM
(SELECT diem_toan, COUNT(*) AS so_thi_sinh
FROM Diem_thi
WHERE cum_thi = 'Ha Noi'
GROUP BY diem_toan
ORDER BY so_thi_sinh DESC
LIMIT 1) AS t
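The three example levels can be checked against a small in-memory table. The sketch below uses Python's built-in sqlite3 module; the rows are made up, the HAVING threshold is lowered from 10 to 1 because the toy table is tiny, and the queries are written with `=` for string comparison and an alias on the derived table so that they also run on engines stricter than SQLite. None of this data comes from the paper's corpus.

```python
import sqlite3

# Hypothetical rows: name, exam cluster (city), Math score, two other subject scores.
ROWS = [
    ("An", "Ha Noi", 9.5, 9.0, 9.0),
    ("Binh", "Ha Noi", 8.0, 7.0, 6.5),
    ("Chi", "Ha Noi", 8.0, 9.5, 10.0),
    ("Dung", "Da Nang", 9.5, 9.0, 9.5),
    ("Giang", "Da Nang", 6.0, 6.0, 6.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Diem_thi (Ho_ten TEXT, cum_thi TEXT, "
    "diem_toan REAL, diem_li REAL, diem_hoa REAL)"
)
conn.executemany("INSERT INTO Diem_thi VALUES (?, ?, ?, ?, ?)", ROWS)


def column(sql, params=()):
    """Run a query and return the first column of every result row."""
    return [row[0] for row in conn.execute(sql, params)]


# Simple level: only the primary components (SELECT, FROM, WHERE).
simple = sorted(column("SELECT Ho_ten FROM Diem_thi WHERE diem_toan > 9"))

# Medium level: "total score" has to be expanded into a sum of three columns.
medium = column(
    "SELECT cum_thi FROM Diem_thi "
    "WHERE (diem_toan + diem_li + diem_hoa) > 27 "
    "GROUP BY cum_thi HAVING COUNT(*) > ?",
    (1,),  # toy threshold; the paper's example uses 10
)

# Complex level: a nested query that finds the most frequent Math score.
most_frequent = column(
    "SELECT diem_toan FROM ("
    "SELECT diem_toan, COUNT(*) AS so_thi_sinh FROM Diem_thi "
    "WHERE cum_thi = 'Ha Noi' GROUP BY diem_toan "
    "ORDER BY so_thi_sinh DESC LIMIT 1) AS t"
)

print(simple, medium, most_frequent)
```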

Most current approaches can handle only part of the simple level and a small part of the complex level.

Analyzing a natural language query and converting it into an SQL query is challenging. First, SQL queries can have complex structures such as multiply nested queries. Second, the task depends not only on the input query but also on the database architecture, including the list of tables, the table structures, and the relations among tables.

In this paper, we propose a new approach for analyzing natural language queries and converting them into SQL queries. Our approach consists of two major phases: (1) a component segmentation phase, which identifies the primary clauses of an SQL query such as SELECT, WHERE, and GROUP BY, and (2) a slot-filling phase, which extracts the sub-components of each primary clause, such as the SELECT column(s) and the SELECT aggregation operation. We focus on natural language queries at the simple level. Our work has three main contributions:

• We propose a novel two-phase approach to analyzing and understanding natural language queries.

• We build machine learning models for the component segmentation and slot-filling problems using conditional random fields.

• We build a medium-sized dataset of Vietnamese natural language queries for evaluation and achieve promising results.

II. RELATED WORK

Semantic parsing. In semantic parsing and representation learning for sequence generation, natural language descriptions are parsed into logical forms [3].

For text-to-SQL as a sub-task of semantic parsing, earlier work focuses on specific databases [4], [7], [10], [12]. More recent research considers generalizing to new databases by incorporating user guidance [5]. Another direction incorporates the data in the table as an additional input [8], [9]. The limitations of these approaches are security and scalability issues when handling large-scale user databases. In 2017, Zhong et al. proposed Seq2SQL [16], a deep neural network combined with policy-based reinforcement learning. In the same year, Xu et al. proposed SQLNet to solve the ordering issue that Seq2SQL encountered [13]. In 2018, Yu et al. proposed TypeSQL, which builds on the architecture of SQLNet and formulates the task as a slot-filling problem [15].

Natural language interfaces for databases. One pioneering study is PRECISE [10], which maps the tokens of a query to column attributes and values in the database table. Giordani and Moschitti translate questions into SQL queries by first generating candidate queries from a grammar and then ranking them using tree kernels [4]. These approaches depend on the accuracy of the grammar and are not suitable for tasks that require generalization to new schemas. Iyer et al. use a neural sequence-to-sequence (Seq2Seq) model [5], whose limitations can be mitigated by adding human feedback.

III. OUR APPROACH

A. Analyzing and Understanding Natural Language Query

Figures 2 and 3 show a detailed example of the two phases and of the input and output at each step of the process.

Fig. 2. An example of the two phases in the task.

Fig. 3. The process of the two phases in the task.

In the component segmentation problem, the label set is $L_{cs}$ = {select, condition, group by, order by, other}, as shown in Table I. Given an input Vietnamese natural language query $x = (x_1, x_2, \ldots, x_n)$, component segmentation splits the input query into a list of SQL clauses $C_x = \{c_i(l^{cs}_i, s_i, e_i)\}$. For each component $c_i(l^{cs}_i, s_i, e_i) \in C_x$, $l^{cs}_i \in L_{cs}$ is the component type, and $s_i$ and $e_i$ are the positions of the start token and the end token of $c_i$ in $x$. In this work, we focus on natural language commands at the simple level, which means that components are non-overlapping. Component segmentation is therefore formalized as a sequence tagging problem.

TABLE I
CLAUSE TYPES

| Clause type | Short | Description |
|---|---|---|
| select | sel | Component that executes a query retrieving information from the database table. |
| condition | cond | Condition for filtering information in the database table. |
| group by | group by | Merges records with the same value in one or more columns. |
| order by | order by | Sorts results and performs two operations (max, min). |

Similarly, slot filling is also formalized as a sequence tagging problem. The label set in the slot-filling phase consists of four types, $L_{sf}$ = {column, aggregator, operator, value}, shown in Table II.

TABLE II
SLOT TYPES

| Slot type | Short | Description |
|---|---|---|
| column | col | Column name in a SELECT, WHERE, etc. clause. |
| aggregator | agg | Operations: sum, count, average, none. |
| operator | op | Comparison operations in the WHERE clause for text and numeric data types (datetime data are converted to numeric). |
| value | val | Values in the WHERE clause and the ORDER BY clause (desc, asc). |
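To make the role of the slots concrete, the sketch below shows one possible way to assemble a simple-level SQL string from filled slots. The paper does not describe this assembly step, so the function `assemble_sql`, its parameters, the explicit `table` argument, and the choice to join conditions with AND are all assumptions made for illustration.

```python
# Mapping from the aggregator slot values to SQL functions (an assumed mapping).
AGGREGATORS = {"sum": "SUM", "count": "COUNT", "average": "AVG", "none": None}


def assemble_sql(table, select_column, aggregator="none",
                 conditions=(), group_by=None, order_by=None):
    """Build a simple-level SQL string from slot values.

    conditions: iterable of (column, operator, value) triples.
    order_by:   optional (column, direction) pair, direction in {"asc", "desc"}.
    Values are concatenated as-is, so this is only suitable for illustration;
    a real system should use parameterized queries.
    """
    func = AGGREGATORS[aggregator]
    target = f"{func}({select_column})" if func else select_column
    parts = [f"SELECT {target}", f"FROM {table}"]
    if conditions:
        clauses = [f"{col} {op} {val}" for col, op, val in conditions]
        parts.append("WHERE " + " AND ".join(clauses))
    if group_by:
        parts.append(f"GROUP BY {group_by}")
    if order_by:
        col, direction = order_by
        parts.append(f"ORDER BY {col} {direction.upper()}")
    return " ".join(parts)


simple_query = assemble_sql(
    "Diem_thi", "Ho_ten", conditions=[("diem_toan", ">", "9")]
)
average_query = assemble_sql(
    "Diem_thi", "diem_toan", aggregator="average",
    conditions=[("cum_thi", "=", "'Ha Noi'")],
)
print(simple_query)
print(average_query)
```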

B. Building the Analyzing and Understanding Natural Language Query Model with Conditional Random Fields

For segmentation and slot-filling problems, linear-chain graphical models such as conditional random fields (CRFs) [6] and hidden Markov models (HMMs) [1] have proven effective because they encode the sequential dependencies between consecutive positions. We use CRF models to solve both problems.

We use the IOB format to represent labels for both the component segmentation task and the slot-filling task. In the component segmentation task, we define the set of class labels $L_{cs}$ = {select, condition, group by, order by, other}. B-<component type> marks the first token of a component, I-<component type> marks each subsequent token of that component, up to and including the last, and O marks tokens outside any component. Similarly, the set of slot-filling labels is $L_{sf}$ = {column, aggregator, operator, value}.
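The IOB encoding can be sketched as a pair of conversions between component spans (label, start, end) and per-token tags. The function names and the 0-based, inclusive token positions are assumptions for illustration, and the example tokens are a made-up query with diacritics omitted.

```python
def spans_to_iob(n_tokens, components):
    """Turn non-overlapping (label, start, end) spans into IOB tags.

    Positions are 0-based and inclusive (an assumption of this sketch).
    """
    tags = ["O"] * n_tokens
    for label, start, end in components:
        if any(tag != "O" for tag in tags[start:end + 1]):
            raise ValueError("components must not overlap")
        tags[start] = f"B-{label}"
        for i in range(start + 1, end + 1):
            tags[i] = f"I-{label}"
    return tags


def iob_to_spans(tags):
    """Recover (label, start, end) spans from IOB tags.

    An I- tag that does not continue an open span of the same label is ignored.
    """
    spans, current = [], None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if current:
                spans.append(tuple(current))
            current = [tag[2:], i, i]
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[2] = i
        else:
            if current:
                spans.append(tuple(current))
            current = None
    if current:
        spans.append(tuple(current))
    return spans


# Hypothetical query: "show the names of candidates with a Math score above 9".
tokens = ["cho_biet", "ten", "thi_sinh", "co", "diem_toan", "lon_hon", "9"]
components = [("select", 1, 2), ("condition", 3, 6)]
tags = spans_to_iob(len(tokens), components)
print(list(zip(tokens, tags)))
```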

Training a CRF model, that is, estimating its parameters, means searching for the optimal weight vector $\theta^* = (\lambda^*_1, \lambda^*_2, \ldots, \lambda^*_n)$. This is commonly done by maximizing the likelihood function using advanced convex optimization techniques. Recent studies have shown that L-BFGS is efficient for this purpose. The label sequence predicted for a new input $x$ is $y^* = \arg\max_{y} p_{\theta^*}(y \mid x)$, where the maximization is over label sequences drawn from the label set $L$.
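For a linear-chain model, the arg max over label sequences is usually computed with the Viterbi algorithm rather than by enumeration. The paper does not say which decoder it uses, so the following is only a minimal sketch: `emissions` and `transitions` are hypothetical per-position and label-to-label scores (standing in for the weighted feature sums of a trained CRF), and all numbers are invented.

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence.

    emissions:   list of {label: score}, one dict per token position.
    transitions: {(previous_label, label): score}; missing pairs score 0.
    Scores are additive (log-space), so maximizing the total score is
    equivalent to maximizing p(y | x) up to the normalization constant.
    """
    if not emissions:
        return []
    best = [{y: emissions[0].get(y, 0.0) for y in labels}]
    back = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels,
                       key=lambda p: best[-1][p] + transitions.get((p, y), 0.0))
            scores[y] = (best[-1][prev] + transitions.get((prev, y), 0.0)
                         + emissions[t].get(y, 0.0))
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


labels = ["O", "B-select", "I-select"]
emissions = [
    {"O": 1.0, "B-select": 0.2, "I-select": 0.0},
    {"O": 0.1, "B-select": 1.0, "I-select": 0.9},
    {"O": 0.3, "B-select": 0.2, "I-select": 0.8},
]
transitions = {
    ("O", "I-select"): -5.0,        # discourage I- directly after O
    ("B-select", "I-select"): 1.0,
    ("I-select", "I-select"): 0.5,
}
decoded = viterbi(emissions, transitions, labels)
print(decoded)
```

Note that the second token has a slightly higher path score as I-select when looked at in isolation, but the transition scores make B-select followed by I-select the best complete sequence.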

C. Feature Templates for Building the Analyzing and Understanding Natural Language Query Model

Feature selection is an important part of the analyzing and understanding natural language query model. The more label-specific characteristics the feature templates cover, the higher the accuracy of the model. We therefore design a variety of highly discriminative features, shown in Table III. The first is the contextual feature. We use a window to extract contextual information from the words around the current position: $\{w_{-n}, \ldots, w_{-2}, w_{-1}\}$ are the previous words, $w_0$ is the current word, and $\{w_1, w_2, \ldots, w_n\}$ are the next words, where $n$ is the window size. We set $n = 7$ for component segmentation and $n = 5$ for slot filling.
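The contextual window can be sketched as a function that collects the tokens at offsets -n to n around the current position. The feature-key format `w[offset]`, the lowercasing, and the decision to skip positions outside the sentence are assumptions of this sketch; the example tokens are hypothetical word-segmented Vietnamese with diacritics omitted.

```python
def context_features(tokens, i, n):
    """Collect window features w[-n] .. w[n] around position i.

    Offsets that fall outside the sentence are skipped.
    """
    feats = {}
    for offset in range(-n, n + 1):
        j = i + offset
        if 0 <= j < len(tokens):
            feats[f"w[{offset}]"] = tokens[j].lower()
    return feats


tokens = ["ten", "thi_sinh", "co", "diem_toan", "lon_hon", "9"]
window = context_features(tokens, 3, 2)
print(window)
```

With the window sizes reported above, the same function would be called with n=7 for component segmentation and n=5 for slot filling.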

TABLE III
FEATURE TEMPLATES TO TRAIN THE CRF MODELS

| Feature | Context predicate templates |
|---|---|
| Contextual feature | Left context: [w_{-n}], ..., [w_{-2}], [w_{-1}]. Current token: [w_0]. Right context: [w_1], [w_2], ..., [w_n]. |
| POS tag | Part-of-speech tag of the current token: [pos_0]. |
| Orthographic | Orthographic projection of the current token: [or_0]. |
| Dictionaries (is column name, is table name) | Text templates for matching database information. 1-token: [w_0]. 2-token: [w_{-1} w_0], [w_0 w_1]. 3-token: [w_{-2} w_{-1} w_0], [w_0 w_1 w_2]. |
| Component (in select, in condition, in group by, in order by) | Type of the component that the current token is in: [l^{cs}_0], only in the slot-filling model. |

In addition to text content, these models use the part-of-speech (POS) tags of words. For languages like Vietnamese, word boundaries must first be identified. Hence, the models actually use three kinds of information: word tokens, word orthography, and the POS tags of segmented words. These add richer features to the models and therefore help achieve better component segmentation performance. We also use dictionary and clause information as look-up features: is column name and is table name. The slot-filling model has one additional feature, the component type, which is the label produced by the previous phase: in select, in condition, in group by, and in order by.
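The look-up and component features can be sketched as follows. The 1-, 2- and 3-token templates follow Table III, but everything else is an assumption: the function names, the orthographic categories (the paper does not define its orthographic projection), the dictionary contents, and the example tokens. POS tags are left out because they require an external tagger.

```python
def orthographic(token):
    """Coarse orthographic class of a token (hypothetical categories)."""
    if token.isdigit():
        return "digit"
    if token.isupper():
        return "all_caps"
    if token[:1].isupper():
        return "init_cap"
    if token.islower():
        return "lower"
    return "other"


def dictionary_features(tokens, i, column_names, table_names):
    """Match 1-, 2- and 3-token phrases that start or end at position i."""
    feats = {"is_column_name": False, "is_table_name": False}
    for size in (1, 2, 3):
        for start in {i - size + 1, i}:
            if start < 0 or start + size > len(tokens):
                continue
            phrase = " ".join(tokens[start:start + size]).lower()
            feats["is_column_name"] |= phrase in column_names
            feats["is_table_name"] |= phrase in table_names
    return feats


def slot_filling_features(tokens, i, component, column_names, table_names):
    """Features for one token in the slot-filling model.

    `component` is the label predicted by the component segmentation phase.
    """
    feats = dictionary_features(tokens, i, column_names, table_names)
    feats["or[0]"] = orthographic(tokens[i])
    feats["component"] = component
    return feats


# Hypothetical, unsegmented syllables and dictionary entries.
tokens = ["ten", "thi", "sinh", "co", "diem", "toan", "lon", "hon", "9"]
columns = {"diem toan", "ho ten"}
tables = {"diem thi"}
feats = slot_filling_features(tokens, 4, "condition", columns, tables)
print(feats)
```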


IV. EVALUATION

A. Experimental Data

To evaluate the proposed method, we asked annotators to annotate a dataset of Vietnamese natural language queries. We obtained a medium-sized dataset consisting of 1258 queries on three databases: high school final test scores, flights, and books. Figures 4 and 5 show statistics of the dataset, including the number of samples for each component and slot label and their proportions in the entire dataset.

Fig. 4. Label statistics in the component segmentation phase.

Fig. 5. Label statistics in the slot-filling phase.

We divide the dataset into 5 folds with train/test splits and report the results of the best model for each phase.
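A 5-fold split of the 1258 queries can be sketched as below. The paper does not state how the folds were formed, so the shuffling, the fixed seed, and the near-equal fold sizes are assumptions of this sketch.

```python
import random


def k_fold_splits(n_samples, k=5, seed=0):
    """Return k (train_indices, test_indices) pairs over range(n_samples)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, folds[i]))
    return splits


splits = k_fold_splits(1258, k=5)
for train, test in splits:
    print(len(train), len(test))
```

Each query appears in exactly one test fold, so every query is used for testing once and for training four times.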

B. Experimental Results and Analysis

To assess the performance of the proposed CRF models, we also built HMM and support vector machine (SVM) models [1], [2] with similar feature selection and analyzed the experimental results carefully.

The experimental results show that CRFs achieve the best results among the three models in both the component segmentation phase and the slot-filling phase, as can be seen in Figures 6 and 7. Predicting the most likely output label sequence in CRFs is based not only on the current observation and current state but also on past and future observations and states, which is why CRFs outperform HMMs. SVMs achieve the lowest results among the three models because they do not consider state-to-state and observation-to-state dependencies as CRFs and HMMs do. Furthermore, CRFs propagate the probability of a state sequence given the observed sequence, which mitigates this issue, whereas SVMs only separate the data into categories by mapping the data points onto an optimal linear separating hyperplane.

Fig. 6. Precision, recall and F1 score of the CRF, HMM and SVM models in the component segmentation phase.

Fig. 7. Precision, recall and F1 score of the CRF, HMM and SVM models in the slot-filling phase.

Table IV shows the performance for each label in the component segmentation phase. The micro-averaged F1-score is 93.48%, which means that this feature selection achieves a high level of accuracy. Table IV also indicates that accuracy for select and group by is higher than for condition and order by. This is understandable because WHERE and ORDER BY clauses are more ambiguous.

TABLE IV
PRECISION, RECALL AND F1-SCORE OF THE COMPONENT SEGMENTATION MODEL WITH CRFS

| Type | Precision | Recall | F1-score |
|---|---|---|---|
| group by | 97.50 | 97.50 | 97.50 |
| order by | 88.89 | 90.91 | 89.89 |
| Micro average | 93.43 | 93.54 | 93.48 |

The results reported in Table V show the precision, recall and F1-score of each slot type. The slot-filling phase achieves high performance, with a 91.9% F1-score. The aggregator and operator slots perform well because an aggregator only belongs to the SELECT clause and an operator only belongs to the WHERE clause. In contrast, the column and value information in an SQL query can belong to many different clauses, such as the SELECT, GROUP BY, and ORDER BY clauses, which leads to ambiguity in predicting the precise clause for the column information. Their performance is therefore lower and less stable.

TABLE V
PRECISION, RECALL AND F1-SCORE OF THE SLOT-FILLING MODEL WITH CRFS

| Slot | Precision | Recall | F1-score |
|---|---|---|---|
| sel-col | 87.23 | 80.39 | 83.67 |
| cond-col | 93.56 | 92.31 | 92.93 |
| group by-col | 97.50 | 97.50 | 97.50 |
| order by-col | 87.76 | 87.76 | 87.7 |
| sel-agg | 96.91 | 98.95 | 97.92 |
| cond-op | 93.75 | 96.15 | 94.94 |
| cond-val | 91.81 | 93.08 | 92.94 |
| order by-val | 92.31 | 87.80 | 90.90 |
| Micro average | 92.62 | 91.18 | 91.90 |
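The metrics in the tables can be related to each other with a short sketch. F1 is the harmonic mean of precision and recall, and micro-averaging pools the true positive, false positive and false negative counts over all labels before computing the metrics. The function names and the per-label counts below are invented for illustration; only the precision and recall values passed to `f1_score` are taken from the reported tables.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


def micro_average(counts):
    """Pool (tp, fp, fn) counts over all labels, then compute the metrics."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return precision_recall_f1(tp, fp, fn)


# F1 recomputed from reported precision and recall values.
order_by_f1 = f1_score(88.89, 90.91)   # order by row, component segmentation
sel_col_f1 = f1_score(87.23, 80.39)    # sel-col row, slot filling
print(round(order_by_f1, 2), round(sel_col_f1, 2))

# Hypothetical counts, only to show how micro-averaging pools labels.
toy_counts = {"sel-agg": (90, 10, 5), "cond-op": (40, 0, 5)}
micro_p, micro_r, micro_f1 = micro_average(toy_counts)
print(micro_p, micro_r, micro_f1)
```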

Figure 8 presents some examples of model predictions and ground truth results. At the simple level, where queries contain only non-overlapping components, our model segments SQL clauses and extracts sub-components quite accurately, as in the first and second examples. At higher levels, as in the third example, our model predicts “Khoi A” (Combination A) as a column, which is incorrect. In this case, domain knowledge is needed to understand “Combination A”. Additionally, our model cannot cover complex queries that contain joined tables and nested queries.

V. CONCLUSION

In this work, we propose a new two-phase approach for converting Vietnamese natural language queries into Structured Query Language. The results of this research may help both experienced and inexperienced users manage databases more easily. In our approach, both phases are formalized as sequence tagging problems, to which well-known machine learning methods can readily be applied. Within the scope of this research, the CRF model outperforms the HMM and SVM models. The approach handles natural language queries at the simple level well, with micro-averaged F1-scores above 90% in both phases. As a next step, the approach can be developed further to handle natural language queries at the medium and complex levels.

Fig. 8. Examples of predictions by the model and ground truth results. Q denotes the natural language query. L and L' denote the ground truth labels and the labels produced by the model. S and S' denote the ground truth SQL query and the SQL query produced by the model.

REFERENCES

[1] Baum, L.E. and Petrie, T., 1966. “Statistical inference for probabilistic functions of finite state Markov chains.” The Annals of Mathematical Statistics, 37(6), pp. 1554-1563.

[2] Cortes, C. and Vapnik, V., 1995. “Support-vector networks.” Machine Learning, 20(3), pp. 273-297.

[3] Dong, L. and Lapata, M., 2016. “Language to Logical Form with Neural Attention.” CoRR, abs/1601.01280.

[4] Giordani, A. and Moschitti, A., 2012. “Translating questions to SQL queries with generative parsers discriminatively reranked.” Proceedings of COLING 2012: Posters, pp. 401-410.

[5] Iyer, S., Konstas, I., Cheung, A., Krishnamurthy, J. and Zettlemoyer, L.S., 2017. “Learning a Neural Semantic Parser from User Feedback.” ACL.

[6] Lafferty, J., McCallum, A. and Pereira, F.C., 2001. “Conditional random fields: Probabilistic models for segmenting and labeling sequence data.”

[7] Li, Y., Yang, H. and Jagadish, H.V., 2006. “Constructing a generic natural language interface for an XML database.” In EDBT, Vol. 3896, pp. 737-754.

[8] Mou, L., Lu, Z., Li, H. and Jin, Z., 2017. “Coupling Distributed and Symbolic Execution for Natural Language Queries.” ICLR.

[9] Pasupat, P. and Liang, P.S., 2015. “Compositional Semantic Parsing on Semi-Structured Tables.” ACL.

[10] Popescu, A.M., Etzioni, O. and Kautz, H., 2003. “Towards a theory of natural language interfaces to databases.” In Proceedings of the 8th International Conference on Intelligent User Interface, pp. 149-157. ACM.

[11] Stratica, N., Kosseim, L. and Desai, B.C., 2005. “Using semantic templates for a natural language interface to the CINDI virtual library.” Data & Knowledge Engineering, 55(1), pp. 4-19.

[12] Wang, C., Cheung, A. and Bodik, R., 2017. “Synthesizing highly expressive SQL queries from input-output examples.” In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 452-466. ACM.

[13] Xu, X., Liu, C. and Song, D., 2017. “Sqlnet: Generating structured queries from natural language without reinforcement learning.”

[14] Yaghmazadeh, N., Wang, Y., Dillig, I. and Dillig, T., 2017. “SQLizer: query synthesis from natural language.” Proceedings of the ACM on Programming Languages, 1(OOPSLA), p. 63.

[15] Yu, T., Li, Z., Zhang, Z., Zhang, R. and Radev, D., 2018. “Typesql: Knowledge-based type-aware neural text-to-sql generation.”

[16] Zhong, V., Xiong, C. and Socher, R., 2017. “Seq2sql: Generating structured queries from natural language using reinforcement learning.”
