Learning to Transform Vietnamese Natural Language Queries into SQL Commands
Thi-Hai-Yen Vuong
University of Engineering & Technology
Vietnam National University
Hanoi, Vietnam
yenvth@vnu.edu.vn

Thi-Thu-Trang Nguyen
University of Engineering & Technology
Vietnam National University
Hanoi, Vietnam
15021317@vnu.edu.vn

Nhu-Thuat Tran
University of Engineering & Technology
Vietnam National University
Hanoi, Vietnam
thuattn@vnu.edu.vn

Le-Minh Nguyen
Information Science School
Japan Advanced Institute of Science & Technology
Ishikawa, Japan
nguyenml@jaist.ac.jp

Xuan-Hieu Phan
University of Engineering & Technology
Vietnam National University
Hanoi, Vietnam
hieupx@vnu.edu.vn

Abstract: In the field of data management, users traditionally manipulate their data using Structured Query Language (SQL). However, this method requires an understanding of relational databases, the data schema, and SQL syntax as well as the way it works. Database manipulation using natural language is therefore much more convenient, since any ordinary user can interact with their data without a background in databases and SQL. This is difficult, however, because transforming natural language commands into SQL queries is a challenging task in natural language processing and understanding. In this paper, we propose a novel two-phase approach to automatically analyzing natural language queries and converting them into the corresponding SQL forms. In our approach, the first phase is component segmentation, which identifies primary clauses in SQL such as SELECT, FROM, WHERE, ORDER BY, etc. The second phase is slot-filling, which extracts the sub-components of each primary clause, such as SELECT column(s), the SELECT aggregation operation, etc. We carefully conducted an empirical evaluation of our method using conditional random fields (CRFs) on a medium-sized corpus of natural language queries in Vietnamese, and achieved promising results with an average accuracy of more than 90%.

Index Terms: Understanding natural language query, transform natural language query to SQL query
I. INTRODUCTION
Relational databases store a vast amount of data in most current information systems. A common way to access and manage the data in those databases is to use the Structured Query Language (SQL). However, this method requires an understanding of relational databases, the data schema, and SQL syntax as well as the way it works. Database manipulation using natural language is therefore much more convenient, since any ordinary user can interact with their data without a background in databases and SQL. Figure 1 shows an example of the text-to-SQL generation task.
Fig. 1. Example of transforming a natural language query into an SQL query.

Recently, various approaches have been proposed to solve this task. In the earliest, the components of the SQL query are identified manually using rules [11]. Other approaches formalize the text-to-SQL task as a machine translation problem [5], [8], [9]. Semantic parsing has also been applied to generate the SQL query structure in the manner of a context-free grammar [14]. Most current approaches handle only very basic queries, such as those in the WikiSQL dataset [16]. However, a large number of queries have a complex structure that includes optional components such as joins, GROUP BY clauses, and nested queries. For example, the following query returns the names of students whose Math scores are greater than the average score of all students in Ha Noi city:

SELECT Ho_ten FROM Diem_thi
WHERE diem_toan > (SELECT AVG(diem_toan)
                   FROM Diem_thi WHERE cum_thi = 'Ha Noi')

In our work, we categorize SQL queries into three levels based on the complexity of the query structure: simple, medium, and complex. At the simple level, SQL queries contain only the primary components. For example, the following query returns the names of students whose Math scores are greater than 9:

SELECT Ho_ten FROM Diem_thi WHERE diem_toan > 9
978-1-7281-3003-3/19/$31.00 ©2018 IEEE

At the medium level, transforming the input query into SQL requires additional knowledge to analyze and understand it. For example, the request for “the names of cities with more than 10 students whose scores are greater than 27 in total score” is expressed by the query below. Transforming this input query requires the knowledge that “total score” is the sum of three scores (diem_toan, diem_li and diem_hoa).
SELECT cum_thi FROM Diem_thi
WHERE (diem_toan + diem_li + diem_hoa) > 27
GROUP BY cum_thi
HAVING COUNT(*) > 10;
At the highest level, complex, the input query includes sophisticated components such as joined tables, nested queries, and sub-queries. For example, the following query returns the most frequent Math score in Ha Noi city:
SELECT diem_toan FROM
  (SELECT diem_toan, COUNT(*) AS so_thi_sinh
   FROM Diem_thi
   WHERE cum_thi = 'Ha Noi'
   GROUP BY diem_toan
   ORDER BY so_thi_sinh DESC
   LIMIT 1) AS t;
Most current approaches can handle only part of the simple level and a small part of the complex level.
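The example queries above can be tried out on a small in-memory table. The following is a minimal sketch using Python's built-in sqlite3 module; the rows are invented for illustration, the queries are written in a form that SQLite accepts, and the student-count threshold of the medium-level query is lowered to 1 so that the toy data produces a result.

```python
import sqlite3

# Toy stand-in for the exam-score table; these rows are invented.
ROWS = [
    ("An", "Ha Noi", 9.5, 9.0, 9.0),
    ("Binh", "Ha Noi", 8.0, 7.0, 6.5),
    ("Chi", "Ha Noi", 8.0, 9.5, 9.75),
    ("Dung", "Hue", 9.25, 9.5, 9.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Diem_thi (Ho_ten TEXT, cum_thi TEXT, "
    "diem_toan REAL, diem_li REAL, diem_hoa REAL)"
)
conn.executemany("INSERT INTO Diem_thi VALUES (?, ?, ?, ?, ?)", ROWS)


def run(sql):
    """Execute a query and return all rows as a list of tuples."""
    return conn.execute(sql).fetchall()


# Simple level: only primary components.
simple = run("SELECT Ho_ten FROM Diem_thi WHERE diem_toan > 9")

# Medium level: "total score" must be expanded to a sum of three columns.
# The threshold is 1 here (10 in the text) because the toy table is tiny.
medium = run(
    "SELECT cum_thi FROM Diem_thi "
    "WHERE (diem_toan + diem_li + diem_hoa) > 27 "
    "GROUP BY cum_thi HAVING COUNT(*) > 1"
)

# Complex level: a sub-query in the FROM clause.
complex_ = run(
    "SELECT diem_toan FROM ("
    "  SELECT diem_toan, COUNT(*) AS so_thi_sinh FROM Diem_thi"
    "  WHERE cum_thi = 'Ha Noi'"
    "  GROUP BY diem_toan ORDER BY so_thi_sinh DESC LIMIT 1) AS t"
)

# Nested query in the WHERE clause: Math score above the Ha Noi average.
nested = run(
    "SELECT Ho_ten FROM Diem_thi WHERE diem_toan > "
    "(SELECT AVG(diem_toan) FROM Diem_thi WHERE cum_thi = 'Ha Noi')"
)
```

The sketch only illustrates what each level asks of the SQL engine; it says nothing about how the natural language side is analyzed.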
Analyzing natural language queries and translating them into SQL queries are challenging tasks. First, an SQL query can have a complex structure, such as multiple levels of nesting. Second, the task depends not only on the input query but also on the database architecture, including the list of tables, the table structures, and the relations among tables.
In this paper, we propose a new approach for analyzing natural language queries and converting them into SQL queries. Our approach consists of two major phases: (1) a component segmentation phase, which identifies the primary clauses of an SQL query such as SELECT, WHERE, GROUP BY, etc., and (2) a slot-filling phase, which extracts the sub-components of each primary clause, such as SELECT column(s), the SELECT aggregation operation, etc. We focus on natural language queries at the simple level. Our work has three main contributions:
• We propose a novel two-phase approach to analyzing and understanding natural language queries.
• We build machine learning models to solve the component segmentation problem and the slot-filling problem using conditional random fields.
• We build a medium-sized dataset of Vietnamese natural language queries for evaluation and achieve promising results.
II. RELATED WORK
Semantic parsing. In semantic parsing, which draws on representation learning for sequence generation, natural language descriptions are parsed into logical forms [3].
Text-to-SQL is a sub-task of semantic parsing, and earlier work focuses on specific databases [4], [7], [10], [12]. Recent research considers generalizing to new databases by incorporating user guidance [5]. Another direction incorporates the data in the table as an additional input [8], [9]. The limitations of these approaches are security and scalability issues when handling large-scale user databases. In 2017, Zhong et al. proposed Seq2SQL [16], a deep neural network combined with policy-based reinforcement learning. In the same year, Xu et al. proposed SQLNet to solve the ordering issue that Seq2SQL encountered [13]. In 2018, Yu et al. proposed TypeSQL, which builds on the architecture of SQLNet and formulates the task as a slot-filling problem [15].
Natural language interface for databases. One pioneering study is PRECISE [10], which maps the tokens of a query to column attributes and values in the database table. Giordani and Moschitti translate questions into SQL queries by first generating candidate queries from a grammar and then ranking them using tree kernels [4]. These approaches depend on the accuracy of the grammar and are not suitable for tasks that require generalization to new schemas. Iyer et al. use a neural sequence-to-sequence (Seq2Seq) model [5]. The limitations of the Seq2Seq model can be overcome by adding human feedback.
III. OUR APPROACH
A. Analyzing and Understanding Natural Language Query
Figures 2 and 3 show a detailed example of the two phases and the input/output of the process.
Fig. 2. An example of the two phases in the task.
Fig. 3. The process of the two phases in the task.
In the component segmentation problem, the set of labels is $L_{cs}$ = {select, condition, group by, order by, other}, as shown in Table I. Given an input Vietnamese natural language query $x = (x_1, x_2, \ldots, x_n)$, component segmentation splits the query into a list of SQL clauses $C_x = \{c_i(l^{cs}_i, s_i, e_i)\}$. For each component $c_i(l^{cs}_i, s_i, e_i) \in C_x$, $l^{cs}_i \in L_{cs}$ is the component type, and $s_i$ and $e_i$ are the positions of the start token and the end token of $c_i$ in $x$. In this work, we focus on the simple level of natural language commands, which means that components are non-overlapping. Component segmentation is formalised as a sequence tagging problem.
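The component representation above can be sketched as a small data structure. This is an illustrative sketch only: the class and function names are hypothetical, positions are 0-based here (the text indexes tokens from 1), and the example query is an English stand-in for a Vietnamese one.

```python
from dataclasses import dataclass

# Component types L_cs as listed in the text.
LABELS_CS = {"select", "condition", "group by", "order by", "other"}


@dataclass(frozen=True)
class Component:
    label: str  # component type, an element of LABELS_CS
    start: int  # index of the first token of the component in x
    end: int    # index of the last token of the component (inclusive)


def is_valid_segmentation(components, n_tokens):
    """Check labels, bounds, and the non-overlap assumption of the simple level."""
    last_end = -1
    for c in sorted(components, key=lambda c: c.start):
        if c.label not in LABELS_CS:
            return False
        if not (0 <= c.start <= c.end < n_tokens):
            return False
        if c.start <= last_end:  # overlaps the previous component
            return False
        last_end = c.end
    return True


# Hypothetical English stand-in for an input query x.
x = "show names of students whose math score is above 9".split()
c_x = [Component("select", 0, 3), Component("condition", 4, 9)]
overlapping = [Component("select", 0, 4), Component("condition", 4, 9)]
```

Here `is_valid_segmentation(c_x, len(x))` holds, while the `overlapping` list is rejected because token 4 would belong to two components.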
TABLE I
CLAUSE TYPES

Clause type   Short      Description
select        sel        Executes a query that retrieves information from the database table.
condition     cond       Condition for filtering information in the database table.
group by      group by   Merges records with the same value in one or more columns.
order by      order by   Sorts results and performs two operations (max, min).
Similarly, slot-filling is also formalised as a sequence tagging problem. The list of labels in the slot-filling phase consists of four types, $L_{sf}$ = {column, aggregator, operator, value}, shown in Table II.
TABLE II
SLOT TYPES

Slot type    Short   Description
column       col     Column name in the SELECT, WHERE, etc. clauses.
aggregator   agg     Operations: sum, count, average, none.
operator     op      Comparison operations in the WHERE clause for text and numeric data types (datetime data are converted to numeric).
value        val     Values in the WHERE clause and the ORDER BY clause (desc, asc).
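To make the two label sets concrete, the sketch below shows one way that filled slots could be assembled into a simple-level SQL string. The text does not describe this step, so everything here is an assumption: the function name `assemble_sql`, the (clause, slot type, text) triple format, and the premise that slot texts have already been mapped to schema column names, operator symbols, and literal values. Values are inserted verbatim, which is unsuitable for untrusted input.

```python
# Mapping from aggregator slot values to SQL functions; "none" means no function.
AGG_TO_SQL = {"sum": "SUM", "count": "COUNT", "average": "AVG"}


def assemble_sql(table, slots):
    """Build a simple-level SQL string from (clause, slot_type, text) triples."""
    by_key = {}
    for clause, slot_type, text in slots:
        by_key.setdefault((clause, slot_type), []).append(text)

    # SELECT: pair each column with the aggregator at the same position, if any.
    aggs = by_key.get(("sel", "agg"), [])
    select_items = []
    for i, col in enumerate(by_key.get(("sel", "col"), [])):
        agg = aggs[i] if i < len(aggs) else "none"
        if agg in AGG_TO_SQL:
            select_items.append(f"{AGG_TO_SQL[agg]}({col})")
        else:
            select_items.append(col)
    sql = f"SELECT {', '.join(select_items) or '*'} FROM {table}"

    # WHERE: one (column, operator, value) triple per condition, joined by AND.
    conditions = zip(
        by_key.get(("cond", "col"), []),
        by_key.get(("cond", "op"), []),
        by_key.get(("cond", "val"), []),
    )
    where = " AND ".join(f"{c} {o} {v}" for c, o, v in conditions)
    if where:
        sql += f" WHERE {where}"

    group_cols = by_key.get(("group by", "col"), [])
    if group_cols:
        sql += " GROUP BY " + ", ".join(group_cols)

    order_cols = by_key.get(("order by", "col"), [])
    if order_cols:
        direction = by_key.get(("order by", "val"), ["asc"])[0].upper()
        sql += f" ORDER BY {', '.join(order_cols)} {direction}"
    return sql


example = assemble_sql(
    "Diem_thi",
    [("sel", "col", "Ho_ten"), ("cond", "col", "diem_toan"),
     ("cond", "op", ">"), ("cond", "val", "9")],
)
```

With these inputs, `example` is the simple-level query from the introduction, `SELECT Ho_ten FROM Diem_thi WHERE diem_toan > 9`.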
B. Building the Analyzing and Understanding Natural Language Query Model with Conditional Random Fields
For segmentation and slot-filling problems, linear-chain graphical models such as conditional random fields (CRFs) [6] and hidden Markov models (HMMs) [1] have proven effective because they encode the sequential dependencies between consecutive positions. We use CRFs to solve both problems.
We use the IOB format to represent labels for both the component segmentation task and the slot-filling task. In the component segmentation task, the set of class labels is $L_{cs}$ = {select, condition, group by, order by, other}. B-<component type> marks the first token of a component, I-<component type> marks each subsequent token of that component up to and including the last, and O marks tokens outside any component. Similarly, the set of slot-filling labels is $L_{sf}$ = {column, aggregator, operator, value}.
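The IOB scheme can be illustrated with a pair of conversion helpers between spans and per-token tags. This is a minimal sketch with hypothetical function names; spans are (label, start, end) triples with inclusive 0-based indices, and the decoder tolerates an I- tag that has no preceding B- tag by opening a new span, which is a design choice of this sketch rather than something stated in the text.

```python
def spans_to_iob(n_tokens, spans):
    """Encode non-overlapping (label, start, end) spans as IOB tags."""
    tags = ["O"] * n_tokens
    for label, start, end in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end + 1):
            tags[i] = f"I-{label}"
    return tags


def iob_to_spans(tags):
    """Decode IOB tags back into (label, start, end) spans."""
    spans = []
    label, start = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" flushes the last span
        continues = tag.startswith("I-") and label == tag[2:]
        if label is not None and not continues:
            spans.append((label, start, i - 1))
            label, start = None, None
        if tag.startswith("B-"):
            label, start = tag[2:], i
        elif tag.startswith("I-") and label is None:
            # Tolerate an I- tag without a matching B- by opening a new span.
            label, start = tag[2:], i
    return spans


spans = [("select", 0, 1), ("condition", 3, 5)]
tags = spans_to_iob(6, spans)
```

For the example above, `tags` is `B-select, I-select, O, B-condition, I-condition, I-condition`, and decoding it returns the original spans.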
Training, that is, estimating the parameters of a CRF model, means searching for the optimal weight vector $\theta^{*} = (\lambda_1^{*}, \lambda_2^{*}, \ldots, \lambda_n^{*})$, which is commonly done by maximizing the likelihood function with convex optimization techniques. Recent studies have shown that L-BFGS is efficient for this purpose. The predicted label sequence for a new input $x$ is $y^{*} = \arg\max_{y \in L} p_{\theta^{*}}(y \mid x)$, where $L$ is the set of candidate label sequences.
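The text does not say how the argmax over label sequences is computed. For linear-chain models the standard exact method is the Viterbi algorithm, sketched below under that assumption. The scores are hand-set, unnormalized log-potentials chosen for illustration, not trained CRF weights, and the function name is hypothetical.

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence of a linear-chain model.

    emissions:   list with one dict per token, mapping label -> score
    transitions: dict mapping (previous_label, label) -> score
    Missing entries count as a score of 0.
    """
    best = [{y: emissions[0].get(y, 0.0) for y in labels}]
    back = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for y in labels:
            # Best predecessor for label y at position t.
            prev = max(
                labels,
                key=lambda p: best[-1][p] + transitions.get((p, y), 0.0),
            )
            scores[y] = (
                best[-1][prev]
                + transitions.get((prev, y), 0.0)
                + emissions[t].get(y, 0.0)
            )
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)

    # Follow the back-pointers from the best final label.
    y = max(labels, key=lambda label: best[-1][label])
    path = [y]
    for pointers in reversed(back):
        y = pointers[y]
        path.append(y)
    return path[::-1]


labels = ["B-select", "I-select", "O"]
emissions = [{"O": 1.0, "B-select": 0.9}, {"I-select": 2.0, "O": 0.5}]
transitions = {("O", "I-select"): -5.0}  # discourage I- directly after O
decoded = viterbi(emissions, transitions, labels)
```

Choosing the best label per token independently would give `O, I-select`, which is not a well-formed IOB sequence; with the transition penalty, the sequence-level argmax is `B-select, I-select`. This dependence between neighbouring labels is what sequence models exploit.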
C. Feature Templates for Building the Analyzing and Understanding Natural Language Query Model
Feature selection is an important part of the analyzing and understanding natural language query model. The more label-specific characteristics the feature templates cover, the higher the accuracy of the model. Therefore, we design a variety of highly discriminative features, shown in Table III. The first is the contextual feature. We use a window to extract contextual information from the words around the current position: $\{w_{-n}, \ldots, w_{-2}, w_{-1}\}$ are the previous words, $w_0$ is the current word, and $\{w_1, w_2, \ldots, w_n\}$ are the next words, where $n$ is the window size. We set $n$ to 7 for component segmentation and to 5 for slot-filling.
TABLE III
FEATURE TEMPLATES TO TRAIN THE CRFS MODEL

Feature group        Description / template                              Context predicate templates
Contextual feature   Left context                                        [w_{-n}], ..., [w_{-2}], [w_{-1}]
                     Current token                                       [w_0]
                     Right context                                       [w_1], [w_2], ..., [w_n]
POS tag              Part-of-speech tag of the current token             [pos_0]
Orthographic         Orthographic projection of the current token        [or_0]
Dictionaries         Text templates for matching database information    1-token: [w_0]
                     (is column name, is table name)                     2-token: [w_{-1} w_0], [w_0 w_1]
                                                                         3-token: [w_{-2} w_{-1} w_0], [w_0 w_1 w_2]
Component            Type of component which the current token is in     [l^{cs}_0], only in the slot-filling model
                     (in select, in condition, in group by, in order by)
In addition to text content, the models use the part-of-speech (POS) tags of words. For languages like Vietnamese, word boundaries must first be identified. Hence, the models actually use three kinds of information: word tokens, word orthography, and the POS tags of segmented words. These add richer features to the models and therefore help achieve better component segmentation performance. We also use dictionary and clause information for look-up features: is column name and is table name. The slot-filling model has one additional feature, the component type, which is the label information from the previous phase: in select, in condition, in group by, and in order by.
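The feature templates described above can be sketched as a per-token feature function. This is an illustration under several assumptions: the feature names, the function signatures, and the character-class mapping used for the orthographic projection are all hypothetical, since the text does not define them, and the tokens and POS tags in the example are made up.

```python
def orthographic(token):
    """Assumed projection: uppercase -> 'A', lowercase -> 'a', digit -> '0'."""
    out = []
    for ch in token:
        if ch.isdigit():
            out.append("0")
        elif ch.isalpha():
            out.append("A" if ch.isupper() else "a")
        else:
            out.append(ch)
    return "".join(out)


def token_features(tokens, pos_tags, i, window,
                   column_names, table_names, component=None):
    """Features for position i, following the groups of the feature table."""
    n = len(tokens)
    feats = {}

    # Contextual features: words inside the window around position i.
    for k in range(-window, window + 1):
        if 0 <= i + k < n:
            feats[f"w[{k}]"] = tokens[i + k].lower()

    feats["pos[0]"] = pos_tags[i]
    feats["or[0]"] = orthographic(tokens[i])

    # Dictionary look-ups on 1-, 2- and 3-token phrases that end or start at i.
    for size in (1, 2, 3):
        for start in (i - size + 1, i):
            if start < 0 or start + size > n:
                continue
            phrase = " ".join(tokens[start:start + size]).lower()
            if phrase in column_names:
                feats["is_column_name"] = True
            if phrase in table_names:
                feats["is_table_name"] = True

    # Component type from the first phase; used only by the slot-filling model.
    if component is not None:
        feats["l_cs[0]"] = component
    return feats


# Made-up example: unaccented Vietnamese tokens with hypothetical POS tags.
tokens = ["diem", "toan", "lon", "hon", "9"]
pos_tags = ["N", "N", "A", "A", "M"]
feats = token_features(tokens, pos_tags, 1, window=2,
                       column_names={"diem toan"}, table_names={"diem thi"},
                       component="condition")
```

In the text the window size is 7 for component segmentation and 5 for slot-filling; the example uses 2 only to keep the output small.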
IV. EVALUATION
A. Experimental Data
To evaluate the proposed method, we asked annotators to annotate a dataset of Vietnamese natural language queries. We obtained a medium-sized dataset consisting of 1258 queries on three databases: High school final test scores, Flight, and Book. Figure 4 and Figure 5 show statistics of the dataset, including the number of samples for each component and slot label and their proportions in the entire dataset.
Fig. 4. Label statistics in the component segmentation phase.
Fig. 5. Label statistics in the slot-filling phase.
We divide the dataset into 5 folds with train/test splits and report the results of the best model for each phase.
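A 5-fold split of this kind can be sketched as follows. The text does not say whether the data were shuffled or stratified, so the sketch simply assumes contiguous folds over sample indices; the function name is hypothetical, and in practice the indices would usually be shuffled first.

```python
def k_fold_splits(n_samples, k=5):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # The first (n_samples % k) folds receive one extra sample.
    fold_sizes = [
        n_samples // k + (1 if f < n_samples % k else 0) for f in range(k)
    ]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(k_fold_splits(1258, k=5))
test_sizes = [len(test) for _, test in folds]
```

For the 1258 queries of the dataset, this yields test folds of 252, 252, 252, 251 and 251 queries, each paired with the remaining queries as training data.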
B. Experimental Results and Analysis
To assess the performance of the proposed CRF model, we also built HMM and Support Vector Machine (SVM) models [1], [2] with similar feature selection and analyzed the experimental results carefully.
The results show that CRFs achieve the best results among the three models in both the component segmentation phase and the slot-filling phase, as can be seen in Figure 6 and Figure 7. Predicting the most likely output label sequence with CRFs relies not only on the current observation and current state but also on past and future observations and states, which is why CRFs outperform HMM. SVMs achieve the lowest results among the three models because they do not consider state-to-state and observation-to-state dependencies as CRFs and HMM do. Furthermore, CRFs model the probability of a whole state sequence given the observed sequence, whereas SVMs only separate the data into categories by mapping the data points onto an optimal linear separating hyperplane.
Fig. 6. Precision, recall and F1-score of the CRFs, HMM and SVM models in the component segmentation phase.
Fig. 7. Precision, recall and F1-score of the CRFs, HMM and SVM models in the slot-filling phase.
Table IV shows the performance for each label in the component segmentation phase. The micro-averaged F1-score is 93.48%, which indicates that this feature set achieves a high level of accuracy. Table IV also indicates that performance on select and group by is higher than on condition and order by. This is understandable because the WHERE and ORDER BY clauses are more ambiguous.

TABLE IV
PRECISION, RECALL AND F1-SCORE OF THE COMPONENT SEGMENTATION MODEL WITH CRFS

Type            Precision   Recall   F1-score
group by        97.50       97.50    97.50
order by        88.89       90.91    89.89
Average micro   93.43       93.54    93.48

The results reported in Table V show the precision, recall and F1-score of each slot type. The slot-filling phase achieves high performance, with a 91.9% F1-score. The aggregator and operator slots perform well because an aggregator belongs only to the SELECT clause and an operator only to the WHERE clause. In contrast, column and value information in an SQL query can belong to many different clauses, such as SELECT, GROUP BY, and ORDER BY, which leads to ambiguity in predicting the precise clause for a column. Therefore, their performance is lower and less stable.
TABLE V
PRECISION, RECALL AND F1-SCORE OF THE SLOT-FILLING MODEL WITH CRFS

Type            Precision   Recall   F1-score
sel-col         87.23       80.39    83.67
cond-col        93.56       92.31    92.93
group by-col    97.50       97.50    97.50
order by-col    87.76       87.76    87.7
sel-agg         96.91       98.95    97.92
cond-op         93.75       96.15    94.94
cond-val        91.81       93.08    92.94
order by-val    92.31       87.80    90.90
Average micro   92.62       91.18    91.90
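Micro-averaged precision, recall and F1-score of the kind reported in Tables IV and V can be computed by pooling counts over all queries and labels. The sketch below assumes span-level exact matching, meaning a predicted (label, start, end) triple counts as correct only if it is identical to a gold triple; the text does not state the matching criterion, and the function name and example data are hypothetical.

```python
def micro_prf(gold, predicted):
    """Micro-averaged precision, recall and F1 over exact span matches.

    gold, predicted: lists with one set of (label, start, end) per query.
    """
    true_pos = sum(len(g & p) for g, p in zip(gold, predicted))
    n_pred = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up example: the condition span of the first query has a wrong start.
gold = [{("select", 0, 1), ("condition", 3, 5)}, {("select", 0, 2)}]
predicted = [{("select", 0, 1), ("condition", 2, 5)}, {("select", 0, 2)}]
precision, recall, f1 = micro_prf(gold, predicted)
```

In this example two of three predicted spans are correct and two of three gold spans are found, so precision, recall and F1 are all 2/3. As a consistency check on Table IV, the harmonic mean of 93.43 and 93.54 is about 93.48, which agrees with the micro-averaged row.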
Figure 8 presents some examples of the model's predictions alongside the ground truth. At the simple level, which contains only non-overlapping components, our model segments SQL clauses and extracts sub-components quite accurately, as in the first and second examples. For higher-level queries such as the third example, our model predicts “Khoi A” (Combination A) as a column, which is incorrect. In this case, domain knowledge is needed to understand “combination A”. Additionally, our model cannot handle complex queries that contain joined tables and nested queries.
V. CONCLUSION
In this work, we propose a new two-phase approach for converting Vietnamese natural language queries into Structured Query Language. The results of this research may help both experienced and inexperienced users manage databases more easily. In our approach, both phases are formalized as sequence tagging problems, to which well-known machine learning methods can readily be applied. Within the scope of this research, the CRF model outperforms the HMM and SVM models. The approach handles natural language queries at the simple level, reaching F1-scores above 91% in both phases. As a next step, the approach can be developed further to handle natural language queries at the medium and complex levels.

Fig. 8. Examples of predictions by the model and ground truth results. Q denotes the natural language query. L and L' denote the ground truth label and the label produced by the model. S and S' denote the ground truth SQL query and the SQL query produced by the model.
REFERENCES
[1] Baum, L.E. and Petrie, T. (1966). “Statistical inference for probabilistic functions of finite state Markov chains.” The Annals of Mathematical Statistics, 37(6), pp. 1554-1563.
[2] Cortes, C. and Vapnik, V. (1995). “Support-vector networks.” Machine Learning, 20(3), pp. 273-297.
[3] Dong, L. and Lapata, M. (2016). “Language to Logical Form with Neural Attention.” CoRR, abs/1601.01280.
[4] Giordani, A. and Moschitti, A. (2012). “Translating questions to SQL queries with generative parsers discriminatively reranked.” Proceedings of COLING 2012: Posters, pp. 401-410.
[5] Iyer, S., Konstas, I., Cheung, A., Krishnamurthy, J. and Zettlemoyer, L.S. (2017). “Learning a Neural Semantic Parser from User Feedback.” ACL.
[6] Lafferty, J., McCallum, A. and Pereira, F.C. (2001). “Conditional random fields: Probabilistic models for segmenting and labeling sequence data.”
[7] Li, Y., Yang, H. and Jagadish, H.V. (2006). “Constructing a generic natural language interface for an XML database.” In EDBT, Vol. 3896, pp. 737-754.
[8] Mou, L., Lu, Z., Li, H. and Jin, Z. (2017). “Coupling Distributed and Symbolic Execution for Natural Language Queries.” ICLR.
[9] Pasupat, P. and Liang, P.S. (2015). “Compositional Semantic Parsing on Semi-Structured Tables.” ACL.
[10] Popescu, A.M., Etzioni, O. and Kautz, H. (2003). “Towards a theory of natural language interfaces to databases.” In Proceedings of the 8th International Conference on Intelligent User Interfaces, pp. 149-157. ACM.
[11] Stratica, N., Kosseim, L. and Desai, B.C. (2005). “Using semantic templates for a natural language interface to the CINDI virtual library.” Data & Knowledge Engineering, 55(1), pp. 4-19.
[12] Wang, C., Cheung, A. and Bodik, R. (2017). “Synthesizing highly expressive SQL queries from input-output examples.” In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 452-466. ACM.
[13] Xu, X., Liu, C. and Song, D. (2017). “SQLNet: Generating structured queries from natural language without reinforcement learning.”
[14] Yaghmazadeh, N., Wang, Y., Dillig, I. and Dillig, T. (2017). “SQLizer: query synthesis from natural language.” Proceedings of the ACM on Programming Languages, 1(OOPSLA), p. 63.
[15] Yu, T., Li, Z., Zhang, Z., Zhang, R. and Radev, D. (2018). “TypeSQL: Knowledge-based type-aware neural text-to-SQL generation.”
[16] Zhong, V., Xiong, C. and Socher, R. (2017). “Seq2SQL: Generating structured queries from natural language using reinforcement learning.”