L2S: Transforming natural language questions into SQL queries
Duc Tam Hoang Faculty of Information Technology,
VNU University of Engineering and Technology,
Vietnam National University, Hanoi
tamhd1990@gmail.com
Minh Le Nguyen School of Information Science, Japan Advanced Institute of Science and Technology
nguyenml@jaist.ac.jp
Son Bao Pham Faculty of Information Technology, VNU University of Engineering and Technology, Vietnam National University, Hanoi
sonpb@vnu.edu.vn
Abstract—The reliability of a question answering system is bounded
by the availability of resources and linguistic tools. In this paper,
we introduce a hybrid approach to transforming natural language
questions into structured queries. It alleviates the lack of experts for
domain observation and the deficient performance of linguistic tools.
Specifically, we exploit semantic information for mapping natural
language terminologies to structured query elements, and a bipartite
graph model for the matching phase. Experimental results on the Vietnam
national university entrance exam dataset and the Geoqueries880 [27]
dataset achieve accuracies of 91.14% and 87.55%, respectively.
I. INTRODUCTION
A reliable QA system requires an approach capable of exploiting
details from the question and the background knowledge of the domain,
which makes a closed-domain QA system domain-dependent.
State-of-the-art approaches generally learn from a set of annotated training
data. If the domain has rich training data, it is feasible to develop
a statistical approach. However, if the training data is absent,
especially with under-resourced languages such as Vietnamese, statistical
approaches are at a deadlock [10].
Rule-based (grammar-based) approaches pose different types of
problem. Firstly, extensive effort from experts is required for
hand-crafted rules. Secondly, if the database changes, a typical
grammar-based approach demands a huge workload creating a new set of rules;
otherwise the accuracy will fall significantly [11].
L2S aims to deal with the problems described above. Our objective is
an approach which tolerates the lack of an annotated corpus, of experts'
effort in domain analysis, and of powerful linguistic tools. In this sense,
linguistic tools are the underlying tools which support the process of
a QA system, such as the tokenizer, part-of-speech tagger, syntactic parser,
dependency tree parser and named entity recognizer. One failure of
a linguistic tool generally leads to an incorrect answer.
The database type plays a crucial role in building up a QA
system. Over recent years, the growth of data in tabular presentation
has brought attention to QA using relational databases. Structured
databases have become a main type of data format, beside written
free text. Moreover, the conversion from semi-structured resources
(such as tables) to a relational database is straightforward, compared
to other knowledge representations such as Ontology.
We propose a hybrid approach, which utilizes semantic
information and a graph model as a converter from questions posed
in NL to structured query language (SQL). Firstly, we propose a
method for mapping NL statements to SQL elements by exploiting
semantic information. Secondly, we construct a bipartite graph to
evaluate the remaining tokens. L2S aims to require no training data
and only minimal domain knowledge. Currently, in the phase of data
preprocessing, an understanding of the domain is necessary, but it takes
far less effort compared to building a set of rules or collecting an
annotated training dataset.
Apart from this introduction, the rest of the paper is arranged as follows. Section II provides a review of related works with regard
to closed-domain QA systems. Section III presents the architecture
of L2S. Section IV proposes our approach to solve the problem
of interlingual mapping. Two experiments in which we tested L2S are presented in Section V. Finally, Section VI summarizes the main points and discusses our future extension.
II. RELATED WORKS
In this section, we provide an insight into publications related
to closed-domain question answering systems and the databases that they target.
Question answering has been addressed by numerous systems following a variety of approaches. Current approaches are generally categorized into two main types: statistical and non-statistical approaches.
Statistical approaches are dedicated to the manipulation of statistical models. From the set of pairs between natural language and machine language, a statistical approach first builds a training
set of correct and incorrect question-answer pairs [13]. The QA
problem is then transformed into a binary classification task with the
labels correct mapping and incorrect mapping between the question
and the answer/query. Features for machine learning may be extracted
by obtaining tokens and syntactic trees of questions and queries [12]. Notably, statistical approaches have only begun to receive serious attention recently in some specific areas, such as the medical domain [9]. With a huge set of annotated training data, statistical approaches demonstrate promising results [14] [24].
Regarding non-statistical approaches, the majority of them are rule-based approaches [5]. Compared to statistical approaches, which gain attention in research-oriented works, rule-based approaches find favour with real-life industrial systems. The main component of rule-based approaches is a set of QA patterns [23] [8] [22]. Because the patterns are written by experts with extensive domain knowledge, a rule-based system gains promising results with a relatively small set of patterns. When the set of patterns gets larger,
it is difficult to manage or improve. System accuracy is not always improved when adding a new rule, as the new rule may conflict with the existing rules.
There are also non-rule approaches such as syntactic-based analysis [2], Prolog-like representation [25] and graph-based approaches [21].
A. Converting from a natural language question to a structured query
Database is a critical factor in QA. High-level representation of data (such as Ontology) supports complex operations, but the quantity
2015 Seventh International Conference on Knowledge and Systems Engineering
of available data in such format is sparse. In many systems, the
knowledge has to be converted manually from written text. For
example, if a method uses Ontology, a stage converting data into a
structured query database or free text into Ontology format is inevitable.
There is a trade-off between accessing the database and transforming text
into a database.
Aqualog [15], QuestIO [7] and follow-up systems achieve
promising results, but they are not popular in the industry
because of their knowledge representation, Ontology OWL. The
ad-hoc transformation between a written corpus or a structured query
database and an Ontology is laborious. Automatic processing for
the transformation is far from meeting the
requirements of QA systems. Sometimes it is difficult to overcome
the problem of modeling data in Ontology relations, even for humans.
VnQAS [19] is a follow-up system of Aqualog. It is one
of the notable QA systems for Vietnamese [17]. Based on the
structure of Aqualog, the authors of VnQAS transformed the question
into a knowledge representation via the pattern matching technique. For
the demonstration, they manually created an Ontology (15 concepts,
17 attributes and 78 instances) and a set of hand-written patterns for
questions in the domain of organizational structure.
III. L2S ARCHITECTURE
Given a question Q relating to a specific database D, L2S aims to
transform Q into a SQL query of the SELECT–FROM–WHERE
form. The WHERE field can be either a single condition
or a conjunction of multiple conditions. A condition is a pair of two
variables e_i and e_j which are compared by an operator o.
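To make the target form concrete, a question like "How many marks did Hoang Viet, born 09-08-1994, receive?" could map to a query of the following shape. The table and column names here are illustrative only, not taken from the actual UEEM schema:

```python
# Hypothetical target query; "candidate", "marks", "name" and
# "birthday" are assumed schema names for illustration only.
# The WHERE field here is a conjunction of two conditions, each a
# (variable, operator, variable) triple as described in the text.
query = (
    "SELECT marks "
    "FROM candidate "
    "WHERE name = 'Hoang Viet' AND birthday = '09-08-1994'"
)
print(query)
```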
L2S consists of three modules. First, Q and D are analysed by
the Preprocessing module. Then, in the Matching module, data is
processed through the Semantic Matching component and the
Graph-based Matching component respectively. Finally, the Generating
module is responsible for building a complete SQL query.
A. Preprocessing module
From Q and D, this module extracts all features for the
matching phase. There are three pre-processing components: the Linguistic
component (to analyse Q), the Lexicon component (to analyse D) and
the Ambiguity Solving component (to correct the input).
The Linguistic component analyses Q to generate a set of tokens
W and establish the sentence attachment constraint. Tools include the
Named Entity recognizer [18] (NER) and Coltech-parser1 in GATE [6].
Through the use of Java Annotation Patterns Engine (JAPE)
grammars, we improved the performance of those tools to extend the
number of entities that they could detect. Finally, a tokenization set W,
a set of names T acquired by improving the named entity tags,
and the attachment constraint are the output of the Linguistic component.
Two words w1 and w2 of W are attached if they are a pair of object
and complement or a pair of subject and question word.
The Lexicon component analyses D for its elements and establishes
the database attachment constraint. We define three types of element:
DB relation (associated with a table name), DB attribute (associated
with a table column) and DB value (associated with a value). Among
those types, suppose that element e1 is a DB attribute; then element
e2 is attached to e1 if it is a value of column e1 or a relation
containing e1. This component extracts all elements before comparing
them to a synonym dictionary to build an interface of the database.
1http://www.jaist.ac.jp/~nguyenml/NhomQA/coltechparser.zip
The Ambiguity Solving component guarantees a sound input (no
ambiguity) for the matching module. Taking the output of the Linguistic component and the Lexicon component, L2S compares all
words of W to the elements of D. A word w1 is evaluated as ambiguous
if it matches zero or more than two elements of the same type. We have tried the method of ellipsis or choosing the highest possibility, but it was inefficient as it sometimes leads to failure. In this case, L2S first
retrieves all words {w2, w3, ..., wn} for which the similarity function
Ps(wi, w1) > λ. There is then a possibility to engage the users by
clarifying the input with suggestions from similar words.
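As a rough sketch of this component: the paper does not specify Ps, so here difflib's ratio stands in for the similarity function, and the threshold λ is an arbitrary choice; the element list is illustrative.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """String similarity in [0, 1]; a stand-in for the paper's Ps."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def candidate_matches(word, elements, lam=0.8):
    """Database elements whose similarity to `word` exceeds the
    threshold lambda.  Zero hits, or more than one hit of the same
    type, marks the word as ambiguous and would trigger a
    clarification dialogue with the user."""
    return [e for e in elements if similarity(word, e) > lam]
```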
B. Matching module
The Matching module is the core of a QA system. It is sometimes called by different names but serves the same purpose of interlingual mapping. In L2S, it is responsible for producing a list of specific SQL elements from the output of the previous step. The matching module has
to support three tasks as follows.
Firstly, function words in Q are identified. They are the words associated with a function in the SQL query, such as question words, comparing words and linking words. They are processed
differently from other words.
Secondly, each input word wi is matched with the associated object oj
from the database. oj might be a value, an attribute or a relation. The matching module has to resolve the ambiguity when one input word might match multiple objects.
Thirdly, each pair (wi, oj) is linked to another pair (wk, oh) if there is a conditional relation between wi and wk. If there is no pair
(wk, oh) given by the question, L2S has to retrieve it.
Because the Matching module is the most difficult part of the QA problem, each QA system undertakes a different technique. The proposed method used in L2S is described in detail in Section IV.
C. Generating module
Taking the element list E from the Matching module, this
module is responsible for creating a sound SQL query which delivers the exact answer.
Operator Action: Considering the possible SQL queries which can
be drawn from the database, we introduce a new phase to assign a
suitable operator to each conditional pair. The set of operators
between attribute and value includes = (equal), < (smaller), > (greater),
>= (greater or equal), <= (smaller or equal), LIKE (equal in a text string)
and LIKE %(X)% (contains a string). This component decides the
suitable operator for each condition pair.
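A minimal sketch of how such operator assignment could look; the comparing-word lexicon below contains only the two Vietnamese words glossed later in the paper and is illustrative, not the system's actual list:

```python
# Comparing words (glosses follow the paper) mapped to SQL
# operators; this lexicon is an illustrative stand-in.
OPERATOR_WORDS = {
    "lớn hơn": ">",    # greater
    "nhỏ hơn": "<",    # smaller
}

def choose_operator(tokens, default="="):
    """Pick the SQL operator for a condition pair from the
    comparing words recognized in the question; default to
    equality when no comparing word is present."""
    for tok in tokens:
        if tok in OPERATOR_WORDS:
            return OPERATOR_WORDS[tok]
    return default
```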
SQL Generator: This component takes all output from the
previous steps and provides a concise SQL query. Of the three main
portions of the SQL query, the SELECT portion
is determined by the question type and its target. The WHERE
portion contains a conjunction of attributes and their values, with join
conditions in the case of multiple relations. The FROM portion contains the relations for the attributes that appear in WHERE.
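The assembly step described above can be sketched as follows; the condition tuples and names are illustrative, not the actual L2S data structures, and join conditions for the multi-relation case are omitted:

```python
def generate_sql(target_attr, conditions):
    """Assemble SELECT/FROM/WHERE from matched elements.
    `conditions` is a list of (relation, attribute, operator,
    value) tuples produced by the matching module (sketch only)."""
    # FROM lists the relations of the attributes used in WHERE
    relations = sorted({rel for rel, _, _, _ in conditions})
    # WHERE is a conjunction of attribute-operator-value conditions
    where = " AND ".join(
        f"{attr} {op} '{val}'" for _, attr, op, val in conditions
    )
    return (f"SELECT {target_attr} FROM {', '.join(relations)} "
            f"WHERE {where}")

print(generate_sql("marks",
                   [("candidate", "name", "=", "Hoang Viet"),
                    ("candidate", "birthday", "=", "09-08-1994")]))
```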
IV. BACKGROUND OF INTERLINGUAL MAPPING AND L2S PROPOSALS
With the main objective of qualitatively resolving (i) the absence
of a huge annotated corpus caused by language deprivation, (ii) the
laborious effort of writing a whole set of patterns, and (iii) the poor
performance of underlying tools including the named entity tagger,
dependency parser and even tokenizer, this paper proposes a new
method in QA to handle the question-answering task by processing
at the word level. Our approach combines two approaches:
• exploiting the accurate semantic properties provided by the underlying
linguistic tools;
• graph-based matching of the remaining elements/entities.
A. Processing queries based on semantic information
A semantic role is the relationship between a syntactic constituent
and a predicate [16]. It defines the actual meaning of a word inside
the sentence. We use a named entity recognition tagger and a parser to
exploit the semantic information within a sentence.
Some studies have used semantic roles in answer extraction
modules, but they have all been employed as a support for delivering
a meaningful answer [16]. In L2S, we explicitly create a mapping
between the semantic information of the question and the so-called
semantic information of database elements. The semantic information is
evaluated based on two concepts: name and attachment.
First, we define name as a set of labels for word classes, similar to
the definition of a named entity. The set of names is established
by manually adjusting the tags produced by the named entity tagger.
Meanwhile, the nature of an element in the database is defined by
the object that contains it. If the element is a value, then the
corresponding object is its column, which is called an attribute. This
is reasonable when all persons' names are in the column "person"
(or its synonyms). An important factor is that the lexical properties
of names are distributed in every language, making this feature
independent of the domain shift.
Second, the attachment relationship between two words is based
on the dependency tree. Two words are marked as attached if and
only if they are either connected or are children of the same node in
the dependency tree. In terms of the database, two elements are considered
attached when they are a value and an attribute in the same column, or
an attribute and a relation in the same scheme.
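The word-side attachment test above can be sketched directly, assuming the dependency tree is given as a token-to-head mapping; the example tree and tokens are illustrative:

```python
def attached(w1, w2, head):
    """Attachment test over a dependency tree given as a mapping
    from each token to its head token: two tokens are attached if
    one governs the other or they are children of the same node."""
    return (head.get(w1) == w2 or head.get(w2) == w1
            or (head.get(w1) is not None
                and head.get(w1) == head.get(w2)))

# Illustrative (partial) tree for "How many marks did
# Hoang Viet receive?"
head = {"marks": "received", "Hoang Viet": "received",
        "received": None, "many": None}
```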
Since the lexical property alone is against the definition of
context, if there is one entity playing the role of a subject, its object
and their relationship must be identified. On the contrary, if the entity
plays the role of an object, it is important to know which is the
subject.
Algorithm 1 illustrates our method to leverage the semantic
information. Taking W (words in the question) and E (elements in the
database) obtained from the preprocessing module,
Algorithm 1 generates a list of paired elements P.
Figure 1: Semantic processing for "How many marks did Hoang Viet, who has birthday (on) 09-08-1994, receive?" (For each token the figure shows its type, word order, semantic link and synonym: "Hoàng_Việt" is tagged Person and linked to the synonym Candidate; "ngày_sinh" (birthday) is tagged Date; "bao_nhiêu" (how many) is tagged Question Word.)
Consider the question "Hoàng Việt có ngày sinh 09-08-1994 được bao nhiêu
điểm?" (How many marks did Hoàng Việt, born 09-08-1994, receive?). Tokens "Hoàng Việt", "09-08-1994" and "bao nhiêu"
have the recognizable tags Person, Date and QuestionWord respectively.
Algorithm 1: Semantic mapping
Input: finite sets W = {w1, w2, ..., wn} and E
Output: list of SQL elements P
1  P ← ∅
2  for i ← 1 to n do
4    Ts ← synonyms of tag_i
6    if T = Ø then
10     e_i ← f_s(w_i, S)
13     P ← (w_i, e_i, T)
15 return P
From the tag Person, the synonym Candidate is retrieved to make a
mapping pair. Figure 1 shows the semantic processing for this example.
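The name-to-element step above can be sketched as follows; the tag names and the synonym dictionary are illustrative stand-ins for the interface built by the Lexicon component:

```python
# Illustrative synonym dictionary: name tag -> database element.
SYNONYMS = {"Person": "candidate", "Date": "birthday"}

def semantic_map(tagged_tokens):
    """tagged_tokens: (token, name_tag) pairs.  Returns the list P
    of (token, database element, tag) triples for tokens whose tag
    resolves through the synonym dictionary; unresolved tokens are
    left for the graph-based matching phase."""
    pairs = []
    for word, tag in tagged_tokens:
        element = SYNONYMS.get(tag)
        if element is not None:
            pairs.append((word, element, tag))
    return pairs
```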
B. Processing queries based on the graph-based model
The use of graph-based algorithms in computational linguistics can be found in statistical learning [1], question answering systems [21], clustering systems [26] and so on. The algorithms rely on a similarity graph consisting of nodes representing linguistic features.
By resolving the graph, for example finding a minimum spanning tree, minimum circle or maximum matching, the result is transformed back to the original problem.
Our goal is to exploit a graph-based method for improving the consistency in mapping the sentence tokens to the database elements.
Intuitively, a set W of tokens should be paired with the equivalent
elements in the database. This means that the method has the ability
to find the correct pair even when one token could be similar to a number
of database elements.
We follow the idea in [21] to resolve the problem of question answering based on a bipartite graph [3]. L2S constructs a bipartite
graph G(V, C) in which the vertex set is V = W + E, and the edge set C is formed by all mappings between W and E given by the string similarity
method for name matching [4]. The constraint in this graph is the
attachment in the semantic information. Therefore, the QA task is now
converted into the problem of finding the maximum matching in G.
To solve the problem in G, a directed graph G'(V', C') is built.
V' contains all the nodes of V along with a source node s and a sink
node t. The capacity of all nodes in V' is 1, except for s and t, whose capacities are given by the number of nodes which could be linked. C' is the set
of directed edges from s through W and E to t. Each edge is given
unit capacity 1 by default.
Algorithm 2 represents our method for solving the problem
on the directed graph G'. Let f be an integral flow of G' of value k; it is straightforward to conclude that the set of edges carrying flow in f forms a matching of size k for the graph G. Therefore, the solution of the maximum flow in G' gives us a maximum matching in G. Finally,
Algorithm 2: Maximum bipartite matching: Ford-Fulkerson
Input: Graph G', source node S, sink node T
Output: A flow f from S to T which is maximum
1  f(x, y) ← 0 for all edges (x, y)
2  while ∃ path p from S to T in G_f with capacity c_f(w, e) > 0 for all (w, e) ∈ p do
3    c_f(p) ← min{c_f(w, e) : (w, e) ∈ p}
4    for (w, e) ∈ p do
5      f(w, e) ← f(w, e) + c_f(p)   (send flow along the path)
6      f(e, w) ← f(e, w) − c_f(p)   (update the backward edge)
7  return f
the maximum flow goes through all the nodes which are expected to be
in the SQL query.
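Since every word and element node has unit capacity, the Ford-Fulkerson computation reduces to the standard augmenting-path search for bipartite matching. A compact sketch, with illustrative edge sets:

```python
def max_bipartite_matching(words, edges):
    """Ford-Fulkerson specialised to unit capacities: repeatedly
    search for an augmenting path from an unmatched word to an
    unmatched element.  `edges` maps each word to the database
    elements it is similar to (the edge set C of graph G)."""
    match = {}  # element -> word currently assigned to it

    def augment(w, seen):
        for e in edges.get(w, []):
            if e not in seen:
                seen.add(e)
                # e is free, or its current word can be re-matched
                if e not in match or augment(match[e], seen):
                    match[e] = w
                    return True
        return False

    for w in words:
        augment(w, set())
    return {w: e for e, w in match.items()}

# Illustrative word/element similarity edges
edges = {"Hoang Viet": ["name"],
         "09-08-1994": ["birthday"],
         "marks": ["marks", "name"]}
matching = max_bipartite_matching(list(edges), edges)
```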
V. EXPERIMENT
A. Dataset
In this section, we present an empirical evaluation to assess the
effectiveness of L2S in Vietnamese. We conduct experiments on two
sample datasets with regard to two domains. Each dataset consists of
two parts: testing questions and a database.
The first dataset was taken from the domain of university national
entrance exam marks (UEEM). The database was taken from top
universities in Vietnam. It contains the candidates' results along
with their information, including name, date of birth, hometown and
identification number. In terms of testing questions, 429 questions
were collected from human users. We divide the testing set into two
types: straightforward questions (68) and general questions (361).
Straightforward questions have a simple structure, indicated by the
providers. The question contains a complete meaning, with one
question word and no ambiguous term. In contrast, general questions
are not bounded by the definition of simplicity. They were expressed
in a natural way. They might have more than one question word, or
no question word. Some terms were omitted and some unrelated words
were added, which leads to the problem of ambiguity. However, this
guarantees the perception of natural language. It follows the rule
of the noisy channel: the noisy channel makes what people say different
from what they think.
The second dataset is in the domain of Geography (GEO). By
translating the set of Geoqueries880 [27] questions into Vietnamese,
we collected an original set of 880 questions2. All proper American
names were substituted by corresponding names in Vietnamese.
We filtered out similar questions to keep 498 distinct questions for
testing. We employed three translators working separately. They then
discussed to generate one final set of testing questions.
Both datasets were collected carefully. We not only make the
testing experiment for our approach, but also create two standard
testing datasets for QA in the Vietnamese language. From the domain, we
manually created a mapping table between the terminology of columns
and the possible expressions in questions. The table was created based
on common knowledge of each domain. We keep the same
table for the two experiments.
To measure the performance of L2S, we build a baseline with the
graph-based approach, which treats the database as a dictionary [21].
Without a technique for solving ambiguity, the baseline ignores all
questions which are intractable: questions with ambiguous words
2http://www.cs.utexas.edu/users/ml/geo.html
or misplaced attachment features. The baseline directly transforms the question and database elements into a graph, then extracts the query from the result of the maximum bipartite matching algorithm. By comparing L2S with the baseline, we evaluate the effectiveness
of our approach with semantic information, that is, the difference between our hybrid approach and a standalone approach.
If a statement is processed, an answer is output from L2S or
the baseline. The answer is correct if its query derives the exact
information that has been asked. One question may accept several queries, but it has a unique result; the queries are evaluated by human annotators. The annotators perform two actions. First, they execute the query. Then, if the query is successfully executed, the result is compared with the precise answer provided by experts. The answer
is correct if and only if the query is executable and the obtained
result and the precise answer are identical. If the obtained result is
different from the answer, it is incorrect. If the query is not delivered
or is unexecutable, the answer is invalid.
B. Experiment 1
We first analyze the testing sets (68 straightforward and 361
general). Most questions belong to three main categories: entity (asking about a specific subject/object), number (asking for the quantity/ranking of
a group of subjects) and ratio (asking for a proportion).
Table I shows the difference between the two testing sets of UEEM. While the majority of straightforward questions are entity questions (70.59%), general questions are divided more evenly. Notably, 9.14% of all the general questions are statement sentences, which do not contain a question word. This would lead to errors with hand-written patterns designed to capture the question structure.
The database of UEEM contains three types of information:
relation (names of the universities), attributes ("identification number",
"name", "marks" and titles of other information) and values (the value of each instance in the database). It is common that two candidates receive the same mark in [0-10]. Therefore, one value may belong
to different instances, posing high ambiguity in this database.
To guarantee a sound input, we implemented the preprocessing module to enhance the performance on the QA task. The source code and all updates of the completed system have been published online3.
Table II illustrates the precision, recall, F-measure and accuracy
of the two testing sets. Precision is the percentage of correct answers
in the total answers. Recall is the percentage of correct answers in the set
of correct answers and no-answers. Accuracy is the percentage of correct answers in the total questions. L2S answers all questions, while
the baseline only answers tractable questions. Therefore, the recall
of L2S is always 100%.
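The measures as defined above can be sketched directly. Note that a system answering every question (no-answer count zero) gets recall 1.0, which matches the remark about L2S; the example counts below are illustrative:

```python
def qa_metrics(correct, answered, total):
    """Measures as defined in the text: precision over delivered
    answers, recall over correct answers plus unanswered
    questions, accuracy over all questions."""
    no_answer = total - answered
    precision = correct / answered
    recall = correct / (correct + no_answer)
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = correct / total
    return precision, recall, f_measure, accuracy
```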
Table II: Experiment with UEEM dataset
Sim test Precision Recall F-measure Accuracy
Gen test Precision Recall F-measure Accuracy
The precision of the baseline remains stable at around 76%. However,
it refuses to answer intractable questions, which were abundant in the
general set. Therefore, its recall and F-measure drop considerably.
3http://sourceforge.net/projects/l2s/
Table I: Simple and general test sets of the UEEM dataset
Testing sets Total Entity question Number question Ratio question Non-question
500 highest IDF words Named Entity Numeric Proper noun Question words Other types
L2S achieves a high result with an accuracy of 91.13% in the
general test, compared to 21.89% for the baseline. There are three
main reasons for this distinction.
• Linguistic tool failures, including the tokenizer and parser. A
minor incorrect position in the parser output leads to an intractable
question; this is resolved by using semantic information in
L2S.
• L2S recognizes comparing words like "lớn hơn" (greater) and "nhỏ
hơn" (smaller) as named entities. The baseline could
not find an equivalent element for them in the database and treats
them as intractable.
• General questions contain unknown words, either stop words
or words not in the database.
However, some mistakes in the tokenization and question word
processing lead to incorrect answers in both L2S and the baseline.
C. Experiment 2
A brief analysis of the second dataset showed that the proportion
of named entities among the 500 words with the highest IDF values is
81.3%. Similar to the UEEM dataset, named entities in the second
experiment are mostly proper names of locations, rivers and mountains,
question words and comparing words. They play an important role
in the question answering task.
The word frequency in this dataset is not as high as in the first
dataset. Given a random element, such as the name of a river, the height of
a mountain or the population of a province, there is no boundary on its
value. The number of elements which share the same value is
smaller than in the first database. In other words, this database is less
ambiguous than the first one.
Nevertheless, this dataset provides a new challenge. In the
database, 35.6% of the values (including mountains, lakes and rivers) have
strange names. A strange word originates from a local language,
such as "T'nưng" (lake), "phan xi pan" (mountain) and "Xi giơ Pao"
(mountain). These are localized names, leading to failures
of linguistic tools. In our prior analysis, none of the available Vietnamese
linguistic tools can handle these names.
Table III lists the results of the baseline and L2S on the testing set. Since
we are interested in the performance with regard to a new domain, we
evaluate the F-measure of both systems.
Table III: Experiment with Geoquery dataset
Gen test Precision Recall F-measure Accuracy
The baseline results show that, for geographical questions and the
baseline system, mis-matching queries are less common. This
makes the precision of the baseline higher, at 83.3%. This is not surprising
if we remember that the UEEM database was more ambiguous. The baseline results are interesting because they indicate that the graph-based method is somewhat effective: the less ambiguous the domain
is, the better it performs. However, it ignores the majority of
questions due to the intractability feature. This method alone cannot be
used for an actual system.
Next we measure the performance of L2S in the new domain. Because the configurations of L2S in the two experiments are mostly the same, we compare the results and put forward our evaluation. The precision and F-measure of L2S in the second experiment are lower than in the first one (general test set and simple test set). The main reason is the failure of tokenization and named entity recognition.
Like all other linguistic systems, L2S crucially requires the tokenizer to work precisely.
Overall, the performance of L2S is promising. L2S has proved its robustness on two sample datasets. With the same linguistic tools for
an under-resourced language like Vietnamese, L2S requires neither annotated training data nor a set of hand-crafted rules. Comparing the result with the first experiment, we can say that L2S has demonstrated its effectiveness in dealing with a different domain. This proves the effectiveness of our hybrid approach, combining semantic information and a graph-based model.
D. Discussion
The experiments were conducted on two different datasets with
no available annotated training set or hand-crafted rules. Human intervention was minimized to the common knowledge of the domains. The results strengthen our hypothesis that it is viable to build
a reliable question answering system in under-resourced languages. The two experiments were conducted with the same system configuration. No extensive observation of the domain for hand-written rules
or annotated features is required. The whole workload is significantly smaller than the time and effort spent on writing structured rules (in a typical rule-based system).
We sampled incorrect results from all experiments. Each incorrect answer was analyzed to find out the main cause.
Table IV: The main causes of errors (Baseline / L2S)
Linguistic tools failure 38.23% 4.86%
Question words ambiguity 4.15% 4.15%
Table IV shows a significant reduction in the main causes of error. Problems that the baseline refuses to answer were resolved by L2S.
On the one hand, L2S leverages the precise output of one linguistic tool to overcome the failure of others. For example, consider the question
"Thí sinh nào có ngày sinh là 06-03-1990?" (Which candidate has the birthday 06-03-1990?). The dependency parser failed to identify the relation between "ngày sinh" and "06-03-1990". However, L2S detects that the word "06-03-1990" has the
name tag "date", which is mapped to a synonym of "ngày sinh".
Even if the word "ngày sinh" is removed from the question, L2S still
retrieves the appropriate pair for "06-03-1990" and successfully delivers
the answer. We are developing a method to overcome the limitation
of question classification based on lexical tags.
VI. CONCLUSION
In this paper, we presented our hybrid approach for developing QA
systems in specialized domains. L2S contains one novel method of
semantic processing and one follow-up method of graph-based
processing. It overcomes the weaknesses of current approaches with
regard to the lack of training data, domain observation and weak
underlying linguistic tools.
L2S exploits information from underlying tools to select reliable
semantic information. Then, the graph-based processing handles
the remaining tokens and entities between the two languages (natural
language and SQL). By combining the two approaches, L2S
alleviates the heavy dependence on linguistic applications and domain
knowledge.
The experiments indicate the effectiveness of the hybrid method. The
first experiment measured to what extent the presented approaches
are useful for answering straightforward questions and tricky questions.
The second experiment demonstrated the robustness of L2S across
different domains. Results show that L2S maintains its accuracy
over different domains and requires a small workload to switch
domains.
ACKNOWLEDGMENT
This work is supported by the Nafosted project 102.01-2014.22.
REFERENCES
[1] Alexandrescu, A., Kirchhoff, K., 2009. Graph-based Learning for Statistical Machine Translation, in: Proc. of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 119–127.
[2] Androutsopoulos, I., 1995. Natural Language Interfaces to Databases - an Introduction. Journal of Natural Language Engineering 1, 29–81.
[3] Asratian, A.S., Denley, T.M.J., Häggkvist, R., 1998. Bipartite Graphs and Their Applications. Cambridge University Press, New York, NY, USA.
[4] Cohen, W.W., Ravikumar, P., Fienberg, S.E., 2003. A Comparison of String Distance Metrics for Name-Matching Tasks, in: Proc. of IJCAI-03 Workshop on Information Integration, pp. 73–78.
[5] Costa, P., Almeida, J., Pires, L., van Sinderen, M., 2008. Evaluation of a Rule-Based Approach for Context-Aware Services, pp. 1–5.
[6] Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V., 2002. GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications, in: Proc. of ACL, pp. 168–175.
[7] Danica Damljanovic, V.T., Bontcheva, K., 2008 A Text-based Query
Interface to OWL Ontologies, in: Proc of the Sixth International
Con-ference on Language Resources and Evaluation, Marrakech, Morocco
[8] Dat Quoc Nguyen, Dai Quoc Nguyen, S.B.P., Pham, D.D., 2011 Ripple
down rules for part-of-speech tagging, in: Proc of the 12th International
Conference on CL and Intelligent Text Processing, pp 190–201
[9] Demner-Fushman, D., Lin, J., 2007 Answering Clinical Questions
with Knowledge-Based and Statistical Techniques Comput Linguist
33, MIT Press, Cambridge, MA, USA pp 63–103
[10] Dien, D., Kiem, H., 2003 POS-Tagger for English-Vietnamese
Bilin-gual Corpus, in: Proc of the HLT-NAACL 2003 Workshop on Building
and using parallel texts Association for Computational Linguistics,
Stroudsburg, PA, USA pp 88–95
2008 Natural Language Database Interface for the Community Based Monitoring System, in: Proc of the 22nd Pacific Asia Conference
on Language, Information and Computation, De La Salle University (DLSU), Manila, Philippines pp 384–390
[12] Giordani, A., 2008 Mapping Natural Language into SQL in a NLIDB, in: Proc of the 13th international conference on Natural Language and Information Systems, Springer-Verlag, Berlin, Heidelberg pp 367–371 [13] Giordani, A., Moschitti, A., 2010 Semantic Mapping between Natural Language Questions and SQL Queries via Syntactic Pairing, in: Proc of the 14th International Conference on Applications of Natural Language
to Information Systems, Springer-Verlag, Berlin, Heidelberg pp 207– 221
[14] Kusumoto, T., Akiba, T., 2012 Statistical Machine Translation without Source-side Parallel Corpus Using Word Lattice and Phrase Extension, in: Proc of the Eight International Conference on Language Resources and Evaluation (LREC12), European Language Resources Association (ELRA), Istanbul, Turkey
[15] Lopez, V., Uren, V., Motta, E., Pasin, M., 2007 Aqualog: An ontology-driven question answering system for organizational semantic intranets Web Semant 5, 72–105
[16] Moreda, P., Llorens, H., Saquete, E., Palomar, M., 2011 Combining semantic information in question answering systems Inf Process Manage 47, 870–885
[17] Nguyen, Dat Tien, Hoang, Duc Tam and Pham, Son Bao, 2012
A Vietnamese Natural Language Interface to Database, Sixth IEEE International Conference on Semantic Computing, ICSC, Palermo, Italy 130–133
[18] Nguyen, D.B., Hoang, S.H., Pham, S.B., Nguyen, T.P., 2010 Named Entity Recognition for Vietnamese, in: Proc of the Second international conference on Intelligent information and database systems: Part II, Springer-Verlag, Berlin, Heidelberg pp 205–214
[19] Nguyen, D.Q., Nguyen, D.Q., Pham, S.B., 2009a A Vietnamese Question Answering System, in: Proc of KSE, pp 26–32
[20] Nguyen, P.T., Vu, X.L., Nguyen, T.M.H., Nguyen, V.H., Le, H.P., 2009b Building a large syntactically-annotated corpus of Vietnamese, in: Proc of the Third Linguistic Annotation Workshop, Association for Computational Linguistics, Stroudsburg, PA, USA pp 182–185 [21] Popescu, A.M., Etzioni, O., Kautz, H., 2003 Towards a theory of nat-ural language interfaces to databases, in: Proc of the 8th International Conference on Intelligent User Interfaces, ACM, New York, NY, USA
pp 149–157
[22] Saxena, A.K., Sambhu, G.V., Subramaniam, L.V., Kaushik, S., 2007 IITD-IBMIRL System for Question Answering using Pattern Match-ing, Semantic Type and Semantic Category Recognition, in: Proc of The Sixteenth Text REtrieval Conference, 2007, Gaithersburg, Mary-land, USA
[23] Sneiders, E., 2002 Automated Question Answering Using Question Templates That Cover the Conceptual Model of the Database, in: Andersson, B., Bergholtz, M., Johannesson, P (Eds.), Natural Language Processing and Information Systems, Springer Berlin Heidelberg pp 235–239
[24] Suzuki, J., Sasaki, Y., Maeda, E., 2002 SVM Answer Selection for Open-Domain Question Answering, in: Proc of the 19th International Conference on Computational Linguistics, , Stroudsburg, PA, USA pp 1–7
[25] Waltz, D.L., 1978 An English Language Question Answering System for a Large Relational Database Commun ACM 21, pp 526–539 [26] Wieling, M., Nerbonne, J., 2010 Hierarchical spectral partitioning of bipartite graphs to cluster dialects and identify distinguishing features, in: Association for Computational Linguistics, Stroudsburg, PA, USA
pp 33–41
[27] Wong, Yuk Wah, Mooney, Raymond, 2007 Learning Synchronous Grammars for Semantic Parsing with Lambda Calculus, Association for Computational Linguistics, Prague, Czech Republic pp 960–967