L2S: Transforming natural language questions into SQL queries
Duc Tam Hoang Faculty of Information Technology,
VNU University of Engineering and Technology,
Vietnam National University, Hanoi
tamhd1990@gmail.com
Minh Le Nguyen School of Information Science, Japan Advanced Institute of Science and Technology
nguyenml@jaist.ac.jp
Son Bao Pham Faculty of Information Technology, VNU University of Engineering and Technology, Vietnam National University, Hanoi
sonpb@vnu.edu.vn
Abstract—The reliability of a question answering system is bounded
by the availability of resources and linguistic tools. In this paper,
we introduce a hybrid approach to transforming natural language
questions into structured queries. It alleviates the lack of experts for
domain observation and the deficient performance of linguistic tools.
Specifically, we exploit semantic information for mapping natural
language terminologies to structured query elements, and a bipartite
graph model for the matching phase. Experimental results on the Vietnam
national university entrance exam dataset and the Geoqueries880 [27]
dataset achieve accuracies of 91.14% and 87.55%, respectively.
I. INTRODUCTION
A reliable QA system requires an approach capable of exploiting
details from the question and the background knowledge of the domain,
which makes a closed-domain QA system domain-dependent.
State-of-the-art approaches generally learn from a set of annotated training
data. If the domain has rich training data, it is feasible to develop
a statistical approach. However, if the training data is absent,
especially with under-resourced languages such as Vietnamese, statistical
approaches are at a deadlock [10].
Rule-based (grammar-based) approaches pose different types of
problem. Firstly, extensive effort from experts is required for
hand-crafted rules. Secondly, if the database changes, a typical
grammar-based approach demands a huge workload creating a new set of rules;
otherwise the accuracy will fall significantly [11].
L2S aims to deal with the problems described above. Our objective is
an approach which tolerates the lack of an annotated corpus, of experts'
effort in domain analysis, and of powerful linguistic tools. In this sense,
linguistic tools are the underlying tools which support the process of
a QA system, such as the tokenizer, part-of-speech tagger, syntactic parser,
dependency tree parser and named entity recognizer. One failure of
a linguistic tool generally leads to an incorrect answer.
The database type plays a crucial role in building up a QA
system. Over recent years, the growth of data in tabular presentation
has brought attention to QA using relational databases. Structured
databases have become a main type of data format, beside written
free text. Moreover, the conversion from semi-structured resources
(such as tables) to a relational database is straightforward, compared
to other knowledge representations such as Ontology.
We propose a hybrid approach, which utilizes semantic
information and a graph model as a converter from questions posed
in NL to structured query language (SQL). Firstly, we propose a
method for mapping NL statements to SQL elements by exploiting
semantic information. Secondly, we construct a bipartite graph to
evaluate the remaining tokens. L2S aims to require no training data
and only minimal domain knowledge. Currently, in the phase of data
preprocessing, an understanding of the domain is necessary, but it takes
far less effort compared to building a set of rules or collecting an
annotated training dataset.
Apart from this introduction, the rest of the paper is arranged as follows. Section II provides a review of related works with regard
to closed-domain QA systems. Section III presents the architecture
of L2S. Section IV proposes our approach to solve the problem
of interlingual mapping. Two experiments in which we tested L2S are presented in Section V. Finally, Section VI summarizes the main points and discusses our future extension.
II. RELATED WORKS
In this section, we provide an insight into publications related
to closed-domain question answering systems and the databases that they target.
Question answering has been addressed by numerous systems following a variety of approaches. Current approaches are generally categorized into two main types: statistical and non-statistical approaches.
Statistical approaches are dedicated to the manipulation of statistical models. From the set of pairs between natural language and machine language, a statistical approach first builds a training
set of correct and incorrect question-answer pairs [13]. The QA
problem is then transformed into a binary classification task with the
labels correct mapping and incorrect mapping between the question
and the answer/query. Features for machine learning may be extracted
by obtaining tokens and syntactic trees of questions and queries [12]. Notably, statistical approaches have only begun to receive serious attention recently in some specific areas, such as the medical domain [9]. With a huge set of annotated training data, statistical approaches demonstrate promising results [14] [24].
Regarding non-statistical approaches, the majority of them are rule-based approaches [5]. Compared to statistical approaches, which gain attention in research-oriented works, rule-based approaches find favour with real-life industrial systems. The main component of rule-based approaches is a set of QA patterns [23] [8] [22]. Because the patterns are written by experts with extensive domain knowledge, a rule-based system gains promising results with a relatively small set of patterns. When the set of patterns gets larger,
it is difficult to manage or improve. System accuracy is not always improved when adding a new rule, as the new rule may conflict with the existing rules.
There are also non-rule approaches such as syntactic-based analysis [2], Prolog-like representation [25] and graph-based approaches [21].
A. Converting from a natural language question to a structured query
Database is a critical factor in QA. High-level representation of data (such as Ontology) supports complex operations, but the quantity
2015 Seventh International Conference on Knowledge and Systems Engineering
of available data in such format is sparse. In many systems, the
knowledge has to be converted manually from written text. For
example, if a method uses Ontology, a stage converting data into a
structured query database or free text into Ontology format is inevitable.
There is a trade-off between accessing the database and transforming text
into a database.
Aqualog [15], QuestIO [7] and follow-up systems achieve
promising results, but they are not popular in the industry
because of their knowledge representation, Ontology OWL. The
ad-hoc transformation between a written corpus or a structured query
database and an Ontology is laborious. Automatic processing for
the transformation is far from meeting the
requirements of QA systems. Sometimes it is difficult to overcome
the problem of modeling data in Ontology relations, even for humans.
VnQAS [19] is a follow-up system of Aqualog. It is one
of the notable QA systems for Vietnamese [17]. Based on the
structure of Aqualog, the authors of VnQAS transformed the question
into a knowledge representation via the pattern matching technique. For
the demonstration, they manually created an Ontology (15 concepts,
17 attributes and 78 instances) and a set of hand-written patterns for
questions in the domain of organizational structure.
III. L2S ARCHITECTURE
Given a question Q relating to a specific database D, L2S aims to
transform Q into a SQL query of the SELECT–FROM–WHERE
form. The WHERE field can be either a single condition
or a conjunction of multiple conditions. A condition is a pair of two
variables e_i and e_j which are compared by an operator o.
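To make the target form concrete, a question like "How many marks did Hoang Viet, born 09-08-1994, receive?" could map to a query of the following shape. The table and column names here are illustrative only, not taken from the actual UEEM schema:

```python
# Hypothetical target query; "candidate", "marks", "name" and
# "birthday" are assumed schema names for illustration only.
# The WHERE field here is a conjunction of two conditions, each a
# (variable, operator, variable) triple as described in the text.
query = (
    "SELECT marks "
    "FROM candidate "
    "WHERE name = 'Hoang Viet' AND birthday = '09-08-1994'"
)
print(query)
```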
L2S consists of three modules. First, Q and D are analysed by
the Preprocessing module. Then, in the Matching module, data is
processed through the Semantic Matching component and the
Graph-based Matching component respectively. Finally, the Generating
module is responsible for building a complete SQL query.
A. Preprocessing module
From Q and D, this module extracts all features for the
matching phase. There are three pre-processing components: the Linguistic
component (to analyse Q), the Lexicon component (to analyse D) and
the Ambiguity Solving component (to correct the input).
The Linguistic component analyses Q to generate a set of tokens
W and establish the sentence attachment constraint. Tools include the
Named Entity recognizer [18] (NER) and Coltech-parser1 in GATE [6].
Through the use of Java Annotation Patterns Engine (JAPE)
grammars, we improved the performance of those tools to extend the
number of entities that they could detect. Finally, a tokenization set W,
a set of names T acquired by improving the named entity tags,
and the attachment constraint are the output of the Linguistic component.
Two words w1 and w2 of W are attached if they are a pair of object
and complement or a pair of subject and question word.
The Lexicon component analyses D for its elements and establishes
the database attachment constraint. We define three types of element:
DB relation (associated with a table name), DB attribute (associated
with a table column) and DB value (associated with a value). Among
those types, suppose that element e1 is a DB attribute; then element
e2 is attached to e1 if it is a value of column e1 or a relation
containing e1. This component extracts all elements before comparing
them to a synonym dictionary to build an interface of the database.
1http://www.jaist.ac.jp/~nguyenml/NhomQA/coltechparser.zip
The Ambiguity Solving component guarantees a sound input (no
ambiguity) for the matching module. Taking the output of the Linguistic component and the Lexicon component, L2S compares all
words of W to the elements of D. A word w1 is evaluated as ambiguous
if it matches zero or more than two elements of the same type. We have tried the method of ellipsis or choosing the highest possibility, but it was inefficient as it sometimes leads to failure. In this case, L2S first
retrieves all words {w2, w3, ..., wn} for which the similarity function
Ps(wi, w1) > λ. There is then a possibility to engage the users by
clarifying the input with suggestions from similar words.
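As a rough sketch of this component: the paper does not specify Ps, so here difflib's ratio stands in for the similarity function, and the threshold λ is an arbitrary choice; the element list is illustrative.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """String similarity in [0, 1]; a stand-in for the paper's Ps."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def candidate_matches(word, elements, lam=0.8):
    """Database elements whose similarity to `word` exceeds the
    threshold lambda.  Zero hits, or more than one hit of the same
    type, marks the word as ambiguous and would trigger a
    clarification dialogue with the user."""
    return [e for e in elements if similarity(word, e) > lam]
```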
B. Matching module
The Matching module is the core of a QA system. It is sometimes called by different names but serves the same purpose of interlingual mapping. In L2S, it is responsible for producing a list of specific SQL elements from the output of the previous step. The matching module has
to support three tasks as follows.
Firstly, function words in Q are identified. They are the words associated with a function in the SQL query, such as question words, comparing words and linking words. They are processed
differently from other words.
Secondly, each input word wi is matched with the associated object oj
from the database. oj might be a value, an attribute or a relation. The matching module has to resolve the ambiguity when one input word might match multiple objects.
Thirdly, each pair (wi, oj) is linked to another pair (wk, oh) if there is a conditional relation between wi and wk. If there is no pair
(wk, oh) given by the question, L2S has to retrieve it.
Because the Matching module is the most difficult part of the QA problem, each QA system undertakes a different technique. The proposed method used in L2S is described in detail in Section IV.
C. Generating module
Taking the element list E from the Matching module, this
module is responsible for creating a sound SQL query which delivers the exact answer.
Operator Action: Considering the possible SQL queries which can
be drawn from the database, we introduce a new phase to assign a
suitable operator to each conditional pair. The set of operators
between attribute and value includes = (equal), < (smaller), > (greater),
>= (greater or equal), <= (smaller or equal), LIKE (equal in a text string)
and LIKE %(X)% (contains a string). This component decides the
suitable operator for each condition pair.
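A minimal sketch of how such operator assignment could look; the comparing-word lexicon below contains only the two Vietnamese words glossed later in the paper and is illustrative, not the system's actual list:

```python
# Comparing words (glosses follow the paper) mapped to SQL
# operators; this lexicon is an illustrative stand-in.
OPERATOR_WORDS = {
    "lớn hơn": ">",    # greater
    "nhỏ hơn": "<",    # smaller
}

def choose_operator(tokens, default="="):
    """Pick the SQL operator for a condition pair from the
    comparing words recognized in the question; default to
    equality when no comparing word is present."""
    for tok in tokens:
        if tok in OPERATOR_WORDS:
            return OPERATOR_WORDS[tok]
    return default
```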
SQL Generator: This component takes all output from the
previous steps and provides a concise SQL query. Of the three main
portions of the SQL query, the SELECT portion
is determined by the question type and its target. The WHERE
portion contains a conjunction of attributes and their values, with join
conditions in the case of multiple relations. The FROM portion contains the relations for the attributes that appear in WHERE.
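The assembly step described above can be sketched as follows; the condition tuples and names are illustrative, not the actual L2S data structures, and join conditions for the multi-relation case are omitted:

```python
def generate_sql(target_attr, conditions):
    """Assemble SELECT/FROM/WHERE from matched elements.
    `conditions` is a list of (relation, attribute, operator,
    value) tuples produced by the matching module (sketch only)."""
    # FROM lists the relations of the attributes used in WHERE
    relations = sorted({rel for rel, _, _, _ in conditions})
    # WHERE is a conjunction of attribute-operator-value conditions
    where = " AND ".join(
        f"{attr} {op} '{val}'" for _, attr, op, val in conditions
    )
    return (f"SELECT {target_attr} FROM {', '.join(relations)} "
            f"WHERE {where}")

print(generate_sql("marks",
                   [("candidate", "name", "=", "Hoang Viet"),
                    ("candidate", "birthday", "=", "09-08-1994")]))
```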
IV. BACKGROUND OF INTERLINGUAL MAPPING AND L2S PROPOSALS
With the main objective of qualitatively resolving (i) the absence
of a huge annotated corpus caused by language deprivation, (ii) the
laborious effort of writing a whole set of patterns, and (iii) the poor
performance of underlying tools including the named entity tagger,
dependency parser and even tokenizer, this paper proposes a new
method in QA to handle the question-answering task by processing
at the word level. Our approach combines two approaches:
• exploiting the accurate semantic properties provided by the underlying
linguistic tools;
• graph-based matching of the remaining elements/entities.
A. Processing queries based on semantic information
A semantic role is the relationship between a syntactic constituent
and a predicate [16]. It defines the actual meaning of a word inside
the sentence. We use a named entity recognition tagger and a parser to
exploit the semantic information within a sentence.
Some studies have used semantic roles in answer extraction
modules, but they have all been employed as a support for delivering
a meaningful answer [16]. In L2S, we explicitly create a mapping
between the semantic information of the question and the so-called
semantic information of database elements. The semantic information is
evaluated based on two concepts: name and attachment.
First, we define name as a set of labels for word classes, similar to
the definition of a named entity. The set of names is established
by manually adjusting the tags produced by the named entity tagger.
Meanwhile, the nature of an element in the database is defined by
the object that contains it. If the element is a value, then the
corresponding object is its column, which is called an attribute. This
is reasonable when all persons' names are in the column "person"
(or its synonyms). An important factor is that the lexical properties
of names are distributed in every language, making this feature
independent of the domain shift.
Second, the attachment relationship between two words is based
on the dependency tree. Two words are marked as attached if and
only if they are either connected or are children of the same node in
the dependency tree. In terms of the database, two elements are considered
attached when they are a value and an attribute in the same column, or
an attribute and a relation in the same scheme.
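The word-side attachment test above can be sketched directly, assuming the dependency tree is given as a token-to-head mapping; the example tree and tokens are illustrative:

```python
def attached(w1, w2, head):
    """Attachment test over a dependency tree given as a mapping
    from each token to its head token: two tokens are attached if
    one governs the other or they are children of the same node."""
    return (head.get(w1) == w2 or head.get(w2) == w1
            or (head.get(w1) is not None
                and head.get(w1) == head.get(w2)))

# Illustrative (partial) tree for "How many marks did
# Hoang Viet receive?"
head = {"marks": "received", "Hoang Viet": "received",
        "received": None, "many": None}
```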
Since the lexical property alone is against the definition of
context, if there is one entity playing the role of a subject, its object
and their relationship must be identified. On the contrary, if the entity
plays the role of an object, it is important to know which is the
subject.
Algorithm 1 illustrates our method to leverage the semantic
information. Taking W (words in the question) and E (elements in the
database) obtained from the preprocessing module,
Algorithm 1 generates a list of paired elements P.
Figure 1: Semantic processing for "How many marks did Hoang Viet, who has birthday (on) 09-08-1994, receive?" (For each token the figure shows its type, word order, semantic link and synonym: "Hoàng_Việt" is tagged Person and linked to the synonym Candidate; "ngày_sinh" (birthday) is tagged Date; "bao_nhiêu" (how many) is tagged Question Word.)
Consider the question "Hoàng Việt có ngày sinh 09-08-1994 được bao nhiêu
điểm?" (How many marks did Hoàng Việt, born 09-08-1994, receive?). Tokens "Hoàng Việt", "09-08-1994" and "bao nhiêu"
have the recognizable tags Person, Date and QuestionWord respectively.
Algorithm 1: Semantic mapping
Input: finite sets W = {w1, w2, ..., wn} and E
Output: list of SQL elements P
1  P ← ∅
2  for i ← 1 to n do
4    Ts ← synonyms of tag_i
6    if T = Ø then
10     e_i ← f_s(w_i, S)
13     P ← (w_i, e_i, T)
15 return P
From the tag Person, the synonym Candidate is retrieved to make a
mapping pair. Figure 1 shows the semantic processing for this example.
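The name-to-element step above can be sketched as follows; the tag names and the synonym dictionary are illustrative stand-ins for the interface built by the Lexicon component:

```python
# Illustrative synonym dictionary: name tag -> database element.
SYNONYMS = {"Person": "candidate", "Date": "birthday"}

def semantic_map(tagged_tokens):
    """tagged_tokens: (token, name_tag) pairs.  Returns the list P
    of (token, database element, tag) triples for tokens whose tag
    resolves through the synonym dictionary; unresolved tokens are
    left for the graph-based matching phase."""
    pairs = []
    for word, tag in tagged_tokens:
        element = SYNONYMS.get(tag)
        if element is not None:
            pairs.append((word, element, tag))
    return pairs
```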
B. Processing queries based on the graph-based model
The use of graph-based algorithms in computational linguistics can be found in statistical learning [1], question answering systems [21], clustering systems [26] and so on. The algorithms rely on a similarity graph consisting of nodes representing linguistic features.
By resolving the graph, for example finding a minimum spanning tree, minimum circle or maximum matching, the result is transformed back to the original problem.
Our goal is to exploit a graph-based method for improving the consistency in mapping the sentence tokens to the database elements.
Intuitively, a set W of tokens should be paired with the equivalent
elements in the database. This means that the method has the ability
to find the correct pair even when one token could be similar to a number
of database elements.
We follow the idea in [21] to resolve the problem of question answering based on a bipartite graph [3]. L2S constructs a bipartite
graph G(V, C) in which the vertex set is V = W + E, and the edge set C is formed by all mappings between W and E given by the string similarity
method for name matching [4]. The constraint in this graph is the
attachment in the semantic information. Therefore, the QA task is now
converted into the problem of finding the maximum matching in G.
To solve the problem in G, a directed graph G'(V', C') is built.
V' contains all the nodes of V along with a source node s and a sink
node t. The capacity of all nodes in V' is 1, except for s and t, whose capacities are given by the number of nodes which could be linked. C' is the set
of directed edges from s through W and E to t. Each edge is given
unit capacity 1 by default.
Algorithm 2 represents our method for solving the problem
on the directed graph G'. Let f be an integral flow of G' of value k; it is straightforward to conclude that the set of edges carrying flow in f forms a matching of size k for the graph G. Therefore, the solution of the maximum flow in G' gives us a maximum matching in G. Finally,
Algorithm 2: Maximum bipartite matching: Ford-Fulkerson
Input: Graph G', source node S, sink node T
Output: A flow f from S to T which is maximum
1  f(x, y) ← 0 for all edges (x, y)
2  while ∃ path p from S to T in G_f with capacity c_f(w, e) > 0 for all (w, e) ∈ p do
3    c_f(p) ← min{c_f(w, e) : (w, e) ∈ p}
4    for (w, e) ∈ p do
5      f(w, e) ← f(w, e) + c_f(p)   (send flow along the path)
6      f(e, w) ← f(e, w) − c_f(p)   (update the backward edge)
7  return f
the maximum flow goes through all the nodes which are expected to be
in the SQL query.
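Since every word and element node has unit capacity, the Ford-Fulkerson computation reduces to the standard augmenting-path search for bipartite matching. A compact sketch, with illustrative edge sets:

```python
def max_bipartite_matching(words, edges):
    """Ford-Fulkerson specialised to unit capacities: repeatedly
    search for an augmenting path from an unmatched word to an
    unmatched element.  `edges` maps each word to the database
    elements it is similar to (the edge set C of graph G)."""
    match = {}  # element -> word currently assigned to it

    def augment(w, seen):
        for e in edges.get(w, []):
            if e not in seen:
                seen.add(e)
                # e is free, or its current word can be re-matched
                if e not in match or augment(match[e], seen):
                    match[e] = w
                    return True
        return False

    for w in words:
        augment(w, set())
    return {w: e for e, w in match.items()}

# Illustrative word/element similarity edges
edges = {"Hoang Viet": ["name"],
         "09-08-1994": ["birthday"],
         "marks": ["marks", "name"]}
matching = max_bipartite_matching(list(edges), edges)
```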
V. EXPERIMENT
A. Dataset
In this section, we present an empirical evaluation to assess the
effectiveness of L2S in Vietnamese. We conduct experiments on two
sample datasets with regard to two domains. Each dataset consists of
two parts: testing questions and a database.
The first dataset was taken from the domain of university national
entrance exam marks (UEEM). The database was taken from top
universities in Vietnam. It contains the candidates' results along
with their information, including name, date of birth, hometown and
identification number. In terms of testing questions, 429 questions
were collected from human users. We divide the testing set into two
types: straightforward questions (68) and general questions (361).
Straightforward questions have a simple structure, indicated by the
providers. The question contains a complete meaning, with one
question word and no ambiguous term. In contrast, general questions
are not bounded by the definition of simplicity. They were expressed
in a natural way. They might have more than one question word, or
no question word. Some terms were omitted and some unrelated words
were added, which leads to the problem of ambiguity. However, this
guarantees the perception of natural language. It follows the rule
of the noisy channel: the noisy channel makes what people say different
from what they think.
The second dataset is in the domain of Geography (GEO). By
translating the set of Geoqueries880 [27] questions into Vietnamese,
we collected an original set of 880 questions2. All proper American
names were substituted by corresponding names in Vietnamese.
We filtered out similar questions to keep 498 distinct questions for
testing. We employed three translators working separately. They then
discussed to generate one final set of testing questions.
Both datasets were collected carefully. We not only make the
testing experiment for our approach, but also create two standard
testing datasets for QA in the Vietnamese language. From the domain, we
manually created a mapping table between the terminology of columns
and the possible expressions in questions. The table was created based
on common knowledge of each domain. We keep the same
table for the two experiments.
To measure the performance of L2S, we build a baseline with the
graph-based approach, which treats the database as a dictionary [21].
Without a technique for solving ambiguity, the baseline ignores all
questions which are intractable: questions with ambiguous words
2http://www.cs.utexas.edu/users/ml/geo.html
or misplaced attachment features. The baseline directly transforms the question and database elements into a graph, then extracts the query from the result of the maximum bipartite matching algorithm. By comparing L2S with the baseline, we evaluate the effectiveness
of our approach with semantic information, that is, the difference between our hybrid approach and a standalone approach.
If a statement is processed, an answer is output from L2S or
the baseline. The answer is correct if its query derives the exact
information that has been asked. One question may accept several queries, but it has a unique result; the queries are evaluated by human annotators. The annotators perform two actions. First, they execute the query. Then, if the query is successfully executed, the result is compared with the precise answer provided by experts. The answer
is correct if and only if the query is executable and the obtained
result and the precise answer are identical. If the obtained result is
different from the answer, it is incorrect. If the query is not delivered
or is unexecutable, the answer is invalid.
B. Experiment 1
We first analyze the testing sets (68 straightforward and 361
general). Most questions belong to three main categories: entity (asking about a specific subject/object), number (asking for the quantity/ranking of
a group of subjects) and ratio (asking for a proportion).
Table I shows the difference between the two testing sets of UEEM. While the majority of straightforward questions are entity questions (70.59%), general questions are divided more evenly. Notably, 9.14% of all the general questions are statement sentences, which do not contain a question word. This would lead to errors with hand-written patterns designed to capture the question structure.
The database of UEEM contains three types of information:
relation (names of the universities), attributes ("identification number",
"name", "marks" and titles of other information) and values (the value of each instance in the database). It is common that two candidates receive the same mark in [0-10]. Therefore, one value may belong
to different instances, posing high ambiguity in this database.
To guarantee a sound input, we implemented the preprocessing module to enhance the performance on the QA task. The source code and all updates of the completed system have been published online3.
Table II illustrates the precision, recall, F-measure and accuracy
of the two testing sets. Precision is the percentage of correct answers
in the total answers. Recall is the percentage of correct answers in the set
of correct answers and no-answers. Accuracy is the percentage of correct answers in the total questions. L2S answers all questions, while
the baseline only answers tractable questions. Therefore, the recall
of L2S is always 100%.
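The measures as defined above can be sketched directly. Note that a system answering every question (no-answer count zero) gets recall 1.0, which matches the remark about L2S; the example counts below are illustrative:

```python
def qa_metrics(correct, answered, total):
    """Measures as defined in the text: precision over delivered
    answers, recall over correct answers plus unanswered
    questions, accuracy over all questions."""
    no_answer = total - answered
    precision = correct / answered
    recall = correct / (correct + no_answer)
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = correct / total
    return precision, recall, f_measure, accuracy
```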
Table II: Experiment with UEEM dataset
Sim test Precision Recall F-measure Accuracy
Gen test Precision Recall F-measure Accuracy
The precision of the baseline remains stable at around 76%. However,
it refuses to answer intractable questions, which were abundant in the
general set. Therefore, its recall and F-measure drop considerably.
3http://sourceforge.net/projects/l2s/
Table I: Simple and general test sets of the UEEM dataset
Testing sets Total Entity question Number question Ratio question Non-question
500 highest IDF words Named Entity Numeric Proper noun Question words Other types
L2S achieves a high result with an accuracy of 91.13% in the
general test, compared to 21.89% for the baseline. There are three
main reasons for this distinction.
• Linguistic tool failures, including the tokenizer and parser. A
minor incorrect position in the parser output leads to an intractable
question; this is resolved by using semantic information in
L2S.
• L2S recognizes comparing words like "lớn hơn" (greater) and "nhỏ
hơn" (smaller) as named entities. The baseline could
not find an equivalent element for them in the database and treats
them as intractable.
• General questions contain unknown words, either stop words
or words not in the database.
However, some mistakes in the tokenization and question word
processing lead to incorrect answers in both L2S and the baseline.
C. Experiment 2
A brief analysis of the second dataset showed that the proportion
of named entities among the 500 words with the highest IDF values is
81.3%. Similar to the UEEM dataset, named entities in the second
experiment are mostly proper names of locations, rivers and mountains,
question words and comparing words. They play an important role
in the question answering task.
The word frequency in this dataset is not as high as in the first
dataset. Given a random element, such as the name of a river, the height of
a mountain or the population of a province, there is no boundary on its
value. The number of elements which share the same value is
smaller than in the first database. In other words, this database is less
ambiguous than the first one.
Nevertheless, this dataset provides a new challenge. In the
database, 35.6% of the values (including mountains, lakes and rivers) have
strange names. A strange word originates from a local language,
such as "T'nưng" (lake), "phan xi pan" (mountain) and "Xi giơ Pao"
(mountain). These are localized names, leading to failures
of linguistic tools. In our prior analysis, none of the available Vietnamese
linguistic tools can handle these names.
Table III lists the results of the baseline and L2S on the testing set. Since
we are interested in the performance with regard to a new domain, we
evaluate the F-measure of both systems.
Table III: Experiment with Geoquery dataset
Gen test Precision Recall F-measure Accuracy
The baseline results show that, for geographical questions and the
baseline system, mis-matching queries are less common. This
makes the precision of the baseline higher, at 83.3%. This is not surprising
if we remember that the UEEM database was more ambiguous. The baseline results are interesting because they indicate that the graph-based method is somewhat effective: the less ambiguous the domain
is, the better it performs. However, it ignores the majority of
questions due to the intractability feature. This method alone cannot be
used for an actual system.
Next we measure the performance of L2S in the new domain. Because the configurations of L2S in the two experiments are mostly the same, we compare the results and put forward our evaluation. The precision and F-measure of L2S in the second experiment are lower than in the first one (general test set and simple test set). The main reason is the failure of tokenization and named entity recognition.
Like all other linguistic systems, L2S crucially requires the tokenizer to work precisely.
Overall, the performance of L2S is promising. L2S has proved its robustness on two sample datasets. With the same linguistic tools for
an under-resourced language like Vietnamese, L2S requires neither annotated training data nor a set of hand-crafted rules. Comparing the result with the first experiment, we can say that L2S has demonstrated its effectiveness in dealing with a different domain. This proves the effectiveness of our hybrid approach, combining semantic information and a graph-based model.
D. Discussion
The experiments were conducted on two different datasets with
no available annotated training set or hand-crafted rules. Human intervention was minimized to the common knowledge of the domains. The results strengthen our hypothesis that it is viable to build
a reliable question answering system in under-resourced languages. The two experiments were conducted with the same system configuration. No extensive observation of the domain for hand-written rules
or annotated features is required. The whole workload is significantly smaller than the time and effort spent on writing structured rules (in a typical rule-based system).
We sampled incorrect results from all experiments. Each incorrect answer was analyzed to find out the main cause.
Table IV: The main causes of errors (Baseline / L2S)
Linguistic tools failure 38.23% 4.86%
Question words ambiguity 4.15% 4.15%
Table IV shows a significant reduction in the main causes of error. Problems that the baseline refuses to answer were resolved by L2S.
On the one hand, L2S leverages the precise output of one linguistic tool to overcome the failure of others. For example, consider the question
"Thí sinh nào có ngày sinh là 06-03-1990?" (Which candidate has the birthday 06-03-1990?). The dependency parser failed to identify the relation between "ngày sinh" and "06-03-1990". However, L2S detects that the word "06-03-1990" has the
name tag "date", which is mapped to a synonym of "ngày sinh".
Even if the word "ngày sinh" is removed from the question, L2S still
retrieves the appropriate pair for "06-03-1990" and successfully delivers
the answer. We are developing a method to overcome the limitation
of question classification based on lexical tags.
VI. CONCLUSION
In this paper, we presented our hybrid approach for developing QA
systems in specialized domains. L2S contains one novel method of
semantic processing and one follow-up method of graph-based
processing. It overcomes the weaknesses of current approaches with
regard to the lack of training data, domain observation and weak
underlying linguistic tools.
L2S exploits information from underlying tools to select reliable
semantic information. Then, the graph-based processing handles
the remaining tokens and entities between the two languages (natural
language and SQL). By combining the two approaches, L2S
alleviates the heavy dependence on linguistic applications and domain
knowledge.
The experiments indicate the effectiveness of the hybrid method. The
first experiment measured to what extent the presented approaches
are useful for answering straightforward questions and tricky questions.
The second experiment demonstrated the robustness of L2S across
different domains. Results show that L2S maintains its accuracy
over different domains and requires a small workload to switch
domains.
ACKNOWLEDGMENT
This work is supported by the Nafosted project 102.01-2014.22.
REFERENCES
[1] Alexandrescu, A., Kirchhoff, K., 2009. Graph-based Learning for Statistical Machine Translation, in: Proc. of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 119–127.
[2] Androutsopoulos, I., 1995. Natural Language Interfaces to Databases - an Introduction. Journal of Natural Language Engineering 1, 29–81.
[3] Asratian, A.S., Denley, T.M.J., Häggkvist, R., 1998. Bipartite Graphs and Their Applications. Cambridge University Press, New York, NY, USA.
[4] Cohen, W.W., Ravikumar, P., Fienberg, S.E., 2003. A Comparison of String Distance Metrics for Name-Matching Tasks, in: Proc. of IJCAI-03 Workshop on Information Integration, pp. 73–78.
[5] Costa, P., Almeida, J., Pires, L., van Sinderen, M., 2008. Evaluation of a Rule-Based Approach for Context-Aware Services, pp. 1–5.
[6] Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V., 2002. GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications, in: Proc. of ACL, pp. 168–175.
[7] Danica Damljanovic, V.T., Bontcheva, K., 2008 A Text-based Query
Interface to OWL Ontologies, in: Proc of the Sixth International
Con-ference on Language Resources and Evaluation, Marrakech, Morocco
[8] Dat Quoc Nguyen, Dai Quoc Nguyen, S.B.P., Pham, D.D., 2011 Ripple
down rules for part-of-speech tagging, in: Proc of the 12th International
Conference on CL and Intelligent Text Processing, pp 190–201
[9] Demner-Fushman, D., Lin, J., 2007 Answering Clinical Questions
with Knowledge-Based and Statistical Techniques Comput Linguist
33, MIT Press, Cambridge, MA, USA pp 63–103
[10] Dien, D., Kiem, H., 2003 POS-Tagger for English-Vietnamese
Bilin-gual Corpus, in: Proc of the HLT-NAACL 2003 Workshop on Building
and using parallel texts Association for Computational Linguistics,
Stroudsburg, PA, USA pp 88–95
2008 Natural Language Database Interface for the Community Based Monitoring System, in: Proc of the 22nd Pacific Asia Conference
on Language, Information and Computation, De La Salle University (DLSU), Manila, Philippines pp 384–390
[12] Giordani, A., 2008 Mapping Natural Language into SQL in a NLIDB, in: Proc of the 13th international conference on Natural Language and Information Systems, Springer-Verlag, Berlin, Heidelberg pp 367–371 [13] Giordani, A., Moschitti, A., 2010 Semantic Mapping between Natural Language Questions and SQL Queries via Syntactic Pairing, in: Proc of the 14th International Conference on Applications of Natural Language
to Information Systems, Springer-Verlag, Berlin, Heidelberg pp 207– 221
[14] Kusumoto, T., Akiba, T., 2012 Statistical Machine Translation without Source-side Parallel Corpus Using Word Lattice and Phrase Extension, in: Proc of the Eight International Conference on Language Resources and Evaluation (LREC12), European Language Resources Association (ELRA), Istanbul, Turkey
[15] Lopez, V., Uren, V., Motta, E., Pasin, M., 2007 Aqualog: An ontology-driven question answering system for organizational semantic intranets Web Semant 5, 72–105
[16] Moreda, P., Llorens, H., Saquete, E., Palomar, M., 2011 Combining semantic information in question answering systems Inf Process Manage 47, 870–885
[17] Nguyen, Dat Tien, Hoang, Duc Tam and Pham, Son Bao, 2012
A Vietnamese Natural Language Interface to Database, Sixth IEEE International Conference on Semantic Computing, ICSC, Palermo, Italy 130–133
[18] Nguyen, D.B., Hoang, S.H., Pham, S.B., Nguyen, T.P., 2010 Named Entity Recognition for Vietnamese, in: Proc of the Second international conference on Intelligent information and database systems: Part II, Springer-Verlag, Berlin, Heidelberg pp 205–214
[19] Nguyen, D.Q., Nguyen, D.Q., Pham, S.B., 2009a A Vietnamese Question Answering System, in: Proc of KSE, pp 26–32
[20] Nguyen, P.T., Vu, X.L., Nguyen, T.M.H., Nguyen, V.H., Le, H.P., 2009b Building a large syntactically-annotated corpus of Vietnamese, in: Proc of the Third Linguistic Annotation Workshop, Association for Computational Linguistics, Stroudsburg, PA, USA pp 182–185 [21] Popescu, A.M., Etzioni, O., Kautz, H., 2003 Towards a theory of nat-ural language interfaces to databases, in: Proc of the 8th International Conference on Intelligent User Interfaces, ACM, New York, NY, USA
pp 149–157
[22] Saxena, A.K., Sambhu, G.V., Subramaniam, L.V., Kaushik, S., 2007 IITD-IBMIRL System for Question Answering using Pattern Match-ing, Semantic Type and Semantic Category Recognition, in: Proc of The Sixteenth Text REtrieval Conference, 2007, Gaithersburg, Mary-land, USA
[23] Sneiders, E., 2002 Automated Question Answering Using Question Templates That Cover the Conceptual Model of the Database, in: Andersson, B., Bergholtz, M., Johannesson, P (Eds.), Natural Language Processing and Information Systems, Springer Berlin Heidelberg pp 235–239
[24] Suzuki, J., Sasaki, Y., Maeda, E., 2002 SVM Answer Selection for Open-Domain Question Answering, in: Proc of the 19th International Conference on Computational Linguistics, , Stroudsburg, PA, USA pp 1–7
[25] Waltz, D.L., 1978 An English Language Question Answering System for a Large Relational Database Commun ACM 21, pp 526–539 [26] Wieling, M., Nerbonne, J., 2010 Hierarchical spectral partitioning of bipartite graphs to cluster dialects and identify distinguishing features, in: Association for Computational Linguistics, Stroudsburg, PA, USA
pp 33–41
[27] Wong, Yuk Wah, Mooney, Raymond, 2007 Learning Synchronous Grammars for Semantic Parsing with Lambda Calculus, Association for Computational Linguistics, Prague, Czech Republic pp 960–967