Integrating Natural Language Processing and Web GIS for Interactive Knowledge Domain Visualization



FOR INTERACTIVE KNOWLEDGE DOMAIN VISUALIZATION

of the Requirements for the Degree

Master of Science in Geography


DEDICATION

To my parents and my family


ABSTRACT OF THE THESIS

Integrating Natural Language Processing and Web GIS for Interactive Knowledge Domain Visualization

by

Fangming Du

Master of Science in Geography with a Concentration in Geographic Information Science

San Diego State University, 2014

Recent years have seen a powerful shift towards data-rich environments throughout society. This has extended to a change in how the artifacts and products of scientific knowledge production can be analyzed and understood. Bottom-up approaches are on the rise that combine access to huge amounts of academic publications with advanced computer graphics and data processing tools, including natural language processing. Knowledge domain visualization is one of those multi-technology approaches, with its aim of turning domain-specific human knowledge into highly visual representations in order to better understand the structure and evolution of domain knowledge. For example, network visualizations built from co-author relations contained in academic publications can provide insight on how scholars collaborate with each other in one or multiple domains, and visualizations built from the text content of articles can help us understand the topical structure of knowledge domains.

These knowledge domain visualizations need to support interactive viewing and exploration by users. Such spatialization efforts are increasingly looking to geography and GIS as a source of metaphors and practical technology solutions, even when non-georeferenced information is managed, analyzed, and visualized. When it comes to deploying spatialized representations online, web mapping and web GIS can provide practical technology solutions for interactive viewing of knowledge domain visualizations, from panning and zooming to the overlay of additional information.

This thesis presents a novel combination of advanced natural language processing, in the form of topic modeling, with dimensionality reduction through self-organizing maps and the deployment of web mapping/GIS technology towards intuitive, GIS-like exploration of a knowledge domain visualization. A complete workflow is proposed and implemented that processes any corpus of input text documents into a map form and leverages a web application framework to let users explore knowledge domain maps interactively. This workflow is implemented and demonstrated for a data set of more than 66,000 conference abstracts.


TABLE OF CONTENTS

ABSTRACT
LIST OF TABLES
LIST OF FIGURES
ACKNOWLEDGEMENTS
CHAPTER
1 INTRODUCTION
Problem Statement
Objectives and Intellectual Merit
2 LITERATURE REVIEW
Knowledge Domain Visualization
Web GIS
Spatialization
Topic Modeling
3 RESEARCH DESIGN
Functionality Design
Spatial Concepts
Real World
Semantic World
From Concepts to Functionality
Workflow Design
Web GIS Application Design
4 IMPLEMENTATION
Workflow
Text Processing Workflow
Data Preprocessing
LDA Topic Modeling
SOM Training and Clustering
Programming Environment
GIS Processing Workflow
Integrating Workflow with Web GIS
Web GIS Implementation Framework
Web Inferencing Services
Mapping and Geoprocessing Services
Web User Interface
Evaluation of Performance
5 CONCLUSION
Results Summary
Limitations and Future Studies
REFERENCES
A FILTERED STOP TOPICS


LIST OF TABLES

Table 1 Semantic Generalization (Fabrikant and Skupin 2005)

Table 2 Functions for Non-geographic Information Visualization in GIS

Table 3 Dataset Format

Table 4 LDA Topic Model Training Output Files

Table 5 Input and Output Data for Inferencing Services

Table 6 Filtered Out Stop Topics, Each Stop Topic Consists of Several Topic Phrases


LIST OF FIGURES

Figure 1 Google Maps technology deployed for knowledge domain visualization

Figure 2 Perplexity evaluations of different computational language models (Blei 2003)

Figure 3 Data processing workflow

Figure 4 Web GIS application framework

Figure 5 NLP services

Figure 6 XML Processing for PDF Format Data

Figure 7 XML Schema

Figure 8 Data Content Preprocessing

Figure 9 Perplexity Computation Using Mallet

Figure 10 Perplexity Graph for Our Model

Figure 11 Trained SOM represented as Shape File. Panels (a) and (b) show the SOM neurons as hexagons at different zoom levels. Panels (c) and (d) contain renderings of component planes, i.e., the distribution of the weights of one particular attribute across the two-dimensional neuron geometry

Figure 12 GIS Processing Workflow

Figure 13 SOM Polygon Dissolve and Labeling. (a) represents the dissolved polygons from SOM neurons (Figure 11 (a)); (b) adds labels to cluster polygons

Figure 14 Data and Process Flow in Web GIS

Figure 15 Inferencing Services for Projection Functionality

Figure 16 User Interface Components

Figure 17 Projection as Point

Figure 18 Projection as Overlay Map Layer

Figure 19 Time Consumption in Inferencing Service with Three Test Groups of Data

Figure 20 Time Consumption in Geoprocessing Service


ACKNOWLEDGEMENTS

I would like to thank the members of my thesis committee, Dr. Skupin, Dr. Tsou, and Dr. Eckberg, for their help, support, and interest in my thesis work. In particular, I would like to thank Dr. Skupin for serving as my major professor and graduate advisor. This thesis would not have been possible without his great amount of support. I am very grateful to him for giving me invaluable advice and continuous guidance on my research project. In addition, I would like to acknowledge and thank David Mckinsey and Marcus Chiu for their great technical support. I would also like to thank Raymond Lee for his useful advice and help in my thesis writing. I extend my gratitude to my colleagues and friends Jay Yang, Shuang Yang, and Marilyn Stowell for their support. Finally, I would like to address a special thank you to my parents for the love and care they have continually given me all these years.


CHAPTER 1

INTRODUCTION

Visualization is the process of making a phenomenon visible or enabling the forming of a mental image of it. Through different visualization products, human beings are able to see and thus understand abstract information more efficiently. For example, on a subway map, people can actually see the whole transportation system and understand how to transfer between different lines to get to a destination.

Information visualization is the use of computer-supported, interactive, visual representations of abstract data to amplify cognition (Card & Shneiderman 1999). With more and more information available online through computers and the Internet, it has become much more difficult to understand such large volumes of information or even to produce any form of visualization from them. With computational algorithms, information visualization can represent huge amounts of information visually so that people can better understand and explore them to create new knowledge (Card, Mackinlay & Shneiderman 1999). Science is rapidly developing in different disciplines every year with new publications; it has become almost impossible to understand the whole structure of science or even one knowledge domain within it. Techniques and theories from information visualization are therefore utilized for the visualization of knowledge (Börner, Chen & Boyack 2003; Börner 2010). This particular type of knowledge represents the opinions, values, and perspectives of scientific disciplines, which are communicated in scientific journals and articles. Its visualization can give an overview of a whole discipline and its development from the past to the future, and thus guide professional groups in more fruitful directions (Börner, Chen & Boyack 2003; Boyack, Klavans & Börner 2005).

On the visualization side, cartography has theories and practices dealing with the visualization of geographic information, and spatial metaphors have been used in information visualization to utilize humans’ spatial cognition. Spatialization, which has emerged as a new research frontier in the recent decade, studies how to display high-dimensional data in lower-dimensional space. It integrates computational algorithms for dimension reduction with spatial concepts and cartographic principles that help design the lower-dimensional display space. Spatialization is applicable to knowledge domain visualization and has the potential to integrate more cartographic approaches (Skupin, Biberstine & Börner, 2013). However, interaction, one of the most important aspects of information visualization, cannot be achieved with traditional static cartographic principles in spatialization for knowledge domain visualization (Skupin, Biberstine & Börner, 2013). Although some relatively simple online mapping technologies have been used for non-geographic knowledge domain visualization, such as Google Maps (Fig. 1), these tend to provide only very limited user interaction and functionality.

Figure 1 Google Maps technology deployed for knowledge domain visualization

Meanwhile, more advanced web GIS solutions are now widely used to provide interactive web mapping applications, but have traditionally focused solely on geographically referenced data. This study will investigate whether and to which degree web GIS technology can be employed in interactive knowledge domain visualization and how geographic concepts and text mining techniques can be usefully combined in the process.

PROBLEM STATEMENT

Skupin (2002) discusses the creation of a base map using VSM (vector space model) and SOM. In that spatialization approach, the VSM consists of vectors containing term counts for each document. This is the high-dimensional model that then undergoes dimensionality reduction using SOM. However, there are certain drawbacks to this use of traditional VSM:

1. Scalability. Large document collections will result in vectors whose high dimensionality may make SOM training more difficult;

2. Sparseness. Vectors in the VSM tend to be very sparse, since any particular document vector will record a count of zero for most terms;

3. Term order. The order in which terms appear in the documents is lost in the vector representation, at least when using unigram counts. While use of multi-part n-grams would be possible, that can increase the already high dimensionality of the VSM even further;

4. Semantic sensitivity. Documents with related content, but differences in actual vocabulary (e.g., synonyms), may not display sufficiently strong similarity;

5. Stemming effects. Though stemming of the original terms will lower the model dimensionality, it may result in "false positive matches" for stems that originate from terms with significantly different meaning.
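The sparseness and term-order drawbacks listed above can be seen in a few lines of code. The following is only a minimal sketch of unigram term-count vectors, not the implementation used in the thesis; the function names and the toy documents are invented for illustration.

```python
from collections import Counter

def term_count_vectors(documents):
    """Build unigram term-count vectors over a shared, sorted vocabulary."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors

def sparsity(vectors):
    """Fraction of zero entries across all document vectors."""
    total = sum(len(vector) for vector in vectors)
    zeros = sum(1 for vector in vectors for value in vector if value == 0)
    return zeros / total

# Toy documents, invented for illustration only.
docs = [
    "urban growth and land use change",
    "glacier retreat in alpine regions",
    "land use policy in urban regions",
]
vocabulary, vectors = term_count_vectors(docs)
print(len(vocabulary), round(sparsity(vectors), 3))

# Term order is lost: both word orders give the same count vector.
_, reordered = term_count_vectors(["land use", "use land"])
print(reordered[0] == reordered[1])
```

Even with three short documents, more than half of the vector entries are zero, and the share of zeros grows as the vocabulary grows with the size of the collection.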

One key goal of this thesis is to explore the feasibility of replacing the VSM approach with a topic model approach, prior to SOM training. Topic models, specifically latent Dirichlet allocation (LDA), treat each document as a mixture of topics derived from a collection of documents (Blei, 2003).

Another problem is the lack of a comprehensive workflow for the creation of base maps from text documents, as opposed to processing steps occurring in several, relatively separate, segments (Skupin 2002, 2004), which makes it difficult to replicate the process for new document collections. The combination of an existing Java library for topic modeling and a newly developed Java library for SOM training creates the possibility of a seamless processing workflow for the creation of base maps.

Finally, current knowledge domain visualizations do not provide sufficiently high degrees of interaction to allow exploratory visualization by users. Instead, most of the more intricate knowledge domain visualizations are image- or paper-based, with graphic zooming as the only interactive operation supported. The technological base of web GIS should provide a basis for more advanced interaction, since it is founded on a mature theoretical and practical framework for managing, analyzing, and displaying geographic data online.

In addressing these various problems of contemporary knowledge domain visualization, the following research questions are pursued in this thesis:

1. What are some of the fundamental spatial concepts in GIS that may be of potential use for knowledge domain visualization? How could these be used?

This question will identify certain fundamental concepts in GIS and apply them to the visualization of a high-dimensional space in which the knowledge domain is represented. For example, the overlay operation can be used in GIS to project any point/line/area geographic features onto a base map based on their geographic coordinates, and we want to know whether and how this concept and GIS technique is applicable to knowledge domain visualization.

2. How can one develop a domain base map from a large document corpus based on NLP and dimensionality reduction techniques?

While the classic vector space model (VSM), in conjunction with the self-organizing map (SOM) method, has been successfully used for domain mapping, the more advanced NLP approach of latent Dirichlet allocation (LDA) has been speculated to have advantages over a classic VSM approach, both in computational performance and in terms of how meaningful the resulting high-dimensional space is. This study will investigate how the LDA topic model can be adapted and combined with SOM dimensionality reduction towards the creation of detailed domain base maps.

3. How and to what degree can web GIS technology be utilized for interactive knowledge domain visualization?

This question is intended to identify, adapt, and implement specific functions in a web GIS environment, such that the spatial concepts of interest (see question 1) can be operationalized in the context of the domain base map (see question 2). To that end, a prototype web application will be implemented that combines web GIS technology with live operations on a high-dimensional knowledge space and its two-dimensional projection.

OBJECTIVES AND INTELLECTUAL MERIT

The overall objective of this research is to create an integrated workflow and framework that utilize LDA topic modeling, SOM dimensionality reduction, and web GIS to create interactive knowledge domain visualizations from any large, domain-specific text corpus. The following specific objectives are pursued:

a) Java program modules are generated that can preprocess a text corpus, iteratively create an LDA topic model, and perform SOM training in the same programming environment.

b) GIS-based modules are created that transform the output of the LDA/SOM process into data structures compatible with GIS software, such that the base map can be represented in GIS.

c) The trained model and base map are the content drivers for web mapping and web processing services that provide both interactive online domain mapping and live NLP inference.

The intellectual merit of this research rests on a novel, iterative approach to LDA topic modeling and the use of web GIS technology to implement advanced spatial operators for interactive high-dimensional visualization and inference.

Compared to traditional VSM, the LDA topic model is meant to result in a lower-dimensional representation that is computationally more efficient and also a potentially more meaningful representation of the document corpus (Skupin, Biberstine & Börner, 2013). This is one of the first studies to explore this combination of LDA topic modeling with SOM and the first study to create a detailed knowledge domain base map through this process. Meanwhile, the technological solutions and workflows proposed, developed, and documented in this study will serve as a template for future visualizations of other knowledge domains.

Web GIS technology has become very popular and widely adopted during the previous decade. From simple web mapping, as found in Google Maps, it has extended to web-based geoprocessing services providing much of the functionality found in stand-alone GIS software. However, its underlying spatial concepts and analytical capabilities have typically only been applied to geographically referenced information. This study represents the first practical exploration of an extensive set of spatial concepts in a high-dimensional framework and its operationalization for large-scale knowledge domain visualization in a web GIS environment.


CHAPTER 2

LITERATURE REVIEW

As discussed in the previous chapter, this research studies the combination of the LDA topic model, SOM, and Web GIS for interactive knowledge domain visualization. This review discusses knowledge domain visualization in the first section and Web GIS in the second section. The third section reviews the spatialization method, namely the metaphor of spatial concepts used in information visualization. Finally, this chapter ends with a discussion of topic modeling.

KNOWLEDGE DOMAIN VISUALIZATION

Knowledge domain visualization aims at the interactive visual representation of knowledge domains (Börner, Chen & Boyack 2003). Knowledge domains can be considered as abstract spaces within which different knowledge objects can be represented. A specific discipline, such as medical science or geography, can be defined as one knowledge domain that is made up of the scientific journals, articles, and professional groups in that discipline.

Knowledge domain visualization is not a new field. Price (1965) introduced a method to look into the development of science by analyzing scientific papers. He examined the changes in references and citations of scientific papers over the years using statistical analysis and found that changes in citations can indicate how scientific fields grow.

With the development of computer science and GIS, knowledge domain visualization has expanded the older citation-based analysis into a new research field, utilizing all kinds of data and visualization techniques to reveal the development of scientific knowledge (Börner, Chen & Boyack 2003). Börner, Chen and Boyack (2003) introduced a general framework for knowledge domain visualization and identified two fundamental problems. One is the need to project high-dimensional data to a two-dimensional display space; the other is the conflict between large amounts of data and limited space and resolution.

Besides the visual representation of knowledge domains, interaction, as one of the most important elements in information visualization, also emerges as an important issue in knowledge domain visualization. Shiffrin and Börner (2004) mentioned that without interaction the visualization of knowledge domains will be of little use.

WEB GIS

In traditional cartography and GIS, theories and practices have been applied to the static representation of geographic information as maps, including projection, generalization, and map design (Dent 1999; Robinson et al. 1995). After the invention and evolutionary growth of the Internet, several changes have taken place in cartography, especially for the dissemination of and interactive access to geographic information (Kraak & Brown, 2001).

The widespread accessibility of the Internet around the world makes sharing and providing geographic information and services easier and more powerful. Smith & Frew (1995) introduced a project then under way, the Alexandria Digital Library, which is one of the earliest distributed systems aimed at providing online services for sharing geographic information. Its functionality included supporting access, queries, storage, and management. Green and Bossomaier (2002) introduced the idea of distributing geographic information systems (GIS) as online GIS services. They proposed a two-tier framework, including a server side and a client side. This was a promising framework that could provide more services, including geographic data sharing and geographic data analysis. Then Tsou (2004) introduced a Web-GIS architecture with three levels of geographic information services: data archive, information display, and spatial analysis. The first level is a web-based data warehouse, which can be seen as an extension of the earlier Alexandria Digital Library. The second- and third-level services build on the first level, providing interactive user services, similar to Green’s two-tier framework. Tsou (2004) also provided a prototype implementation of the three-level architecture.

Interaction is the second big change the web brings to cartography. In traditional cartography and GIS, the contents of maps are static and the quality of maps mostly depends on the data and professional cartographers. However, after the launch of Google Maps and Google Earth in 2005, people have much easier access to geographic information than ever before and can easily find the geographic information they want through dynamic requests. Many other new websites are emerging based on the idea of “Web 2.0” (Haklay, Singleton & Parker 2008). Goodchild (2007) introduced the emerging phenomenon of sharing geographic information on the web, as in Wikimapia (www.wikimapia.org, an online editable map on which anyone can mark or describe any site on the earth) and Flickr (www.flickr.com, where photographs can be uploaded and located on the Earth’s surface by longitude and latitude). Users thus not only have more freedom in how they view geographic data, but also gradually become providers of geographic information. Tsou (2011) identified this new change and tried to redefine web cartography, emphasizing the trend toward user-centered design, user-generated content, and ubiquitous access.

SPATIALIZATION

In geographic information science and cartography, there are several concepts that have been applied to other research fields as spatial metaphors (Kuhn and Blumenthal 1996; Skupin and Buttenfield 1996, 1997), such as location, distance, and scale.

Fabrikant (2000, 2001) used region, distance, and scale as spatial metaphors for the visualization and interactive exploration of digital libraries. Different documents in the library are displayed in a 2-dimensional space. Distance represents the similarity between documents, so similar documents in the library are displayed closer to each other. Scale represents the level of detail in the hierarchy of documents, and region is used to aggregate similar items.

Beyond these spatial metaphors, Skupin (2000, 2002) applied cartographic approaches, such as generalization, feature labeling, and map design, to the visualization of text documents. These cartographic approaches can tackle some issues that spatial metaphors alone cannot, such as dealing with the complexity of a large number of features.

The study of applying these spatial metaphors and cartographic approaches to other types of data visualization, especially for high-dimensional data, has formed a new research frontier over recent years known as spatialization, the “systematic transformation of high dimensional datasets into lower-dimensional, spatial representations for facilitating data exploration and knowledge construction” (Skupin and Fabrikant 2008). It can thus transform large amounts of unstructured and non-georeferenced high-dimensional data into an organized geographic space. Other spatial concepts and techniques can then be applied to this display space to utilize humans’ spatial cognition in understanding the original datasets.

TOPIC MODELING

[...] to be dropped in processing a large amount of data (Skupin, Biberstine & Börner, 2013), and VSM is sensitive to vocabulary.

A topic model is a new type of statistical model for discovering abstract topics from a document corpus. Given that a document is about a particular topic, one would expect the particular words describing that topic to appear in that document more frequently. Latent Dirichlet allocation (LDA) is the most common topic model currently in use. In LDA, a topic is defined as a distribution over a fixed vocabulary and each document is a mixture of topics with different proportions (Blei, Ng & Jordan, 2003). Because a document may have a mixture of topics, it exists in a semantic space whose dimensions are the topics.
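The idea that a document is a mixture of topics, each of which is a distribution over a fixed vocabulary, can be made concrete with a small calculation. The sketch below is illustrative only: the topic names, word probabilities, and document proportions are invented, and it shows only how the two distributions combine, not how LDA estimates them.

```python
def word_probability(word, doc_topic_weights, topic_word_probs):
    """p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)."""
    return sum(
        weight * topic_word_probs[topic].get(word, 0.0)
        for topic, weight in doc_topic_weights.items()
    )

# Hypothetical topics: each one is a distribution over a fixed vocabulary.
topic_word_probs = {
    "climate": {"glacier": 0.5, "temperature": 0.4, "city": 0.1},
    "urban": {"glacier": 0.0, "temperature": 0.2, "city": 0.8},
}

# Hypothetical document: a mixture of topics whose proportions sum to 1.
# These proportions are the document's coordinates in the semantic space.
doc_topic_weights = {"climate": 0.75, "urban": 0.25}

for word in ("glacier", "temperature", "city"):
    print(word, word_probability(word, doc_topic_weights, topic_word_probs))
```

The topic proportions of a document play the role of its coordinates in the semantic space, with one dimension per topic.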


CHAPTER 3

RESEARCH DESIGN

The main goals of this research are to propose a workflow for processing text documents to create knowledge domain maps and to propose a web application framework for interactive exploration of the knowledge domain maps online. To accomplish these goals, functionality, processing workflow, and web GIS application design are introduced in this chapter.

FUNCTIONALITY DESIGN

This section identifies the main conceptual building blocks in GIS and applies some of them to the design of functionalities that are potentially applicable to the visualization of high-dimensional non-geographic information.

Table 1 Semantic Generalization (Fabrikant and Skupin 2005)


Using this framework, we can identify more fundamental concepts in GIS, especially in web GIS, and apply them to an interactive visualization of the knowledge domain of geography. Semantic generalization is applied to the data using the LDA topic model and self-organizing maps (SOM). The visual representation of the 2-dimensional display space is rendered with map symbols and map design principles.

Golledge (1995) presented a set of primitives (identity, location, magnitude, time) as the building blocks of spatial concepts. He then identified three different levels of spatial concepts based on these primitives. First-level concepts are called derived concepts: distance, angle and direction, sequence and order, connection and linkage. The second level is spatial distribution, including boundary, density, dispersion, pattern, and shape. The third level consists of higher-order derived concepts: correlation, overlay, network, hierarchy, and other concepts.

Dibiase et al. (2006) developed a comprehensive body of knowledge for geographic information science and technology. This provides some basic concepts from the science and technology perspective. The main concepts for representation and analysis of geographic information include geometric measures (distance, direction, shape, area, and connectivity), basic analytical operations (buffers, overlays, neighborhoods, and map algebra), elements of geographic information (discrete entities, events and processes, fields in space and time, and integrated models), and domains of geographic information (space, time, relationships between space and time, properties).

Janelle and Goodchild (2011) identified several fundamental spatial concepts from more recent research in geographic information science. They are location, distance, neighborhood and region, networks, overlays, scale, spatial heterogeneity, spatial dependence, and objects and fields. This group of concepts forms the basis for contemporary spatial analysis and visualization.

A comparison of these different approaches to the delineation of spatial concepts yielded the following fundamental concepts in GIS that may be of particular use for the representation of semantic spaces: identity, location, distance, neighborhood and region, connection, scale, time, objects and fields, overlays, buffers, and networks.

From Concepts to Functionality

Based on the spatial concepts just mentioned, we need to derive functions that can be implemented in a web GIS system and are relevant to knowledge domain visualization. However, since a knowledge domain is here defined as existing in a high-dimensional space, some of the functions available in web GIS need to be extended to support high-dimensional approaches. For example, in order to implement a buffer function, one first needs to compute the proximal region in the high-dimensional space and then project it to the 2-dimensional display space. The following table defines the functions in our application.

Table 2 Functions for Non-geographic Information Visualization in GIS

Concept | Function | Description
Proximal Region | High-dimensional buffering | The input geometry/text is buffered by calculating each offset in the semantic space and then represented in the 2-D display space as regions.
Neighbor | Find features within distance | The distance is computed in semantic space between the geometry/text input and other features in semantic space.
Projection | Project as discrete object | The input text is processed by topic models to get its coordinates in semantic space and is then projected to the 2-D display space and represented as point/field.
Projection | Project as continuous field | Different points are interpolated between two input points/texts. The path between the interpolated points is then represented in the display space as the shortest path.
Transect | View profile graph | View different term weights between different points in the display space as graphs.
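As a rough illustration of the "find features within distance" function, the sketch below measures distance in the semantic space and only afterwards looks up 2-D display positions for the hits. It is not the thesis implementation: the Euclidean metric, the function names, and the feature records are assumptions made for this example.

```python
import math

def semantic_distance(a, b):
    """Euclidean distance between two topic vectors (the metric is an assumption)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def features_within(query_topics, features, max_distance):
    """Ids of features whose topic vectors lie within max_distance of the query.

    Distance is measured in the high-dimensional semantic space, not in the
    2-D display space where the result would finally be drawn.
    """
    return sorted(
        feature_id
        for feature_id, feature in features.items()
        if semantic_distance(query_topics, feature["topics"]) <= max_distance
    )

# Hypothetical features: a topic vector plus a 2-D display position each.
features = {
    "n1": {"topics": [0.8, 0.1, 0.1], "xy": (0, 0)},
    "n2": {"topics": [0.7, 0.2, 0.1], "xy": (1, 0)},
    "n3": {"topics": [0.1, 0.1, 0.8], "xy": (5, 4)},
}

hits = features_within([0.8, 0.1, 0.1], features, max_distance=0.2)
display_positions = [features[fid]["xy"] for fid in hits]
print(hits, display_positions)
```

Separating the distance computation from the display lookup mirrors the two-step logic described for the buffer function: compute in the high-dimensional space first, then project to the display space.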

WORKFLOW DESIGN

The original dataset used in this research is a collection of abstracts submitted to the AAG for its annual meetings over 20 years, comprising around 66,000 records. Each abstract consists of around 250 words of text, including author information and keywords. We employ the LDA topic model to extract topics from this corpus of abstracts and apply SOM training to the abstracts based on the topics from the LDA topic model.

The data come in various file formats and data structures across the 20-year range, so extensive pre-processing is necessary to represent each original AAG abstract within a single XML schema. The title, keywords, and abstract text of each abstract are then processed in Mallet (McCallum, 2002) using the LDA topic model to obtain basic topic loadings that describe the whole collection of abstracts in terms of topics.
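To make the pre-processing step more tangible, the following sketch reads one abstract record from XML and concatenates title, keywords, and abstract text into a single document string for topic modeling. The element and attribute names (`abstract`, `title`, `keyword`, `text`, `id`, `year`) and the sample record are hypothetical; the actual schema used in the thesis is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Hypothetical record; tag and attribute names are assumptions for illustration.
SAMPLE_RECORD = """<abstract id="a1" year="2005">
  <title>Mapping urban heat islands</title>
  <keywords>
    <keyword>urban climate</keyword>
    <keyword>remote sensing</keyword>
  </keywords>
  <text>Surface temperature patterns are compared across cities.</text>
</abstract>"""

def record_to_document(xml_string):
    """Return (record id, title + keywords + abstract text as one string)."""
    root = ET.fromstring(xml_string)
    title = root.findtext("title", default="").strip()
    keywords = [k.text.strip() for k in root.iter("keyword") if k.text]
    body = root.findtext("text", default="").strip()
    parts = [title, " ".join(keywords), body]
    # Skip empty parts so missing fields do not leave stray spaces.
    return root.get("id"), " ".join(part for part in parts if part)

doc_id, document = record_to_document(SAMPLE_RECORD)
print(doc_id, document)
```

Once every source format has been converted to one schema, a single routine of this kind can feed all 20 years of abstracts to the topic model.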

Though the LDA topic model intuitively discovers a range of topics, some of these "topics" will be of a syntactic or procedural type, with little value for semantic/topical distinctions in the knowledge domain. For example, some of the topics exported by the model are characterized by phrases like “paper examines, paper explores, paper concludes, paper discusses”. These general phrases could appear in very heterogeneous abstracts that otherwise have little in common. We define this type of topic as a stop topic that should be removed from the original text corpus before further analysis.
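One plausible way to remove a stop topic from the corpus is to strip its characteristic phrases from each abstract before the topic model is trained again. The sketch below follows that reading; it is an assumption about the mechanics rather than the procedure documented in the thesis, and only the four example phrases quoted above are used.

```python
import re

# Phrases taken from the stop topic example in the text.
STOP_PHRASES = [
    "paper examines",
    "paper explores",
    "paper concludes",
    "paper discusses",
]

def remove_stop_phrases(text, phrases=STOP_PHRASES):
    """Delete stop-topic phrases (case-insensitive) and tidy the whitespace."""
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(p) for p in phrases) + r")\b",
        re.IGNORECASE,
    )
    cleaned = pattern.sub(" ", text)
    return re.sub(r"\s+", " ", cleaned).strip()

print(remove_stop_phrases("This paper examines urban heat islands."))
```

After such filtering, the model can be retrained so that the freed topic capacity goes to topical rather than procedural vocabulary.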

Deciding on the number of topics in the model is another challenge. Blei (2003) describes the computation of perplexity (a statistical measure for comparing different probability models) in order to evaluate the performance of different models (as shown in Figure 2). Perplexity is therefore used as the indicator to decide the number of topics in our LDA topic model.

Figure 2 Perplexity evaluations of different computational language models (Blei 2003)
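Perplexity is commonly defined as the exponential of the negative average log-likelihood per token on held-out documents, so lower values indicate a better fit. The sketch below implements that standard definition under the assumption that per-document log-likelihoods are already available from a trained model; the function name and inputs are illustrative and do not correspond to a Mallet API.

```python
import math

def perplexity(doc_log_likelihoods, doc_lengths):
    """exp(-(sum of document log-likelihoods) / (total number of tokens))."""
    total_log_likelihood = sum(doc_log_likelihoods)
    total_tokens = sum(doc_lengths)
    return math.exp(-total_log_likelihood / total_tokens)

# Sanity check: a model that assigns probability 1/4 to every token
# has perplexity 4, regardless of document lengths.
uniform = [10 * math.log(0.25), 5 * math.log(0.25)]
print(perplexity(uniform, [10, 5]))
```

Computing this value for models trained with different numbers of topics gives a curve from which a suitable topic count can be chosen.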

The two most important output files from the LDA topic model are the document topic file and the topic inferencer file. The document topic file gives every input document scores on how related it is to each of the topics. This file is then processed during SOM training with the aim of generating a 2-dimensional topical display space. SOM training treats topics as distinct dimensions and thus represents each AAG abstract as a topic vector. Neurons in the SOM become associated with topic vectors of the same dimensionality as the input vectors and become the geometric features for the visualization of the geographic knowledge domain. The topic inferencer can be used to infer topic scores for any text item, including arbitrary text entered by users later in the web GIS application. The 2-dimensional neurons from the SOM can then be processed in GIS tools to create a base map. The whole proposed workflow is shown in Figure 3.


Figure 3. Data processing workflow
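To illustrate how topic vectors become associated with neurons on a 2-dimensional grid, the following minimal sketch trains a small rectangular SOM in plain Python. It is a generic textbook formulation and not the SOM software or parameter settings used in this study: the Euclidean distance, Gaussian neighborhood, linear decay schedule, and all function and parameter names are assumptions made for illustration.

```python
import math
import random


def find_bmu(weights, vector):
    """Return the index of the neuron whose weight vector is closest
    (squared Euclidean distance) to the input vector."""
    def sq_dist(w):
        return sum((wi - vi) ** 2 for wi, vi in zip(w, vector))
    return min(range(len(weights)), key=lambda i: sq_dist(weights[i]))


def train_som(vectors, rows, cols, epochs=20, lr0=0.5, seed=0):
    """Train a small rectangular SOM; returns (positions, weights).

    positions[i] is the (row, col) of neuron i on the 2-D grid and
    weights[i] is its topic vector, with the same length as the inputs.
    """
    rng = random.Random(seed)
    dim = len(vectors[0])
    positions = [(r, c) for r in range(rows) for c in range(cols)]
    weights = [[rng.random() for _ in range(dim)] for _ in positions]
    radius0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        # Learning rate and neighborhood radius shrink over time.
        frac = 1.0 - epoch / epochs
        lr = lr0 * frac
        radius = max(radius0 * frac, 0.5)
        for vec in vectors:
            bmu = find_bmu(weights, vec)
            br, bc = positions[bmu]
            for i, (r, c) in enumerate(positions):
                grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                # Gaussian neighborhood: neurons near the BMU move more.
                h = math.exp(-grid_d2 / (2.0 * radius ** 2))
                weights[i] = [w + lr * h * (v - w)
                              for w, v in zip(weights[i], vec)]
    return positions, weights
```

After training, each abstract can be placed at the grid position of its best-matching neuron, and the neuron weight vectors carry the topic loadings needed for the base map.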

WEB GIS APPLICATION DESIGN

There are usually two parts in a typical web GIS framework: server-side and client-side. On the server side, all geographic features and their attributes are stored in databases or files, and mapping or geoprocessing services are generated from them. On the client side, the mapping or geoprocessing services are received through the Internet and can be used by applications to create functionality for different users. The proposed web GIS application needs to serve knowledge domain base maps through map services and its other proposed functionalities through geoprocessing services (as shown in Figure 4).

Figure 4. Web GIS application framework


Slightly different from common web GIS applications, our application needs support from Natural Language Processing (NLP) services, which process the text input. The NLP services have two parts on the server side (Figure 5): the topic model that processes text input and the SOM cells that represent the input in 2-dimensional space.

Figure 5. NLP services

Users can request the above-mentioned functions with either text or geometry input. Text input is processed in Mallet (McCallum, 2002) using the topic model inferencer to obtain its topic scores as high-dimensional weights; the weights are then matched to a SOM cell to obtain the lower-dimensional position that is displayed to the user. Geometry input is matched to a SOM cell to obtain its high-dimensional weights, which are then processed in high-dimensional space.
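The two request paths can be sketched as two lookups against the same table of SOM cells. The sketch below assumes the topic scores for a text input have already been produced by the inferencer, and it uses squared Euclidean distance for matching; the data layout (a dictionary from grid position to weight vector), the function names, and the sample values are hypothetical.

```python
def nearest_cell(cells, topic_scores):
    """Text path: map a topic-score vector to the (x, y) position of
    the SOM cell whose weights are closest (squared Euclidean)."""
    def sq_dist(weights):
        return sum((w - s) ** 2 for w, s in zip(weights, topic_scores))
    return min(cells, key=lambda pos: sq_dist(cells[pos]))


def cell_for_point(cells, x, y):
    """Geometry path: snap a map coordinate to the nearest cell
    position and return (position, high-dimensional weights)."""
    pos = min(cells, key=lambda p: (p[0] - x) ** 2 + (p[1] - y) ** 2)
    return pos, cells[pos]


# Hypothetical 2x1 SOM with three topics per cell.
cells = {
    (0, 0): [0.8, 0.1, 0.1],
    (1, 0): [0.1, 0.2, 0.7],
}
```

Keeping both lookups on one shared structure means that a position returned for a text query can be fed straight back into the geometry path to retrieve the weights of that cell.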


WORKFLOW

This part implements a workflow in two environments: Java programs that accomplish text processing and an ArcGIS model that provides GIS processing.

Text Processing Workflow

The main objective of the first part of the workflow is to extract high-dimensional topics from the text corpus with which to train a SOM. Topics extracted from the text corpus represent a high-dimensional topical space. Representations of different topics and sub-domains are filtered to be meaningful in the context of knowledge domains. The SOM transforms this high-dimensional topical space into a two-dimensional space that can be used for creating maps.

DATA PREPROCESSING

The dataset used in this study consists of approximately 66,000 conference abstracts collected from AAG meetings in various formats over the course of 20 years (Table 1). Data are processed into one single XML schema, which can then be easily transformed to any other format. Abstracts in PDF format are first exported to text (TXT) format, which contains three lines (Figure 6). The first line includes author name, author contact info, and abstract title; the second includes abstract content; and the third includes abstract keywords. This text file is converted to XML format in Java, with one element for each of the three lines. Paper title, author name, and author contact information are extracted from the first element (Figure 6).

Figure 6. XML Processing for PDF Format Data
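The three-line conversion can be sketched as follows. The thesis implementation is in Java; Python is used here only for brevity, and the tag names (`abstract`, `header`, `content`, `keywords`) are hypothetical. The first line is stored unparsed because the delimiters separating author name, contact information, and title are not specified in the text.

```python
import xml.etree.ElementTree as ET


def abstract_txt_to_xml(text):
    """Wrap the three lines of an exported abstract in XML elements.

    Line 1: author name, contact info, and title (left unparsed here).
    Line 2: abstract content.  Line 3: abstract keywords.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) != 3:
        raise ValueError(f"expected 3 non-empty lines, got {len(lines)}")
    root = ET.Element("abstract")  # hypothetical tag names throughout
    for tag, value in zip(("header", "content", "keywords"), lines):
        ET.SubElement(root, tag).text = value
    return root
```

Calling `ET.tostring(root, encoding="unicode")` on the result serializes the element for writing to disk; rejecting inputs that do not have exactly three lines makes malformed exports visible early.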

Abstracts in Excel format are exported to two XML files with author and abstract information, which are then joined into a single XML file in Java.

Every XML file derived from the dataset of abstracts of varying formats is populated with information about its corresponding abstracts, including paper year, conference location, and ID (Figure 7). Each abstract entry includes its ID, title, keywords, abstract text, and author information. Author information includes name, author ID, and other information. The XML file can then be transformed to the Mallet input format (each line contains one input document with its ID and content) for training topic models. Although all files share the same XML schema, additional preprocessing is required before training the LDA topic model.

Figure 7. XML Schema
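The transformation from the unified XML to a one-document-per-line file can be sketched as below. The element names (`abstract`, `id`, `title`, `keywords`, `text`) are hypothetical stand-ins for the schema in Figure 7, and the tab-separated layout with a placeholder label in the middle column is an assumption about the input convention; the exact layout used in the study is not stated.

```python
import xml.etree.ElementTree as ET


def xml_to_mallet_lines(xml_text):
    """Yield one 'id<TAB>label<TAB>text' line per abstract.

    The label column is a placeholder ("X"); only the ID and the
    combined title, keywords, and abstract text carry information.
    """
    root = ET.fromstring(xml_text)
    for abstract in root.iter("abstract"):
        doc_id = abstract.findtext("id", default="").strip()
        parts = [abstract.findtext(tag, default="")
                 for tag in ("title", "keywords", "text")]
        # Collapse whitespace so each document stays on one line.
        content = " ".join(" ".join(parts).split())
        yield f"{doc_id}\tX\t{content}"
```

Collapsing internal line breaks matters because a stray newline inside an abstract would otherwise split one document into two input instances.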

First, the topic model training process is set to a case-sensitive mode in order to detect capital characters. All capital characters need to be transformed to lower case, both in terms consisting only of capital characters and in terms in which only the first character is capitalized. Second, each noun is transformed to its singular form, because the same noun in plural and singular form would otherwise be treated as different words in the topic model. These two steps are shown in Figure 8.


Figure 8. Data Content Preprocessing
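The two normalization steps can be sketched as follows. The suffix rules are a deliberately rough assumption and not the method used in this study; they are also applied to every token rather than to nouns only, since part-of-speech tagging is outside the scope of the sketch. A production pipeline would more likely rely on a lemmatizer or a dictionary.

```python
import re


def singularize(word):
    """Very rough English singularization based on suffix rules
    (an illustrative assumption, not a complete rule set)."""
    if len(word) > 4 and word.endswith("ies"):
        return word[:-3] + "y"          # cities -> city
    if len(word) > 4 and word.endswith(("ches", "shes", "sses", "xes")):
        return word[:-2]                # classes -> class
    if (len(word) > 3 and word.endswith("s")
            and not word.endswith(("ss", "us", "is"))):
        return word[:-1]                # maps -> map
    return word


def normalize(text):
    """Lower-case the text, then singularize each alphabetic token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(singularize(t) for t in tokens)
```

Lower-casing first keeps the suffix rules simple, since they then only need to handle one letter case.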

LDA TOPIC MODELING

There are two preprocessing parts in LDA topic modeling. First, irrelevant text must be filtered out of the original corpus to ensure the quality of the output topics. Second, the number of topics must be provided as an input parameter for training, which also influences the quality of the output topics. The following sections discuss how to handle these two steps and describe the output files from LDA topic modeling.

Iterative Filtering of Stop Topics and Stop Phrases

As the LDA topic model intuitively discovers "topics", some of these may be of a syntactic or procedural nature instead of being domain-specific semantic descriptors. For example, one of the topics initially generated by the model was characterized by phrases like "paper examines," "paper explores," "paper concludes," or "paper discusses." These are not particularly relevant to the discovery of domain knowledge structures, since they are general expressions that could appear in any abstract. To make this distinction, the notion of a stop topic is introduced, which should be removed from the original text corpus before further analysis.
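Removing the phrases that characterize a stop topic can be sketched with a single regular expression built from a phrase list. The list below contains only the four example phrases quoted above, and the function name and case-insensitive, whole-word matching are assumptions. In an iterative setting, the list would be extended and the filter rerun each time a retrained model surfaces a new stop topic.

```python
import re

# Example phrases quoted in the text; a real list would be longer.
STOP_PHRASES = ["paper examines", "paper explores",
                "paper concludes", "paper discusses"]


def remove_stop_phrases(text, phrases=STOP_PHRASES):
    """Delete each stop phrase (case-insensitive, whole words) and
    tidy the whitespace left behind."""
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(p) for p in phrases) + r")\b",
        re.IGNORECASE)
    return " ".join(pattern.sub(" ", text).split())
```

Matching on word boundaries avoids deleting a phrase that merely appears inside a longer word sequence such as "newspaper examines".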

We extended this idea of stop topics further to define stop phrases, among which two types are distinguished. The first type is a phrase that pairs certain generic nouns (e.g., "challenge", "difficulty", "issue", "problem", "paper", "project", "research", "study") with a verb (e.g., "study explores" or "challenges met"). The second type is a phrase that includes a
