Cluster analysis and ontology generation techniques for the development of scholarly semantic web



By Quan Thanh Tho

SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

AT SCHOOL OF COMPUTER ENGINEERING NANYANG TECHNOLOGICAL UNIVERSITY NANYANG AVENUE, SINGAPORE 639798

2005


Table of Contents ii

1.1 Scholarly Web Information 1

1.2 Scholarly Information Retrieval 2

1.3 The Semantic Web 3

1.4 Semantic Web-based Retrieval Systems 4

1.5 Objectives 5

1.6 Major Contributions 8

1.7 Organization of the Thesis 9

2 The Semantic Web 11

2.1 Markup Languages 11

2.1.1 Hypertext Markup Language 12

2.1.2 Extensible Markup Language 12

2.1.3 Resource Description Framework 13

2.2 The Semantic Web 14


2.5 Semantic Web Portals 24

2.5.1 Semantic Web Portal Architecture 25

2.5.2 Requirements on Semantic Web Portals 25

2.6 Web Services and Semantic Web Services 27

2.7 Summary 29

3 Context-based Cluster Analysis 31

3.1 Clustering Methods 32

3.1.1 Hierarchical Clustering Methods 32

3.1.2 Partitioning Clustering Methods 33

3.1.3 Other Clustering Methods 35

3.1.4 Discussion 37

3.2 Context-based Cluster Analysis 38

3.2.1 Cross-Clustering Relation Generation 39

3.2.2 Cross-Clustering Context Generation 51

3.3 Performance Evaluation 55

3.3.1 Experiment 55

3.3.2 Evaluation Measures 57

3.3.3 Experimental Results 59

3.4 Summary 64

4 Expert and Expertise Finding 65

4.1 Related Work 65

4.1.1 Expertise Recommender Systems 66

4.1.2 Web Mining for Finding Expertise 66

4.1.3 Author Co-citation Analysis Approach 67

4.1.4 Discussion 68

4.2 CCA-based Expert Finding 68

4.3 Document Clustering 70

4.3.1 Feature Selection 70

4.3.2 Pre-processing 70

4.3.3 Transformation 71

4.3.4 Document Clusters Generation 71

4.4 Author Clustering 72


4.4.3 Converting into Correlation Matrix 73

4.4.4 Generating Author Clusters 73

4.6 Expert Information Generation 75

4.6.1 Identifying Researchers’ Research Areas 75

4.6.2 Ranking Expert 76

4.6.3 Retrieving Expert Information 76

4.7 Expert Retrieval and Visualization 76

4.7.1 Expert Retrieval 77

4.7.2 Expert Visualization 77

4.8 Performance Evaluation 78

4.8.1 Experiment 79

4.8.2 Experimental Results 79

4.8.3 Comparison with Other Approaches 84

4.9 Summary 88

5 Research Trend Detection 90

5.1 Related Work 90

5.1.1 Semi-automatic Approaches 91

5.1.2 Automatic Approaches 92

5.1.3 Discussion 94

5.2 CCA-based Trend Detection 95

5.3 Keyword-based Clustering 96

5.3.1 Document Clustering 97

5.3.2 Publisher Clustering 97

5.3.3 Temporal Clustering 98

5.5 Trend Information Generation 102

5.5.1 Current Trend Identification 103

5.5.2 Trend Information Extraction 103


5.7.3 Trend Information Extraction and Retrieval 109

5.7.4 Trend Visualization 109

5.8 Summary 111

6 Fuzzy Concept Hierarchy Generation 112

6.1 Related Work 113

6.1.1 Concept Hierarchy Generation 114

6.1.2 Conceptual Clustering 114

6.1.3 Formal Concept Analysis 115

6.1.4 Discussion 116

6.2 Fuzzy Theory 117

6.3 Fuzzy Concept Hierarchy Generation 119

6.3.1 Fuzzy Formal Concept Analysis 119

6.3.2 Fuzzy Conceptual Clustering 127

6.3.3 Hierarchical Relation Generation 129

6.4 Research Concept Hierarchy Generation 132

6.4.4 Performance Evaluation 135

6.5 Machine Faults Concept Hierarchy Generation 142

6.6 News Topic Themes Concept Hierarchy Generation 149

6.7 Summary 155


7.1.1 Ontology Generation 158

7.1.2 Generating Ontology from Scholarly Knowledge 159

7.2 Fuzzy Ontology Generation 161

7.2.1 The FOGA Approach 161

7.2.2 Incremental Ontology Update 166

7.2.3 Research Hierarchy Ontology Generation 170

7.3 Cluster-based Ontology Generation 170

7.3.1 The COGA Approach 171

7.3.2 Experts Ontology Generation 172

7.3.3 Trends Ontology Generation 173

7.4 Ontology Integration 174

7.4.1 Ontology Integration Framework 174

7.4.2 Scholarly Ontology Generation 175

7.5 Semantic Web Representation 178

7.6 Browsing Scholarly Ontology 182

7.7 Summary 183

8 Scholarly Semantic Web 184

8.1 Related Work 184

8.1.1 Citation-based Retrieval 184

8.1.2 Semantic Web-based Information Retrieval 187

8.2 System Overview of SSWeb 188

8.3 Scholarly Semantic Web Services 189

8.3.1 Scholarly Service Provider 190

8.3.2 Scholarly Service Requester 192

8.3.3 Matchmaking Agent 193

8.3.4 Scholarly Information Retrieval 194

8.4 Summary 197


9.2.3 Automatic Ontology Integration 203

9.2.4 Fuzzy Query Expansion using Fuzzy Concept Hierarchy 204

A.1 Refereed Conferences and Workshops 205

A.2 Book Chapters 206

A.3 Journals 206


3.1 A distance matrix 45

3.2 A cross-table of a document clustering context 52

3.3 A cross-table of an author clustering context 53

3.4 A cross-clustering context from the document and author clustering contexts 55

3.5 Different combinations of clusters mining 57

4.1 An example of the Keyword-Author Cross-Clustering Context 74

4.2 Manually classified experts 79

4.3 Performance results based on the average F-measure 80

5.1 An example of a document clustering context 101

5.2 An example of a topic clustering context 101

5.3 An example of a temporal clustering context 101

5.4 An example of the Keyword-Topic-Temporal Cross-Clustering Context 102

5.5 Manually predefined trends in the Information Retrieval field 106

5.6 Trends identification results using the single link method 107

5.7 Trends identification results using the complete link method 107

5.8 Trends identification results using the average link method 108

5.9 Trends identification results using the Ward’s method 108

5.10 Performance results of trends information extraction 109

6.1 A cross-table of a formal context 120


clustering methods based on different similarity thresholds T s 134

6.7 Runtime (in sec.) required to generate conceptual clusters 134

6.8 Performance results based on precision 137

6.9 Performance results based on recall 137

6.10 Performance results based on F-measure 137

6.11 Performance comparison based on precision 138

6.13 Performance comparison based on F-measure 138

6.14 Number of research clusters using FCHG and LFCA-based conceptual clustering methods based on different confidence thresholds T C 144

6.16 Retrieval accuracy 149

6.17 Number of research clusters using FCHG and LFCA-based conceptual clustering methods based on different confidence thresholds T C 151

6.19 Manually classified themes of Reuters news topics 153

B.1 20 queries for performance evaluation on expert finding 207


1.1 System architecture of the proposed Scholarly Semantic Web 6

2.1 Representation of a publication using XML 13

2.2 Another representation of the publication using XML 13

2.3 RDF data model 14

2.4 Representation of a publication using RDF 14

2.5 Architecture of the Semantic Web 15

2.6 Representation of semantic information using SHOE 18

2.7 Representation of semantic information using Ontobroker 19

2.8 Class representation using DAML-ONT 21

2.9 Class representation using OIL 22

2.10 Class representation using DAML+OIL 23

2.11 Class Representation using OWL 24

2.12 Semantic Web Portal 25

2.13 Operational mechanism in Web Services 27

2.14 Technologies used on the operational mechanism in Web Services 28

3.1 Fuzzy clustering 36

3.3 Cross-Clustering Relation Generation 38

3.4 Vectorization from document and author clustering 41

3.5 Algorithm for distance matrix generation 46


3.10 Performance results based on IR measures 61

3.11 Performance results based on entropy 63

4.1 The CCA-based expert finding approach 69

4.2 Document clustering process 70

4.3 Author clustering process 72

4.5 Visualizing research experts in research areas 78

4.9 Performance comparison based on recall 85

4.11 Performance comparison based on F-measure 87

4.12 Performance comparison based on average F-measure 89

5.1 The proposed CCA-based trend detection approach 96

5.2 Document clustering 97

5.3 Publisher clustering 98

5.4 Temporal clustering 99

5.6 Performance results of trends information retrieval 110

5.7 Trend visualization 111

6.1 The proposed Fuzzy Concept Hierarchy Generation technique 119

6.2 A concept lattice generated from traditional FCA 121

6.3 A fuzzy concept lattice generated from FFCA 124

6.4 An L-fuzzy concept lattice 126

6.5 The fuzzy conceptual clustering algorithm 128

6.6 Conceptual clusters with confidence threshold T S = 0.4 130

6.7 Conceptual clusters with confidence threshold T S = 0.5 130

6.8 Conceptual clusters with associated objects and attribute sets 131


6.11 Performance results based on cluster goodness 140

6.12 Performance results based on AUP 141

6.13 An example fault-condition and checkpoint in a customer service record 142

6.14 A part of the Machine Faults Concept Hierarchy for machine model AV 2011 145

6.15 Performance results based on IR measures 147

6.16 A part of the News Topic Themes Concept Hierarchy 152

6.17 Average performance results based on IR measures 155

6.18 Performance comparison based on IR measures 156

7.1 An example scholarly ontology 162

7.2 Fuzzy ontology generation process 164

7.3 Research Hierarchy Ontology 170

7.4 Cluster-based Ontology Generation Framework 171

7.5 Experts Ontology 173

7.6 Trends Ontology 173

7.7 Ontology Integration Framework 175

7.8 Sets of ontologies’ classes 176

7.9 Integration rules 176

7.10 Preliminary Scholarly Ontology 177

7.11 Scholarly Ontology 179

7.12 An example of Trends Ontology classes represented in OWL Full 180

7.13 An example of a part of the Scholarly Ontology represented in OWL Full 181

7.14 Browsing research areas on the Scholarly Ontology 182

7.15 Browsing documents on the Scholarly Ontology 182

8.1 General architecture of a typical citation-based retrieval system 185

8.2 Scholarly Semantic Web 189


The Web has become one of the most important media for storing scholarly related information. Search engines such as Google and Yahoo are insufficient for finding relevant scholarly information. The current citation-based retrieval systems such as ISI (Institute for Scientific Information) and CiteSeer provide only basic citation search support, which cannot cater for the needs of the growing scholarly research community. Moreover, the scholarly knowledge discovered from a citation database cannot be shared among different citation-based retrieval systems.

Nevertheless, a citation database contains very useful semantic scholarly knowledge which can be further explored for supporting advanced search functions such as expert finding and trend detection. The recent development of the Semantic Web has provided a very suitable environment for supporting the sharing of scholarly knowledge among different research communities. Therefore, in this research, we aim to develop a Semantic Web-based system for the sharing and retrieval of scholarly information based on a citation database.

To achieve the goal, we have proposed the Scholarly Semantic Web (or SSWeb), which organizes scholarly knowledge as an ontology that is distributed across the Semantic Web. As such, scholarly information can be managed and refined by domain experts, and can be shared and accessed by programs. Moreover, the proposed SSWeb system can support not only typical scholarly document and citation-based retrieval, but also advanced scholarly search functions such as expert finding, trend detection and fuzzy document retrieval.


search areas. The advantage of the proposed approach is its capability in finding information on experts from a global perspective instead of limiting only to a specific organization or domain.

• A CCA-based trend detection approach has been developed to detect trends in research areas. The proposed approach can detect research trends from a citation database in a fully automatic manner.

• A fuzzy concept hierarchy generation technique has been developed for fuzzy concept hierarchy generation in domains with uncertain information. This technique can be used for supporting fuzzy document retrieval on the Scholarly Semantic Web.

• Ontology generation approaches have been developed for generating the Scholarly Ontology. The Scholarly Ontology is generated by integrating the different scholarly knowledge discovered from a citation database.

• A distributed architecture for the Scholarly Semantic Web has been developed to support scholarly information retrieval over the Semantic Web environment. In the Scholarly Semantic Web, the scholarly knowledge is shareable over multiple locations through Semantic Web Services.


I would like to express my gratitude and sincere appreciation to my supervisor, Assoc. Prof. Hui Siu Cheung, for his guidance, constant encouragement, thoughtful criticism and invaluable suggestions. Without his support and invaluable patience, the thesis would not have been possible. He has also always kept on pushing me to advance to higher research levels.

I would like to thank Dr. Alvis Fong for his review and invaluable comments on my research.

I would also like to thank Dr. Tru Hoang Cao for rendering help in my work and career.

I would also like to thank Dr. He Yulan for her help and the useful related materials she provided about the citation database in the initial stage of my research.

I would also like to thank Nanyang Technological University (NTU) for providing me the financial support to do research in Singapore.

I want to thank all my friends and colleagues, who assisted me in many ways throughout the duration of the research. In particular, I would like to express my special thanks to my friend, Mr. Do Tien Dung, for his constant support and helpful advice.

My gratitude also goes to Mr. Teo Choo Eng and Ms. Eng Hui Fang, the laboratory technicians of the Database Technology Laboratory, for their technical support and help.

I am also grateful to the staff in the BioInformatics Research Centre (BIRC) for their efforts to maintain a good working environment for my research.

Last but not least, my deepest gratitude goes to my family, my parents and sister,


With the rapid growth of the World Wide Web (or Web), there is more and more information about scholarly scientific publications stored on the Web. Researchers can access and even download scientific documents from various online repositories on the Web. Electronic archives such as CiteBase [1] and Open Archives Initiative (OAI) [2] provide free access to scholarly documents. In electronic journals such as Elsevier [3] and IOP [4], scientific documents are also provided with links to discussions, related documents, and notification and alerting services. Digital libraries [5] such as the ACM Digital Library [6] and ScienceDirect [7] contain a large quantity of organized publications collected from various publishers in various forms (e.g. articles and books) and formats (e.g. texts, animations and movies). In fact, most of the scientific research work published in scholarly journals and conference proceedings is now available online in the form of digital libraries [8].

As a result, researchers can always browse and search the Web to keep abreast of the research trends that are relevant to them, and focus more on new research issues. Currently, researchers from most universities and institutions rely heavily on the Web to find their related scholarly information [9]. However, as the number of scientific documents available on the Web increases tremendously, efficient and effective mechanisms are needed in order to help researchers to locate their related scholarly information and documents.

1.2 Scholarly Information Retrieval

A number of search engines such as Google [10] and Yahoo [11] have been developed to help users find information on the Web. In particular, Vivisimo [12] is a search engine that applies clustering techniques to categorize search results. These search engines retrieve documents relevant to the search queries based on keywords. In order to support search queries, search engines crawl web sites and index them based on the extracted keywords and linkage information. Generally, search engines are not very effective in helping researchers to find related scholarly information because they are mainly developed for general searching purposes. The returned search results are too generalized and often mixed up with non-scholarly information. Thus, specialized search engines or systems, such as ISI [13] or CiteSeer [14, 15, 16, 17], that can overcome the over-generalization problem of traditional search engines have been developed for retrieving scholarly-related information.

In scholarly publications, papers or books are usually cited as references. Citations are used to help readers to locate relevant papers for further understanding on the discussed topic. Citation indexes contain references that the documents cite. They provide links between source documents and the cited documents. Thus, citation indexes provide useful information to help researchers to conduct scientific research, such as identifying researchers working on their research areas, finding publications from a certain research area, and even analyzing research trends. Therefore, citation indexes are employed by specialized search engines to index scientific publications for retrieval. Citation indexes are stored in citation databases. In a citation database, citation indexes are stored together with publication-related information such as title, author, keywords, date of publication, etc.
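The idea of a citation index linking source documents to cited documents can be illustrated with a minimal sketch. The class name `CitationIndex`, its methods and the sample records are hypothetical and are not taken from ISI, CiteSeer or PubSearch; they only assume that each publication record carries a title, authors, a year and a list of references.

```python
from collections import defaultdict


class CitationIndex:
    """Toy citation index (hypothetical): links source documents to cited documents."""

    def __init__(self):
        self.cites = defaultdict(set)     # source document -> documents it cites
        self.cited_by = defaultdict(set)  # cited document -> documents citing it
        self.meta = {}                    # document id -> publication-related information

    def add_document(self, doc_id, title, authors, year, references=()):
        # Store the publication record together with its citation links.
        self.meta[doc_id] = {"title": title, "authors": list(authors), "year": year}
        for ref in references:
            self.cites[doc_id].add(ref)
            self.cited_by[ref].add(doc_id)

    def cited(self, doc_id):
        """Documents that doc_id cites."""
        return sorted(self.cites.get(doc_id, ()))

    def citing(self, doc_id):
        """Documents that cite doc_id."""
        return sorted(self.cited_by.get(doc_id, ()))

    def authors_citing(self, doc_id):
        """Authors of all documents that cite doc_id."""
        names = set()
        for source in self.cited_by.get(doc_id, ()):
            names.update(self.meta[source]["authors"])
        return sorted(names)


index = CitationIndex()
index.add_document("d1", "Paper A", ["Lee"], 2001)
index.add_document("d2", "Paper B", ["Tan"], 2003, references=["d1"])
index.add_document("d3", "Paper C", ["Lee", "Ng"], 2004, references=["d1", "d2"])
print(index.citing("d1"))          # documents that cite d1
print(index.authors_citing("d1"))  # researchers whose work builds on d1
```

Keeping both directions of the link (`cites` and `cited_by`) is a design choice made here so that cited and citing documents can each be looked up directly.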

Citation-based retrieval systems have been developed for retrieving scholarly information. Institute for Scientific Information (ISI) [13] provides a commercial citation-based retrieval system over the Web, which allows users to search for cited documents from the citation database. CiteSeer [14, 15, 16, 17], which is also known as ResearchIndex,

such as cited and citing documents, are then retrieved. In addition to traditional citation search, PubSearch [18, 19] is a retrieval system that also supports document clustering retrieval and author clustering retrieval. Similar documents and authors working in similar research areas can be retrieved based on input keywords and author names respectively.

Although citation-based retrieval systems have provided useful search functions for finding scientific publications, they still have the following limitations:

• Search functions. Only basic search functions such as keyword search, author search and traditional citation search are supported. However, these are not sufficient to keep up with the needs of the rapidly growing scholarly community. Advanced search functions such as expert finding in certain research areas and the detection of current research trends are highly desirable.

• Sharing of scholarly information. There is currently no significant sharing of knowledge between different citation-based retrieval systems. As such, the exchange of knowledge is difficult. This hampers the development of a global, distributed environment for supporting scholarly information retrieval.

1.3 The Semantic Web

Currently, information available on the Web has been designed for humans to understand. Programs can be written to process, analyze and index web pages to help humans to process the information. However, due to the lack of machine-readable structure and knowledge representation in web documents, programs are unable to comprehend web page contents precisely, and hence semantic information from web documents cannot be extracted. As such, a method for representing knowledge such that programs can understand, share and exchange the knowledge is needed.

To tackle this problem, Tim Berners-Lee et al. [20] have proposed the Semantic Web, which is an extension of the current Web in which information is given well-defined meaning, better enabling computers and people to work in cooperation. Ontology is used to represent knowledge on the Semantic Web. Generally, an ontology is a conceptualization of a domain into a human-understandable but machine-readable format consisting of entities, attributes, relationships and axioms [21]. As such, programs can use the knowledge from the Semantic Web for processing information in a semantic manner. Web Services [22] have been introduced to make the knowledge conveyed by the ontology on the Semantic Web accessible across different applications. Semantic Web Services [23] represent Web Services as ontologies, thereby making the provided services not only accessible but also understandable by other programs.
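To make the four ingredients of an ontology (entities, attributes, relationships and axioms) concrete, the following is a minimal sketch in plain Python. It is not an ontology language such as OWL; the `Ontology` class, its method names and the example entities (`Person`, `Researcher`, `Document`) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    """Toy ontology (hypothetical): entities, attributes, relationships, one axiom."""
    attributes: dict = field(default_factory=dict)  # entity -> set of attribute names
    parents: dict = field(default_factory=dict)     # entity -> parent entity (is-a link)
    relations: set = field(default_factory=set)     # (subject, predicate, object) triples

    def add_entity(self, name, attributes=(), parent=None):
        self.attributes[name] = set(attributes)
        if parent is not None:
            self.parents[name] = parent

    def relate(self, subject, predicate, obj):
        self.relations.add((subject, predicate, obj))

    def is_a(self, name, ancestor):
        # Axiom assumed here: is-a is reflexive and transitive.
        while name is not None:
            if name == ancestor:
                return True
            name = self.parents.get(name)
        return False

    def inherited_attributes(self, name):
        # An entity carries its own attributes plus those of all its ancestors.
        attrs = set()
        while name is not None:
            attrs |= self.attributes.get(name, set())
            name = self.parents.get(name)
        return attrs


onto = Ontology()
onto.add_entity("Person", attributes=["name"])
onto.add_entity("Researcher", attributes=["affiliation"], parent="Person")
onto.add_entity("Document", attributes=["title", "year"])
onto.relate("Researcher", "authorOf", "Document")
print(onto.is_a("Researcher", "Person"))
print(sorted(onto.inherited_attributes("Researcher")))
```

Because the structure is explicit, a program can answer questions such as "is every Researcher a Person?" without parsing free text, which is the sense in which such a representation is machine-readable.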

1.4 Semantic Web-based Retrieval Systems

The Semantic Web has provided a very suitable environment for supporting knowledge management and retrieval. In the traditional Web, semantic information can be annotated in web pages using enhanced metadata description languages, as shown in a number of systems such as SHOE [24], Ontobroker [25], WebKB [26], Quest [27], and Expressive and Efficient Language for XML Information Retrieval (ELIXIR) [28]. However, due to the lack of a standard for knowledge annotation, such systems are still unable to share their information.

With the advancement of the Semantic Web, Semantic Web Portals [29] (SW Portals) have been developed based on the Semantic Web technologies. AIFB [30], Esperonto [31], OntoWeb [32], Embolis K42 [33] and Mondeca ITM [34] are some well-known portals that are currently being developed. Generally, Semantic Web-based information retrieval is supported through Semantic Web Services provided by SW portals. To help a retrieval system to locate suitable Semantic Web Services automatically, intelligent agents called matchmaking agents [35] or matchmaker agents [36] have been proposed. Swoogle [37], a Semantic Web-based search engine, has used intelligent agents as crawlers to collect information provided by existing Semantic Web Services over the Web. The collected information is then indexed for information retrieval.

Recently, several Semantic Web-based systems have been developed for the retrieval of scholarly information. In E-scholar Knowledge Inference Model (ESKIMO) [38, 39], a Semantic Web-based scholarly information management system has been developed

been developed to support the retrieval of scholarly publications from on-line archives based on the Semantic Web.

However, one of the major obstacles to developing Semantic Web-based retrieval systems is the construction of an ontology for the corresponding domain. In the scholarly domain, the scholarly ontology of the existing Semantic Web-based retrieval systems is constructed mainly based on explicit information from scientific documents (such as titles, authors and abstracts) or using manual methods. However, the explicit document information can only provide knowledge or ontology for supporting basic search functions such as keyword search or author search, and it is a tedious and difficult task to construct a scholarly ontology manually.

1.5 Objectives

The Web has become one of the most important media for storing scholarly related information. Search engines are insufficient for finding relevant scholarly information. The current citation-based retrieval systems provide only basic citation search support, which cannot cater for the needs of the growing scholarly research community. Moreover, the scholarly knowledge discovered from a citation database cannot be shared among different citation-based retrieval systems. Nevertheless, a citation database contains very useful semantic scholarly knowledge that can be further explored for supporting advanced search functions such as expert finding and trend detection.

The development of the Semantic Web has provided a very suitable environment for supporting the sharing of scholarly knowledge among different scholarly research communities. However, one of the challenges for the development of Semantic Web-based retrieval systems is the construction of a scholarly ontology. The construction process for a scholarly ontology should be easy and preferably automatic rather than manual, which is tedious and time-consuming.

This research aims to develop a Semantic Web-based system for the sharing and retrieval of scholarly information based on a citation database. The proposed system is known as the Scholarly Semantic Web (or SSWeb). The proposed SSWeb system will organize scholarly knowledge as an ontology, which is distributed on the Semantic Web. As such, scholarly information can be managed and refined by the corresponding domain experts, and can be shared and accessed by programs. Moreover, the proposed SSWeb system can support not only typical scholarly document and citation-based retrieval, but also advanced scholarly search functions such as expert finding, trend detection and fuzzy document retrieval, in which fuzzy membership can be used to weight the query terms accordingly to improve the retrieval performance.

Figure 1.1: System architecture of the proposed Scholarly Semantic Web (labels recovered from the figure: Citation Database, Cluster Analysis, Ontology Generation, Organization 1, Scholarly Ontology)

Figure 1.1 shows the proposed distributed architecture of the Scholarly Semantic Web. In SSWeb, each organization (or institution) maintains its own scholarly ontology. To generate scholarly ontology, we first investigate data mining techniques based on

ontology. Semantic Web Services will also be investigated in order to provide the support for scholarly information retrieval.

Therefore, to achieve our primary aim of developing the Scholarly Semantic Web, we will carry out the research in the following areas:

• Cluster Analysis. Advanced search functions such as expert finding and trend detection are highly desirable for the scholarly research community. In this research, we will investigate a clustering technique for analyzing cluster relationships among multiple clustering results from data on the documents, authors, publishers and dates of publication in a citation database. The discovered cluster relationships will be used for developing the functions for expert finding and trend detection.

• Fuzzy Conceptual Clustering. Traditional clustering techniques may not be sufficient for discovering and representing certain types of scholarly information, such as hierarchical relations and uncertain information, that commonly occur in scholarly documents. The incorporation of hierarchical structures of scholarly knowledge with uncertain information will provide support for fuzzy document retrieval, which will enhance the retrieval performance. In this research, we will investigate a fuzzy-based conceptual clustering technique for deriving scholarly knowledge from the uncertain information of scholarly documents in a citation database. The derived knowledge will be hierarchically organized as a fuzzy concept hierarchy.

• Ontology Generation. In order to make the scholarly knowledge derived from cluster analysis and fuzzy conceptual clustering available on the Semantic Web environment, it is necessary to convert the knowledge into an ontology formalism. In this research, we will investigate techniques for automatic generation of scholarly ontology from the generated scholarly knowledge. In addition, we will also investigate the integration of different types of ontologies that are generated from cluster analysis and fuzzy conceptual clustering.

• Scholarly Semantic Web Services. To provide scholarly information retrieval services over the Semantic Web, we will investigate a Semantic Web-based architecture for the delivery of Scholarly Semantic Web Services. The proposed architecture should enable the retrieval of scholarly information from multiple Scholarly Service Providers, thereby enabling the sharing and reusability of scholarly knowledge in a distributed manner.
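The role of a fuzzy concept hierarchy in fuzzy document retrieval, where fuzzy membership weights the query terms, can be sketched as follows. This is only an illustration under stated assumptions: the hierarchy is assumed to be a mapping from a term to related terms with membership values in [0, 1], and the function names, the threshold of 0.5 and the sample data are hypothetical rather than the technique developed in this thesis.

```python
def expand_query(terms, hierarchy, threshold=0.5):
    """Expand query terms with related terms, weighted by fuzzy membership.

    hierarchy maps a term to {related_term: membership}; this format is assumed.
    """
    weights = {t: 1.0 for t in terms}  # original query terms keep full weight
    for t in terms:
        for related, mu in hierarchy.get(t, {}).items():
            if mu >= threshold and related not in terms:
                weights[related] = max(weights.get(related, 0.0), mu)
    return weights


def score(doc_terms, weights):
    """Sum the weights of the query terms that occur in the document."""
    return sum(weights.get(term, 0.0) for term in doc_terms)


def rank(docs, weights):
    """Return document ids ordered by decreasing score."""
    return sorted(docs, key=lambda d: -score(docs[d], weights))


# Hypothetical fuzzy concept hierarchy and document collection.
hierarchy = {"clustering": {"fuzzy clustering": 0.8, "classification": 0.3}}
docs = {
    "a": ["clustering", "fuzzy clustering"],
    "b": ["classification"],
    "c": ["fuzzy clustering"],
}
weights = expand_query(["clustering"], hierarchy)
print(weights)
print(rank(docs, weights))
```

In this sketch a document that mentions only a closely related term still receives a non-zero score, while weakly related terms below the threshold are ignored.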

1.6 Major Contributions

As a result of this research, we have developed different novel techniques in the areas of knowledge management and data mining listed in Section 1.5. The major contributions of this research are summarized as follows:

• Context-based Cluster Analysis (CCA) Technique. We have proposed a cluster analysis technique based on the Formal Concept Analysis (FCA) [42] theory. The proposed technique aims to find relationships among multiple sets of resultant clustering data. The CCA technique is designed generically so that it can be used by other applications. In this research, the CCA technique is used to support advanced search functions for expert finding and research trend detection.

• CCA-based Expert Finding Approach. We have proposed a CCA-based approach for finding experts in research areas. The proposed approach is able to find information on experts from a global perspective instead of limiting only to a specific organization or domain as in most existing approaches. In addition, the proposed approach can also provide information on the expertise or research areas of the experts, which are represented by some significant keywords. The identified experts and expertise information can be used for visualization and information retrieval.

• CCA-based Research Trend Detection Approach. We have proposed a CCA-based approach for detecting trends in research areas. The proposed approach can detect research trends in a fully automatic manner. In addition, the trends detected using the proposed approach can always be treated as current trends, and the associated statistical information can be gathered without being limited to any specific time periods. The detected trends can be used for information retrieval.


uncertain data, and then constructs the fuzzy concept hierarchy through the proposed fuzzy conceptual clustering technique. This technique aims to construct a concept hierarchy of conceptual clusters of research areas for supporting fuzzy document retrieval on the Scholarly Semantic Web.

• Scholarly Ontology Generation. We have proposed an ontology generation approach for generating scholarly ontology from the scholarly knowledge discovered from a citation database. The scholarly knowledge that is converted into ontology includes expert knowledge, trend knowledge and the research concept hierarchy. The scholarly ontologies derived from a citation database are then integrated and populated into the Scholarly Ontology. The generated Scholarly Ontology can be used to support advanced scholarly information retrieval functions. Using our techniques, many Scholarly Ontologies may be generated and stored in various academic Semantic Web sites in a distributed manner.

• Scholarly Semantic Web Architecture. In this research, we have proposed a distributed architecture for the Scholarly Semantic Web. With the proposed architecture, the system is capable of exploring, exchanging and sharing scholarly information on the Semantic Web environment. The proposed system provides not only basic citation-based search functions, but also advanced search functions for expert finding, trend detection and fuzzy document retrieval.
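Since the CCA technique listed among the contributions above builds on Formal Concept Analysis (FCA), a small sketch of plain FCA may help. It is not the CCA technique itself: it only enumerates the formal concepts of a tiny cross-table by brute force, and the function names and the document-keyword context are hypothetical.

```python
from itertools import combinations


def common_objects(attrs, context):
    """Objects that have every attribute in attrs."""
    return {obj for obj, obj_attrs in context.items() if attrs <= obj_attrs}


def common_attributes(objs, context, all_attrs):
    """Attributes shared by every object in objs (all attributes if objs is empty)."""
    shared = set(all_attrs)
    for obj in objs:
        shared &= context[obj]
    return shared


def formal_concepts(context):
    """Enumerate all (extent, intent) pairs by closing every attribute subset.

    Brute force over attribute subsets, so only suitable for very small contexts.
    """
    all_attrs = set().union(*context.values())
    concepts = set()
    for size in range(len(all_attrs) + 1):
        for subset in combinations(sorted(all_attrs), size):
            extent = common_objects(set(subset), context)
            intent = common_attributes(extent, context, all_attrs)
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts


# Hypothetical cross-table: documents (objects) against keywords (attributes).
context = {
    "d1": {"clustering", "ontology"},
    "d2": {"clustering"},
    "d3": {"ontology", "semantic web"},
}
for extent, intent in sorted(formal_concepts(context), key=lambda c: -len(c[0])):
    print(sorted(extent), sorted(intent))
```

Each printed pair is a formal concept: the extent is exactly the set of documents sharing the intent, and the intent is exactly the set of keywords shared by the extent. Ordering concepts by extent inclusion yields the concept lattice used in FCA.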

1.7 Organization of the Thesis

This chapter has discussed the background and motivation of this research work. The objectives of the research have been given. We have also listed the contributions that have been achieved. The rest of the thesis is organized as follows.

Chapter 2 reviews the Semantic Web and the state-of-the-art Semantic Web technologies, which include ontology, Semantic Web Portals and Semantic Web Services.

In Chapter 3, we discuss the proposed cluster analysis technique for mining cluster relationships from multiple clustering data. The technique, which is known as Context-based Cluster Analysis, is capable of representing cluster relationships among multiple clusters as mathematical models. The performance of the proposed technique is also


Chapter 4 and Chapter 5 discuss the proposed CCA-based approaches for findingexperts and detecting trends in research areas The performance of the proposed CCA-based approaches is also evaluated

In Chapter 6, we discuss the proposed fuzzy concept hierarchy generation technique, which is based on Formal Concept Analysis (FCA) theory. The proposed technique is used to construct a fuzzy concept hierarchy from uncertain data. The performance of the proposed technique is evaluated on three applications: the generation of a research concept hierarchy, a machine faults concept hierarchy and a news topic themes concept hierarchy.

Chapter 7 discusses the proposed techniques for generating scholarly ontology from the discovered scholarly knowledge. This chapter discusses three approaches to ontology generation, namely the Fuzzy Ontology Generation frAmework (FOGA), the Cluster-based Ontology Generation frAmework (COGA) and the Ontology Integration Framework (OIF).

Chapter 8 presents the proposed Scholarly Semantic Web system. In this chapter, the distributed architecture of the system is given. We then discuss the Scholarly Semantic Web Services that make scholarly knowledge understandable, sharable and accessible in the Semantic Web environment.

Finally, Chapter 9 concludes the thesis with a summary and states the directions for future research.

2 The Semantic Web

In this chapter, we review the development and the state of the art of technologies for the Semantic Web. First, we discuss the traditional World Wide Web, which is the precursor of the Semantic Web, and markup languages. Then, the Semantic Web is introduced. Next, we discuss ontology, which is adopted for knowledge representation on the Semantic Web, and in particular ontology description languages. Finally, we discuss Semantic Web Portals for Semantic Web applications, and Web Services for the delivery of services on the Semantic Web.

The World Wide Web, proposed by Tim Berners-Lee [43], is the universe of network-accessible information. On the Web, information is provided via web resources, such as text, images and multimedia files. A web resource is identified by a Uniform Resource Identifier (URI) or Uniform Resource Locator (URL) address, and metadata [44] is used to describe web resources. Metadata is basically structured data about data. In computer science, metadata is used to describe resources so that they can be better understood by programs. On the Web, metadata is associated with web resources and is given in a markup language to describe web information. There are three main markup languages for describing web resources: Hypertext Markup Language (HTML), Extensible Markup Language (XML) and Resource Description Framework (RDF).


2.1.1 Hypertext Markup Language

The Hypertext Markup Language (HTML) [43] is used to mark up documents using HTML tags. HTML describes the structure of documents, so that programs can understand and parse web documents. It is considered a standard representation for metadata in the traditional Web. However, as HTML tags are mainly used for presentation purposes by web browsers, the number of HTML tags supported is limited. Moreover, HTML does not allow users to define their own tags. Thus, HTML provides little support for semantic information. An extension of HTML in Ontobroker [45] enables users to define a limited set of their own tags, but it causes the information to be duplicated when annotated on Web pages, so it is not very practical. Therefore, a new markup language for metadata is needed for representing semantic information.

2.1.2 Extensible Markup Language

The Extensible Markup Language (XML) [46] was introduced by the XML Working Group of the World Wide Web Consortium (W3C). It was developed from the Standard Generalized Markup Language (SGML) [47]. XML syntax is similar to that of HTML. It also provides a set of rules on using tags to mark up documents, so that documents can be represented in a consistent syntax. In addition, XML enables users to define new markup types to describe information using a Document Type Definition (DTD). Thus, XML overcomes the weakness of HTML, as XML enables users to define their own tags. For example, tags used to mark up scientific publications can be represented as shown in Figure 2.1.

XML is general, extensible and open for describing information in documents. However, the same semantic information can be represented in different ways using XML. Figure 2.2 gives another representation of the knowledge presented in Figure 2.1. Due to the lack of a standard for knowledge representation, XML can offer only syntactic support, not semantic support, when used for representing information. That is, information represented in an XML document can be parsed easily by computer programs, but it is difficult to obtain semantic meaning from the parsed information. This disadvantage,


<!DOCTYPE publication SYSTEM "http://www.imdb.com/publication.dtd">
<article title="Semantic Web Roadmap" date="2001">
<author>Tim Berners-Lee</author>
</article>

Figure 2.2: Another representation of the publication using XML
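The gap between syntactic and semantic support can be illustrated with a minimal Python sketch. The attribute-based document below is modelled on Figure 2.2; the element-based document, including the tag names `publication`, `year` and `writer`, is a hypothetical alternative assumed for this illustration only. Both parse without error, yet a program written for one layout finds nothing in the other.

```python
import xml.etree.ElementTree as ET

# Attribute-based encoding, modelled on Figure 2.2.
ATTRIBUTE_STYLE = """
<article title="Semantic Web Roadmap" date="2001">
  <author>Tim Berners-Lee</author>
</article>
"""

# Hypothetical element-based encoding of the same facts (an assumption
# made for this sketch; the tag names are invented for illustration).
ELEMENT_STYLE = """
<publication>
  <title>Semantic Web Roadmap</title>
  <year>2001</year>
  <writer>Tim Berners-Lee</writer>
</publication>
"""


def title_from_attribute_style(xml_text):
    """Read the title assuming it is stored as an attribute of the root."""
    return ET.fromstring(xml_text).get("title")


def title_from_element_style(xml_text):
    """Read the title assuming it is stored in a child <title> element."""
    node = ET.fromstring(xml_text).find("title")
    return None if node is None else node.text


# Each reader works on the layout it was written for ...
attr_title = title_from_attribute_style(ATTRIBUTE_STYLE)
elem_title = title_from_element_style(ELEMENT_STYLE)
# ... but the attribute-based reader finds no title in the other layout,
# although the document is well-formed and states the same fact.
mismatch = title_from_attribute_style(ELEMENT_STYLE)
```

Both documents are syntactically valid, so the parser gives no hint that they describe the same publication; that agreement has to come from a shared convention outside XML itself.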

2.1.3 Resource Description Framework

Resource Description Framework (RDF) [48] is based on the XML syntax to represent information in a semantic manner. An RDF data model consists of a triple: subject, predicate and object. It provides a standard for representing semantic relations, i.e. a subject is a predicate of an object. This is shown in Figure 2.3. For example, the relation between a researcher and a publication can be considered as a triple: a researcher (subject) is an author (predicate) of a publication (object). This relation can be represented using RDF as shown in Figure 2.4.

RDF gives a standard to describe semantic relations. As such, information in an RDF document can be comprehended by different computer programs with the same meaning. In other words, RDF provides not only syntactic support but also semantic support when representing knowledge. RDF thus overcomes the limitation of XML and becomes very useful for representing semantic information on the Web. Currently, many ontological languages developed on the basis of RDF are used to represent ontological knowledge.


Subject Predicate Object

Figure 2.3: RDF data model

<rdf:RDF>
<rdf:Description about="http://www.w3.org/DesignIssues/Semantic.html">
<s:author>Tim Berners-Lee</s:author>
</rdf:Description>
</rdf:RDF>

Figure 2.4: Representation of a publication using RDF
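The triple model can be sketched in a few lines of Python. This is an illustrative in-memory structure, not an RDF library: the `Triple` type, the `match` helper and the plain-string predicate names `author` and `title` are assumptions made for the sketch. Following the layout of Figure 2.4, the described resource is taken as the subject and the literal value as the object; attaching the title from Figure 2.2 to the same URL is likewise an assumption.

```python
from collections import namedtuple

# A statement in the RDF data model: subject, predicate, object.
Triple = namedtuple("Triple", ["subject", "predicate", "object"])

PAGE = "http://www.w3.org/DesignIssues/Semantic.html"

# Hypothetical graph for illustration; predicate names are plain strings
# here, whereas real RDF would use URIs.
GRAPH = [
    Triple(PAGE, "author", "Tim Berners-Lee"),
    Triple(PAGE, "title", "Semantic Web Roadmap"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.object == obj)
    ]


# Every program that knows the predicate "author" asks the same question
# in the same way, regardless of how the document was serialised.
authors = [t.object for t in match(GRAPH, predicate="author")]
about_page = match(GRAPH, subject=PAGE)
```

Because every statement has the same three-part shape, a single pattern-matching routine is enough to query any graph, which is the uniformity that XML alone does not provide.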

Currently, information available on the Web has been designed for humans to understand. Programs can be written to process, analyze and index web pages to help humans process the information. However, due to the lack of machine-readable structure and knowledge representation in web documents, programs are unable to comprehend web page contents precisely, and hence semantic information cannot be retrieved from web documents. As such, a method for representing knowledge such that programs can understand, share and exchange knowledge is needed. To tackle the problem, Tim Berners-Lee et al. [20] have proposed the Semantic Web, which is defined as an extension of the current Web in which information is given well-defined meaning, better enabling computers and people to work in cooperation.

Figure 2.5: Architecture of the Semantic Web.

The Semantic Web rests on its ability to represent real-life domains accurately, so that programs can fully understand the environment in which they operate. In summary, the Semantic Web provides the following benefits:

the Semantic Web. Hence, knowledge carried on the Semantic Web can be shared and reused among different programs.

• Users can interact with programs using a semantic query language to specify their requests, thereby improving the retrieval performance.

• The deductive mechanism used to derive new information from existing information can be described clearly, so that knowledge can be reasoned with efficiently.

The architecture of the Semantic Web consists of seven layers [20]. Currently, this architecture is considered a standard for developing the Semantic Web. Figure 2.5 shows the architecture of the Semantic Web, which comprises the following layers: Foundation, XML Schema, RDF Schema, Ontology, Logic, Proof and Trust. They are briefly described as follows:

• Foundation Layer. This layer consists of the basic addressing scheme, Uniform Resource Identifier (URI), and the document encoding method, Unicode. That is, the Semantic Web uses URI to identify resources and uses Unicode to encode the documents.

• Schema Layer. This layer comprises the XML + NS (Namespace) + xmlschema layer and the RDF + rdfschema layer. It defines objects and classes, their relations and constraints. The XML Schema (XMLS) [49] and RDF Schema (RDFS) [50], which are based on XML and RDF respectively, are used for these layers. Currently, RDFS is widely used to describe classes at the Schema Layers.

• Ontology Layer. This layer provides constructs for using meta-information to represent domain knowledge. In this layer, information is represented as ontology, which is adopted by the Semantic Web to define knowledge.

• Logic Layer. This layer infers more knowledge from the existing knowledge. It is often integrated with the Ontology Layer. In this layer, concepts and relationships defined in lower layers are converted into Turing-complete logic languages [51] in order to generate new knowledge.

• Proof Layer. This layer provides a mechanism that confirms whether a statement is true or not.

• Trust Layer. This layer provides a mechanism that resolves conflicts between knowledge carried by the Semantic Web to form the "Web of Trust."

• Digital Signature Layer. This layer uses public key cryptography [52] to secure documents.
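The role of the Logic Layer, deriving new knowledge from existing knowledge, can be sketched with naive forward chaining over triples. This is only an illustration under stated assumptions: the two rules (transitivity of `subClassOf` and inheritance of `type`) are a small fixed rule set rather than a full logic language, and the class `Resource` and the instance `journal_1` are hypothetical names introduced for the example.

```python
def infer(facts):
    """Apply two rules repeatedly until no new fact appears.

    Rule 1: (A subClassOf B) and (B subClassOf C) => (A subClassOf C)
    Rule 2: (x type A)       and (A subClassOf B) => (x type B)
    """
    known = set(facts)
    while True:
        new = set()
        for (s, p, o) in known:
            for (s2, p2, o2) in known:
                # Chain a fact whose object is the subject of a subclass link.
                if p2 == "subClassOf" and o == s2 and p in ("subClassOf", "type"):
                    new.add((s, p, o2))
        if new <= known:          # fixpoint reached: nothing new was derived
            return known
        known |= new


# Hypothetical facts for illustration.
FACTS = {
    ("Journal", "subClassOf", "PublicationMedium"),
    ("PublicationMedium", "subClassOf", "Resource"),
    ("journal_1", "type", "Journal"),
}

closure = infer(FACTS)
derived = closure - FACTS
```

Here `derived` holds the three statements that were never asserted explicitly: that `Journal` is a subclass of `Resource`, and that `journal_1` is an instance of both `PublicationMedium` and `Resource`.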

As discussed earlier, metadata is the key technology for representing knowledge on the Semantic Web. To represent domain knowledge, ontology has been proposed as the primary metadata, since it can describe knowledge in a domain expressively and accurately [53]. Ontology has many definitions [54]. Tom Gruber [55] gives the most commonly cited definition, which defines ontology as a formal, explicit specification of a shared conceptualization. Conceptualization refers to an abstract model of phenomena in the


Currently, ontology is regarded as a standard conceptual model for knowledge representation, especially in the Semantic Web research area. Recently, the term ontology engineering has been used to refer to ontology-related research in computer science [56]. The current issues in ontology engineering include ontology generation, ontology mapping [57, 58], ontology integration [59, 60, 61] and ontology versioning [62]. Among them, ontology generation is the major research topic of this thesis. Thus, a detailed discussion on ontology generation techniques will be given in Chapter 7.

To evaluate the quality of ontologies used in applications, various ontology evaluation methods have been proposed [63]. ONTOMETRIC [64] is a framework proposed to evaluate and select ontologies that match criteria predefined by a system. In information retrieval systems, typical information retrieval measures such as precision, recall and F-measure are commonly used for ontology evaluation [65, 66]. For text-based ontologies, evaluation methods based on word similarity and frequency have been proposed [67, 68]. ODEVal [69] is an ontology evaluation method that is based on graph theory to detect inconsistencies and redundancies in the concept taxonomies of ontologies. In this research, we intend to develop a Scholarly Semantic Web supporting scholarly information retrieval. Thus, information retrieval measures (e.g. recall and precision) will be used to evaluate the ontological knowledge constructed.
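For reference, the standard set-based definitions of these measures can be sketched as follows. The function name and the document identifiers are hypothetical, and the F-measure shown is the balanced variant (the harmonic mean of precision and recall); the exact formulation used in later chapters may differ.

```python
def precision_recall_f(retrieved, relevant):
    """Compute precision, recall and balanced F-measure for one query.

    precision = |retrieved ∩ relevant| / |retrieved|
    recall    = |retrieved ∩ relevant| / |relevant|
    F         = 2 * precision * recall / (precision + recall)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical query result: four documents returned, three truly relevant,
# two of which were found.
p, r, f = precision_recall_f(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5"])
```

In this example two of the four returned documents are relevant (precision 0.5) and two of the three relevant documents were found (recall 2/3), which gives an F-measure of 4/7.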

In the Semantic Web, an ontology is described using an ontology description language. Ontology description languages are based on Web metadata description languages, and can be classified into the following three groups: HTML-based, XML-based and RDF-based.

2.4.1 HTML-based Ontology Description Languages

The tags supported by the traditional Web are sufficient to represent some semantic knowledge. Simple HTML Ontology Extensions (SHOE) [70, 24] and Ontobroker [71] have embedded additional tags into HTML in order to represent knowledge. Figure 2.6 gives an example of using SHOE to describe information about a student. Figure 2.7 gives an example of using Ontobroker to annotate a researcher's home page, in which the researcher's first name and last name are annotated. However, as discussed in Section 2.1.1, HTML does not support user-defined tags. Therefore, it is difficult to define ontology classes with HTML-based languages. XML-based ontology description languages were then proposed to overcome this limitation.

Figure 2.6: Representation of semantic information using SHOE.

2.4.2 XML-based Ontology Description Languages

XML-based ontology description languages are usually based on XML Schema (XMLS) or Document Type Definition (DTD). As discussed in Section 2.1, DTD allows users to define new markup types to describe information. Therefore, users can define ontology classes using DTD. Moreover, XMLS supports the definition of relations between classes


Figure 2.7: Representation of semantic information using Ontobroker.

when representing ontological knowledge [45]:

• XML lacks a mechanism to define some specific relationships that are usually central in ontologies, such as is-a or element-of relationships.

• XML does not support any notion of inheritance, which is an important attribute of ontologies.

• In XML, concepts are defined through tags, which can contain either a string or a combination of other nested tags. Such a mechanism may not be sufficient for defining concepts in an ontology, which may require richer data structures.

• In XML, the order of tags appearing in a document must be defined in advance. In contrast, the ordering of attribute descriptions does not matter in an ontology.

As such, RDF, an extension of XML, is more widely used for ontology description on the Web, since RDF supports mechanisms that can potentially overcome these problems.


2.4.3 RDF-based Ontology Description Languages

As discussed earlier, RDF extends XML to define a standard for knowledge representation. In addition, RDF Schema (RDFS) can be used to define classes and class hierarchies in domains. In ontology representation, the standardization supported by RDF provides two important contributions [45]:

• A standard set of modeling primitives (e.g. class, instance, etc.) and their relationships (e.g. subclass) are provided.

• A standardized syntax for writing ontologies is supported.

These contributions help RDF to overcome the problems faced by XML when representing ontological knowledge. Hence, RDFS can be used to represent ontology more effectively than XML. The popular RDF-based ontology description languages include DARPA Agent Markup Language (DAML), Ontology Inference Language (OIL), DAML+OIL and Web Ontology Language (OWL).

DARPA Agent Markup Language

The DARPA Agent Markup Language (DAML) [72], or DAML-ONT, extends RDFS to represent ontology using the object-oriented approach. It embeds some object-oriented concepts to represent classes. Thus, the class representation of DAML-ONT is better than that of RDF. Figure 2.8 gives an example of using DAML-ONT to represent the class "Journal", which is a subclass of the class "Publication Medium", but is disjoint with the classes "Conference" and "Workshop" (i.e. an object that belongs to the class "Journal" cannot belong to the classes "Conference" or "Workshop").
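The meaning of the disjointness constraint in this example can be sketched in Python. The dictionaries and the `violations` helper below are illustrative assumptions, not DAML-ONT syntax or tooling, and the object names `venue_1` and `venue_2` are hypothetical.

```python
# Class axioms taken from the "Journal" example described above.
SUBCLASS_OF = {"Journal": "PublicationMedium"}
DISJOINT_WITH = {"Journal": {"Conference", "Workshop"}}


def are_disjoint(a, b):
    """Disjointness is symmetric, so check the declaration in both directions."""
    return b in DISJOINT_WITH.get(a, set()) or a in DISJOINT_WITH.get(b, set())


def violations(memberships):
    """List (object, class_a, class_b) for objects placed in disjoint classes.

    memberships maps an object name to the set of classes it is asserted
    to belong to.
    """
    found = []
    for obj, classes in sorted(memberships.items()):
        ordered = sorted(classes)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                if are_disjoint(a, b):
                    found.append((obj, a, b))
    return found


# Hypothetical assertions: venue_2 is wrongly placed in two disjoint classes.
conflicts = violations({
    "venue_1": {"Journal"},
    "venue_2": {"Journal", "Workshop"},
})
```

A reasoner built on such axioms can reject inconsistent data: `conflicts` reports `venue_2` because it is asserted to be both a "Journal" and a "Workshop".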

Ontology Inference Language

The Ontology Inference Language (OIL) [73] extends RDFS to represent ontology. It is designed based on the following three criteria:


Figure 2.8: Class representation using DAML-ONT.

• Description Logic. It describes knowledge using logic rules. Thus, knowledge is represented mathematically and can be processed by programs.

• Using Web Standard. It is based on XML and RDFS.

Figure 2.9 gives an example of class representation using OIL. In Figure 2.9, the class "animal" is first defined, followed by the class "plant". However, the class "plant" is defined with the operator "NOT" to state that it is strictly not identical to the class "animal" (i.e. objects that belong to the class "animal" cannot belong to the class "plant" and vice versa). Finally, the class "tree" is defined as a subclass of "plant".

Compared with DAML, OIL can represent class properties better, but DAML can represent class relationships more clearly. Hence, they can be combined to form a better ontology description language.
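One way to read the "NOT" operator in this example is set-theoretically: each class denotes a set of objects, and "NOT animal" denotes the complement of that set within the domain. The sketch below checks whether a candidate assignment of objects to classes respects the three definitions. The domain members and the `satisfies` function are hypothetical, introduced only for illustration.

```python
# Hypothetical finite domain of objects.
DOMAIN = {"oak", "rose", "cat"}


def satisfies(interpretation):
    """Check the axioms: plant is within NOT animal, tree is within plant."""
    animal = interpretation["animal"]
    plant = interpretation["plant"]
    tree = interpretation["tree"]
    not_animal = DOMAIN - animal      # complement of "animal" in the domain
    return plant <= not_animal and tree <= plant


# A consistent assignment: no object is both an animal and a plant.
consistent = {"animal": {"cat"}, "plant": {"oak", "rose"}, "tree": {"oak"}}

# An inconsistent assignment: "oak" is asserted to be an animal and a plant.
inconsistent = {"animal": {"cat", "oak"}, "plant": {"oak", "rose"}, "tree": {"oak"}}

ok = satisfies(consistent)
bad = satisfies(inconsistent)
```

Because "tree" is a subset of "plant" and "plant" is a subset of the complement of "animal", it also follows that no tree can be an animal, even though that is never stated directly.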

DAML+OIL

DAML+OIL [74] combines DAML and OIL. It defines class relationships based on DAML, while class properties are defined in a similar way to OIL. Hence, DAML+OIL takes advantage of both DAML and OIL. Figure 2.10 gives an example of class representation using DAML+OIL. The piece of DAML+OIL code in Figure 2.10 begins with some header information stating the URLs that define the syntaxes of RDF, XML, XML Schema and DAML+OIL. The defined ontology is then given with its name (i.e. "Scholarly Information") together with its version and comments. Then, the basic DAML+OIL classes defined at the URL "http://www.w3.org/2001/10/daml+oil" are imported. Next, three classes are defined: "Publication Medium", "Conference" and "Journal". "Conference" and "Journal" are subclasses of "Publication Medium", and they are disjoint from each other.

Figure 2.9: Class representation using OIL.

Web Ontology Language

Web Ontology Language (OWL) [75] extends DAML+OIL in order to enable users to define various types of relationships between classes. Properties can also be defined using additional constructs in OWL. OWL has three sub-languages: OWL Lite, OWL DL and OWL Full. Even though these sub-languages use the same OWL syntax, they differ slightly in design and target different communities of users as follows:

• OWL Lite only supports classification hierarchies and simple constraints when designing classes.

• OWL DL includes all OWL language constructs, but they can be used only under certain restrictions (e.g. a class cannot be an instance of another class).

• OWL Full allows all OWL language constructs to be used without any restriction.
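The OWL DL restriction mentioned above, that a class cannot be an instance of another class, can be illustrated with a simple check over triples. This is a deliberately simplified sketch: real OWL DL imposes further restrictions, and the function name, the marker `Class` and the example names are assumptions made for illustration.

```python
def classes_used_as_instances(triples):
    """Return declared classes that are also asserted to be instances.

    A name counts as a class if it is declared with (name, "type", "Class").
    It is flagged if it also appears as (name, "type", X) with X != "Class".
    """
    classes = {s for (s, p, o) in triples if p == "type" and o == "Class"}
    return sorted(
        {s for (s, p, o) in triples
         if p == "type" and o != "Class" and s in classes}
    )


# Hypothetical ontology: "Journal" is declared as a class but is also
# asserted to be an instance of the class "PublicationKind".
TRIPLES = [
    ("Journal", "type", "Class"),
    ("PublicationKind", "type", "Class"),
    ("Journal", "type", "PublicationKind"),
    ("journal_1", "type", "Journal"),
]

flagged = classes_used_as_instances(TRIPLES)
```

An ontology for which `flagged` is non-empty would fall outside OWL DL under this restriction, while remaining acceptable in OWL Full.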


Figure 2.10: Class representation using DAML+OIL.

is similar to that of Figure 2.10. Next, the ontology name ("Scholarly Information") and its version are given. Subsequently, three classes, namely "Concept1", "Concept2" and "Concept3", are defined. "Concept1" is labeled with the keyword "Data Mining", and "Concept2" with "Fuzzy Logic". "Concept3" is a subclass of both "Concept1" and "Concept2", and is labeled with the keywords "Data Mining" and "Fuzzy Logic".


Figure 2.11: Class Representation using OWL.

The rapid development of the Semantic Web has prompted the development of applications based on the Semantic Web environment. Many Semantic Web-based systems such as AIFB [30], Esperonto [31], OntoWeb [32], Empolis K42 [33] and Mondeca ITM [34] have been developed. These systems are referred to as Semantic Web Portals [29].
