Fig 64.10 NHECD model.
64.3 NHECD implementation
NHECD is built around Documentum [11], an enterprise content management (ECM) system. Documentum acts as a central repository for documents (i.e., unstructured data), metadata (semi-structured to structured data, mostly in XML) and extracted information (mostly structured, tabular data).
NHECD assumes that all target scientific papers to be included in the repository are in Adobe PDF format. A review of several websites hosting scientific papers shows that this is a safe choice. If other formats are found, the crawler can be instructed to convert them to PDF almost seamlessly.
[11] http://www.emc.com/products/family/documentum-family.htm
64.3.1 Taxonomies
Each taxonomy deals with a certain aspect of the Nanotox domain. The taxonomies were built by teams of domain experts and information management experts. Taxonomies are not an expected output of NHECD; yet, they are essential to the NHECD process. Hence, building them was one of the initial steps in the NHECD implementation. The taxonomies are listed in Table 64.1.
Figure 64.12 shows the taxonomies describing the subject “commercial NP characterization”.
Fig 64.11 NHECD process (stages: rating, information extraction, annotation, metadata, taxonomies).
Fig 64.12 Commercial characterization of NP.
Table 64.1 Taxonomies (by subject).
- Species
- Metabolism
- Distribution
- Effecting agents
- Article
- Electron beam methods
- Optical methods
64.3.2 Crawling
Crawling is the process of automatically obtaining scientific papers and data about each paper (such as the name of the author or authors, the publication date, the name of the journal, keywords, the abstract, and in general any detail made available with the paper itself) by visiting scientific paper repositories available on the web (whether restricted to subscribers or available to everyone) and searching the paper text by keywords. NHECD developed a crawler for Pubmed [12]. Crawlers for other leading scientific sites, such as ISIWEB [13] or SciFinder [14], are at a development stage. The main obstacles relate to intellectual property issues and the efforts by the publishers to enforce them.
[12] http://www.ncbi.nlm.nih.gov/pubmed/
[13] http://www.webofknowledge.com/
[14] http://pubs.acs.org/
The crawler is written in Java. It takes a set of keywords as input. Using the websites' APIs [15], it obtains a list of pointers to the targeted scientific papers. Those pointers are processed to transform them into downloadable links. If a downloadable link is obtained, the scientific paper is downloaded, provided that NHECD has access to the paper (e.g., there is a subscription to the resource, or it is publicly available). The paper (if available, otherwise its placeholder), along with the metadata already converted to XML, is uploaded to the NHECD document repository.
[15] Application Program Interface
64.3.3 Information extraction
The goals of information extraction in NHECD are:
1. To enable users to ask specific questions about specific attributes and receive answers. If possible, a link to the paper is given, along with a pointer to the location of the requested information within the document.
2. To enable, in the future, data mining on the extracted data (e.g., pattern discovery).
The process starts with a multistep preprocessing stage (a partial sketch follows the list):
1. Convert the input documents from PDF to text.
2. Perform parsing and stemming.
3. Perform zoning within the document.
4. Classify the document according to the NHECD taxonomies.
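A minimal sketch of preprocessing steps 1 and 2, assuming the Apache PDFBox 2.x library for the PDF-to-text conversion; the whitespace tokenizer and the naive suffix-stripping stemmer merely stand in for whatever parser and stemmer NHECD actually uses (steps 3 and 4 are domain-specific and omitted).

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PreprocessSketch {

    /** Step 1: convert a PDF document to plain text (PDFBox 2.x API). */
    static String pdfToText(File pdf) throws IOException {
        try (PDDocument doc = PDDocument.load(pdf)) {
            return new PDFTextStripper().getText(doc);
        }
    }

    /** Step 2: naive tokenization and suffix-stripping stemming;
     *  a production pipeline would use a real stemmer (e.g., Porter's). */
    static List<String> parseAndStem(String text) {
        List<String> stems = new ArrayList<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) {
                continue;
            } else if (token.endsWith("ing") && token.length() > 5) {
                token = token.substring(0, token.length() - 3);
            } else if (token.endsWith("s") && token.length() > 3) {
                token = token.substring(0, token.length() - 1);
            }
            stems.add(token);
        }
        return stems;
    }
}
```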
Next in the process is the tagging stage, used to recognize keywords, either by using the taxonomies or the values involved. As an example, the input phrase
“To determine the effect of particle size, labeled microspheres of 500 and 1000
nm in diameter were incubated with mouse melanoma B16 cells”
would result in the tagged form
“To determine the effect of particle size, labeled microspheres of <NUMBER_1> and <NUMBER_2> <LENGTH-UNIT_3> in diameter were incubated with
<SPECIES_4> <CELL-TYPE_5> <CELL-LINE_6> cells”
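A minimal sketch of such a tagger, assuming a combined-regex approach with a single running counter; the tiny hard-coded lexicon is illustrative only, whereas NHECD draws its keywords from the taxonomies. Run on the input phrase above, it reproduces the tagged form.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TaggerSketch {
    // Parallel arrays: one regex per tag type. These few entries are
    // illustrative; the real lexicon would come from the taxonomies.
    static final String[] REGEXES = {
            "\\b\\d+(?:\\.\\d+)?\\b",  // numeric values
            "\\b(?:nm|mm|cm)\\b",      // length units
            "\\bmouse\\b",             // species
            "\\bmelanoma\\b",          // cell types
            "\\bB16\\b"                // cell lines
    };
    static final String[] TAGS = {
            "NUMBER", "LENGTH-UNIT", "SPECIES", "CELL-TYPE", "CELL-LINE"
    };

    static String tag(String text) {
        // Build one alternation; capturing group g corresponds to TAGS[g - 1].
        StringBuilder alternation = new StringBuilder();
        for (int i = 0; i < REGEXES.length; i++) {
            if (i > 0) alternation.append('|');
            alternation.append('(').append(REGEXES[i]).append(')');
        }
        Matcher m = Pattern.compile(alternation.toString()).matcher(text);
        StringBuffer out = new StringBuffer();
        int counter = 0;  // tags are numbered in order of appearance
        while (m.find()) {
            for (int g = 1; g <= REGEXES.length; g++) {
                if (m.group(g) != null) {
                    m.appendReplacement(out, "<" + TAGS[g - 1] + "_" + (++counter) + ">");
                    break;
                }
            }
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(tag("To determine the effect of particle size, labeled "
                + "microspheres of 500 and 1000 nm in diameter were incubated "
                + "with mouse melanoma B16 cells"));
    }
}
```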
The pattern matching stage is based on the output of the previous stages and on the process of annotation, an auxiliary step performed by Nanotox domain experts to prepare a training set for this stage.
The tasks needed to obtain patterns are:
1. Define the list of features to be extracted (based on the taxonomy).
2. For each feature to be extracted, define a list of extraction patterns.
3. Each extraction pattern (p) consists of the following items (see the sketch after this list):
a) p.attributes – The associated attributes to be extracted (note that the same pattern can be used to extract several attributes concurrently).
b) p.precondition – A pre-condition that must hold for the pattern to apply.
c) p.match – A regular expression to be matched.
d) p.extraction – A regular extraction expression used for extracting the values, assuming that p.match has been matched.
e) p.scope – Determines the scope of the extracted values in the text.
f) p.store – An SQL query for storing the results in the database.
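A minimal sketch of how one such pattern could be represented, assuming Java; the field types and the scope values are illustrative, not NHECD's actual schema.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;

/** Illustrative container for one extraction pattern p and its items a)-f). */
public class ExtractionPattern {
    enum Scope { SENTENCE, PARAGRAPH, DOCUMENT }  // assumed scope granularities

    List<String> attributes;        // p.attributes: attributes extracted concurrently
    Predicate<String> precondition; // p.precondition: must hold before matching
    Pattern match;                  // p.match: regular expression to be matched
    Pattern extraction;             // p.extraction: pulls the values out of a match
    Scope scope;                    // p.scope: textual scope of the extracted values
    String store;                   // p.store: SQL statement persisting the results
}
```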
The closing stage of the process is the conflict resolution stage. It is required for cases where several contradicting patterns can be matched to the same text, or the same pattern can be matched to different parts of the text. One plausible resolution policy is sketched below.
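The source does not specify the resolution policy; the following sketch shows one plausible heuristic, keeping the longest matching span and breaking ties by a per-pattern priority.

```java
import java.util.Comparator;
import java.util.List;

public class ConflictResolverSketch {
    /** One candidate extraction: attribute, character span and pattern priority. */
    record Match(String attribute, int start, int end, int priority) {}

    /** Keeps the longest span; ties go to the higher-priority pattern. */
    static Match resolve(List<Match> conflicting) {
        return conflicting.stream()
                .max(Comparator.comparingInt((Match m) -> m.end() - m.start())
                        .thenComparingInt(Match::priority))
                .orElseThrow();
    }

    public static void main(String[] args) {
        Match kept = resolve(List.of(
                new Match("particle size", 10, 18, 1),
                new Match("particle size", 10, 30, 1)));
        System.out.println(kept);  // the longer match wins
    }
}
```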
The information extraction process is depicted in Figure 64.13.
Fig 64.13 The information extraction process.
64.3.4 NHECD products
The results of NHECD consist mainly of two products:
1. A repository of scientific papers related to Nanotox, augmented by metadata provided by authors and publishers, by metadata extracted from the papers using text mining algorithms, and by ratings of the articles based on methods adopted by NHECD. All of the above are indexed using the NHECD taxonomies; as a result, it is possible to retrieve scientific papers using sophisticated queries.
2. A set of structured facts extracted from the scientific papers, in tabular format. The structured facts should make it possible to perform data mining to obtain new, unforeseen knowledge.
64.3.5 Scientific paper rating
A scientific paper has a well-established life cycle. After the paper is written, refereed and eventually accepted, it is published. From this point in time, the paper can be cited.
The rating of a paper depends on several variables:
1. Journal name
2. Publication year
3. Full author names
4. For each citing article:
a) The citing article name (and a unique identifier for the paper itself; NHECD decided to adopt SICI [16] for this purpose)
b) The citing journal name
5. From the JCR (Journal Citation Reports), for each journal name (including citing journals):
a) Impact Factor
b) Cited Half-Life
6. H-indices [17] per author, from PoP (Publish or Perish)
[16] http://en.wikipedia.org/wiki/SICI
[17] http://en.wikipedia.org/wiki/Hirsch_number
The rating algorithm is applied when the paper is loaded, and then on a periodic basis, to reflect changes such as new citations and changes in impact factors, in “Cited Half-Life” and in other JCR data. The rating algorithm takes into account the publication date of newly published papers, to avoid less-than-fair ratings for such papers.
The scientific paper rating devised by NHECD is composed of a rating by Journal Impact Factor and a rating by H-indices. These components are defined below.
1. Rating by Journal Impact Factor

$$Rating_1(Article(i)) = 1 - 2^{-0.6\,\cdot\,CitationScore_{Article(i)}}$$

where

$$CitationScore_{Article(i)} = \frac{\sum_{Article(j)\,\in\,citations(Article(i))} Map\bigl(impact(Journal(Article(j)))\bigr)}{Age(Article(i))}$$

and $Map(\cdot)$ is a step function of the citing journal's impact factor.
2. Rating by H-indices

$$Rating_2(Article(i)) = 1 - 1.05^{-HScore_{Article(i)}}$$

where

$$HScore_{Article(i)} = \frac{\sum_{Article(j)\,\in\,citations(Article(i))} \operatorname{Average}_{Author(k)\,\in\,Article(j)}\bigl(H\text{-}Index_k\bigr)}{Age(Article(i))}$$
3. Final Rating

$$Rating(Article(i)) = \frac{\alpha_1\,Rating_1(Article(i)) + \alpha_2\,Rating_2(Article(i))}{\alpha_1 + \alpha_2},\qquad 0 \le \alpha_i \le 1$$
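A direct transcription of these formulas into Java; since the step function Map is not fully specified above, mapImpact is a hypothetical placeholder, and the figures in main are illustrative.

```java
public class RatingSketch {

    /** Rating by Journal Impact Factor: 1 - 2^(-0.6 * CitationScore). */
    static double rating1(double[] citingJournalImpactFactors, double ageYears) {
        double sum = 0;
        for (double impactFactor : citingJournalImpactFactors) {
            sum += mapImpact(impactFactor);
        }
        return 1 - Math.pow(2, -0.6 * (sum / ageYears));
    }

    /** Rating by H-indices: 1 - 1.05^(-HScore), where HScore sums the average
     *  author H-index per citing article and divides by the paper's age. */
    static double rating2(double[] avgHIndexPerCitingArticle, double ageYears) {
        double sum = 0;
        for (double avgHIndex : avgHIndexPerCitingArticle) {
            sum += avgHIndex;
        }
        return 1 - Math.pow(1.05, -(sum / ageYears));
    }

    /** Final rating: weighted combination with 0 <= alpha1, alpha2 <= 1. */
    static double finalRating(double r1, double r2, double alpha1, double alpha2) {
        return (alpha1 * r1 + alpha2 * r2) / (alpha1 + alpha2);
    }

    /** Hypothetical step function; the original case breakdown is not given. */
    static double mapImpact(double impactFactor) {
        if (impactFactor > 10) return 3;
        if (impactFactor > 3) return 2;
        return 1;
    }

    public static void main(String[] args) {
        double r1 = rating1(new double[] {2.5, 7.1, 12.0}, 3.0);
        double r2 = rating2(new double[] {14.0, 9.5, 21.0}, 3.0);
        System.out.println(finalRating(r1, r2, 0.5, 0.5));
    }
}
```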
64.3.6 NHECD Frontend
NHECD provides a free-access website that includes information retrieval functionalities to facilitate searching the NHECD repository.
It includes the following components:
1. An open-source content management system, implemented on Drupal, which stores and manages the entire frontend database (including user information and usage patterns).
2. The user interface component, which handles all input and requests from the user.
The frontend interacts with the backend repository, which is stored and managed on Documentum. Figure 64.14 shows the architecture design of the NHECD frontend.
Fig 64.14 Architecture.
User communities and characteristics – the NHECD frontend is designed to meet the different needs of three main communities and an additional group, the administrators:
1. Scientists – Users in this community will be scientists from academia and industry, the most expert users among the three communities. These users should have extensive prior knowledge in the domain of nanotoxicology. The system assumes that these users are proficient in information searches.
2. Regulators – Users working for (or on behalf of) government institutes and regulatory agencies are part of the NHECD regulatory community. This community aims at providing legislation and regulation on the health, safety or environmental concerns regarding the use of nano-particles. Usage patterns of this group often overlap with those of the other communities.
3. General public – This community is composed of individuals and NGOs who are active in a wide range of fields where information provided by NHECD may be relevant. We assume that most general public users are not able to read and evaluate the scientific material NHECD provides. Therefore, for this community the frontend mainly provides answers to queries on general information, light reviews, or news on the impact of exposure to nanoparticles.
4. Administrators – The administrator is in charge of managing the daily operation of the system. Administrators are responsible for managing user accounts, general settings and monitoring.
The NHECD frontend provides the following features:
1. Basic search
2. Advanced search
3. Intelligent search
4. Taxonomic navigation
5. Recommender results (i.e., recommendations based on the analysis of usage patterns of other users)
6. An option to resubmit queries, adding criteria for the refinement of results
7. Site registration
8. Personalization features
9. Display of a list of the most viewed papers
10. Links to other Nanotox-related sites
11. NHECD news, updates and FAQs
64.4 Conclusions
NHECD provides two important products:
1. An extensive and commented repository of scientific papers and other publications in the Nanotox area, searchable using taxonomies and full-text search. The scientific papers are rated according to published NHECD criteria, to help users better assess their findings. Such a repository significantly expands currently available repositories, because it goes beyond the mapping of existing research in Nanotox (as most current initiatives do): NHECD gives access to the results of the research papers, extracted from the sources using text mining algorithms. Access to scientific papers is granted to visitors subject to the copyright and restrictions imposed by publishers. This NHECD result is intended for Nanotox scientists, regulators and the general public.
2. A set of structured results extracted from the scientific papers populating the NHECD repository. Using these results, it will be possible to perform data mining, which will yield validated results and further knowledge discovery. This part of the NHECD results is targeted at Nanotox scientists and regulators.
Fig 64.15 NHECD 2.0 (process stages: rating, information extraction, table extraction, graph mining, annotation, metadata, taxonomies).
64.5 Further research
Graph and table mining
NHECD resorts to text mining algorithms, allowing for information extraction from textual data. However, scientific Nanotox papers (as in many other areas) often include other types of elements, such as graphs and tables. Moreover, the expressiveness of these elements is generally higher than that conveyed by text. Hence, expanding NHECD to include graph and table mining seems desirable. Preliminary research on these subjects by the NHECD team shows that, at least for some types of graphs and tables, the task is feasible. The concept of the future NHECD (dubbed NHECD 2.0) is shown in Figure 64.15.