Fig 64.10 NHECD model.
64.3 NHECD implementation
NHECD is built around Documentum [11], an enterprise content management (ECM) system. Documentum acts as a central repository for documents (i.e., unstructured data), metadata (semi-structured to structured data, mostly in XML) and extracted information (mostly structured, tabular data).
NHECD assumes that all target scientific papers to be included in the repository are in Adobe PDF format. A review of several websites hosting scientific papers shows that this is a safe choice. If other formats are found, the crawler can be instructed to convert them to PDF almost seamlessly.
[11] http://www.emc.com/products/family/documentum-family.htm
64.3.1 Taxonomies
Each taxonomy deals with a certain aspect of the Nanotox domain. The taxonomies were built by teams of domain experts and information management experts. Taxonomies are not an expected output of NHECD; yet, they are essential to the NHECD process. Hence, building them was one of the initial steps in the NHECD implementation. The taxonomies are listed in Table 64.1.
Figure 64.12 shows the taxonomies describing the subject “commercial NP characterization”.
Fig 64.11 NHECD process (stages: rating, information extraction, annotation, metadata, taxonomies).
Fig 64.12 Commercial characterization of NP.
Table 64.1 Taxonomies (by subject).
- Species
- Metabolism
- Distribution
- Effecting agents
- Article
- Electron beam methods
- Optical methods
64.3.2 Crawling
Crawling is the process of automatically obtaining scientific papers and data about each paper (such as the name of the author or authors, the publication date, the name of the journal, keywords, the abstract, and in general any detail made available with the paper itself) by visiting scientific paper repositories available on the web (whether restricted to subscribers or available to everyone) and searching the paper text by keywords. NHECD developed a crawler for Pubmed [12]. Crawlers for other leading scientific sites, such as ISIWEB [13] or SciFinder [14], are at a development stage. The main obstacles relate to intellectual property issues and the efforts by the publishers to enforce them.
[12] http://www.ncbi.nlm.nih.gov/pubmed/
[13] http://www.webofknowledge.com/
[14] http://pubs.acs.org/
The crawler is written in Java. It takes a set of keywords as input. Using the websites' APIs [15], it obtains a list of pointers to the targeted scientific papers. Those pointers are processed to transform them into downloadable links. If a downloadable link is obtained, the scientific paper is downloaded, provided that NHECD has access to the paper (e.g., there is a subscription to the resource, or it is publicly available). The paper (if available, otherwise its placeholder), along with the metadata already converted to XML, is uploaded to the NHECD document repository.
[15] Application Program Interface
64.3.3 Information extraction
The goals of information extraction in NHECD are:
1. To enable users to ask specific questions about specific attributes and receive answers. If possible, a link to the paper is given, along with a pointer to the location of the requested information within the document.
2. To enable, in the future, data mining on the extracted data (e.g., pattern discovery).
The process starts with a multistep preprocessing stage (a partial sketch follows the list):
1. Convert the input documents from PDF to text.
2. Perform parsing and stemming.
3. Perform zoning within the document.
4. Classify the document according to the NHECD taxonomies.
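A minimal sketch of preprocessing steps 1 and 2, assuming the Apache PDFBox 2.x library for the PDF-to-text conversion; the whitespace tokenizer and the naive suffix-stripping stemmer merely stand in for whatever parser and stemmer NHECD actually uses (steps 3 and 4 are domain-specific and omitted).

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PreprocessSketch {

    /** Step 1: convert a PDF document to plain text (PDFBox 2.x API). */
    static String pdfToText(File pdf) throws IOException {
        try (PDDocument doc = PDDocument.load(pdf)) {
            return new PDFTextStripper().getText(doc);
        }
    }

    /** Step 2: naive tokenization and suffix-stripping stemming;
     *  a production pipeline would use a real stemmer (e.g., Porter's). */
    static List<String> parseAndStem(String text) {
        List<String> stems = new ArrayList<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) {
                continue;
            } else if (token.endsWith("ing") && token.length() > 5) {
                token = token.substring(0, token.length() - 3);
            } else if (token.endsWith("s") && token.length() > 3) {
                token = token.substring(0, token.length() - 1);
            }
            stems.add(token);
        }
        return stems;
    }
}
```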
Next in the process is the tagging stage, used to recognize keywords, either by using the taxonomies or the values involved. As an example, the input phrase
“To determine the effect of particle size, labeled microspheres of 500 and 1000
nm in diameter were incubated with mouse melanoma B16 cells”
would result in the tagged form
“To determine the effect of particle size, labeled microspheres of <NUMBER_1> and <NUMBER_2> <LENGTH-UNIT_3> in diameter were incubated with
<SPECIES_4> <CELL-TYPE_5> <CELL-LINE_6> cells”
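A minimal sketch of such a tagger, assuming a combined-regex approach with a single running counter; the tiny hard-coded lexicon is illustrative only, whereas NHECD draws its keywords from the taxonomies. Run on the input phrase above, it reproduces the tagged form.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TaggerSketch {
    // Parallel arrays: one regex per tag type. These few entries are
    // illustrative; the real lexicon would come from the taxonomies.
    static final String[] REGEXES = {
            "\\b\\d+(?:\\.\\d+)?\\b",  // numeric values
            "\\b(?:nm|mm|cm)\\b",      // length units
            "\\bmouse\\b",             // species
            "\\bmelanoma\\b",          // cell types
            "\\bB16\\b"                // cell lines
    };
    static final String[] TAGS = {
            "NUMBER", "LENGTH-UNIT", "SPECIES", "CELL-TYPE", "CELL-LINE"
    };

    static String tag(String text) {
        // Build one alternation; capturing group g corresponds to TAGS[g - 1].
        StringBuilder alternation = new StringBuilder();
        for (int i = 0; i < REGEXES.length; i++) {
            if (i > 0) alternation.append('|');
            alternation.append('(').append(REGEXES[i]).append(')');
        }
        Matcher m = Pattern.compile(alternation.toString()).matcher(text);
        StringBuffer out = new StringBuffer();
        int counter = 0;  // tags are numbered in order of appearance
        while (m.find()) {
            for (int g = 1; g <= REGEXES.length; g++) {
                if (m.group(g) != null) {
                    m.appendReplacement(out, "<" + TAGS[g - 1] + "_" + (++counter) + ">");
                    break;
                }
            }
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(tag("To determine the effect of particle size, labeled "
                + "microspheres of 500 and 1000 nm in diameter were incubated "
                + "with mouse melanoma B16 cells"));
    }
}
```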
The pattern matching stage is based on the output of the previous stages and on the process of annotation, an auxiliary step performed by Nanotox domain experts to prepare a training set for this stage.
The tasks needed to obtain patterns are:
1. Define the list of features to be extracted (based on the taxonomy).
2. For each feature to be extracted, define a list of extraction patterns.
3. Each extraction pattern (p) consists of the following items (see the sketch after this list):
a) p.attributes – The associated attributes to be extracted (note that the same pattern can be used to extract several attributes concurrently).
b) p.precondition – A pre-condition that must hold for the pattern to apply.
c) p.match – A regular expression to be matched.
d) p.extraction – A regular extraction expression used for extracting the values, assuming that p.match has been matched.
e) p.scope – Determines the scope of the extracted values in the text.
f) p.store – An SQL query for storing the results in the database.
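A minimal sketch of how one such pattern could be represented, assuming Java; the field types and the scope values are illustrative, not NHECD's actual schema.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;

/** Illustrative container for one extraction pattern p and its items a)-f). */
public class ExtractionPattern {
    enum Scope { SENTENCE, PARAGRAPH, DOCUMENT }  // assumed scope granularities

    List<String> attributes;        // p.attributes: attributes extracted concurrently
    Predicate<String> precondition; // p.precondition: must hold before matching
    Pattern match;                  // p.match: regular expression to be matched
    Pattern extraction;             // p.extraction: pulls the values out of a match
    Scope scope;                    // p.scope: textual scope of the extracted values
    String store;                   // p.store: SQL statement persisting the results
}
```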
The closing stage of the process is the conflict resolution stage. It is required for cases where several contradicting patterns can be matched to the same text, or the same pattern can be matched to different parts of the text. One plausible resolution policy is sketched below.
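The source does not specify the resolution policy; the following sketch shows one plausible heuristic, keeping the longest matching span and breaking ties by a per-pattern priority.

```java
import java.util.Comparator;
import java.util.List;

public class ConflictResolverSketch {
    /** One candidate extraction: attribute, character span and pattern priority. */
    record Match(String attribute, int start, int end, int priority) {}

    /** Keeps the longest span; ties go to the higher-priority pattern. */
    static Match resolve(List<Match> conflicting) {
        return conflicting.stream()
                .max(Comparator.comparingInt((Match m) -> m.end() - m.start())
                        .thenComparingInt(Match::priority))
                .orElseThrow();
    }

    public static void main(String[] args) {
        Match kept = resolve(List.of(
                new Match("particle size", 10, 18, 1),
                new Match("particle size", 10, 30, 1)));
        System.out.println(kept);  // the longer match wins
    }
}
```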
The information extraction process is depicted in Figure 64.13.
Fig 64.13 The information extraction process.
64.3.4 NHECD products
The results of NHECD consist mainly of two products:
1. A repository of scientific papers related to Nanotox, augmented by metadata provided by authors and publishers, by metadata extracted from the papers using text mining algorithms, and by ratings of the articles based on methods adopted by NHECD. All of the above are indexed using the NHECD taxonomies; as a result, it is possible to retrieve scientific papers using sophisticated queries.
2. A set of structured facts extracted from the scientific papers, in tabular format. The structured facts should make it possible to perform data mining to obtain new, unforeseen knowledge.
64.3.5 Scientific paper rating
A scientific paper has a well-established life cycle. After the paper is written, refereed and eventually accepted, it is published. From this point in time, the paper can be cited.
The rating of a paper depends on several variables:
1. Journal name
2. Publication year
3. Full author names
4. For each citing article:
a) The citing article name (and a unique identifier for the paper itself; NHECD decided to adopt SICI [16] for this purpose)
b) The citing journal name
5. From the JCR (Journal Citation Reports), for each journal name (including citing journals):
a) Impact Factor
b) Cited Half-Life
6. H-indices [17] per author, from PoP (Publish or Perish)
[16] http://en.wikipedia.org/wiki/SICI
[17] http://en.wikipedia.org/wiki/Hirsch_number
The rating algorithm is applied when the paper is loaded, and then on a periodic basis, to reflect changes such as new citations and changes in impact factors, in “Cited Half-Life” and in other JCR data. The rating algorithm takes into account the publication date of newly published papers, to avoid less-than-fair ratings for such papers.
The scientific paper rating devised by NHECD is composed of a rating by Journal Impact Factor and a rating by H-indices. These components are defined below.
1. Rating by Journal Impact Factor

$$Rating_1(Article(i)) = 1 - 2^{-0.6\,\cdot\,CitationScore_{Article(i)}}$$

where

$$CitationScore_{Article(i)} = \frac{\sum_{Article(j)\,\in\,citations(Article(i))} Map\bigl(impact(Journal(Article(j)))\bigr)}{Age(Article(i))}$$

and $Map(\cdot)$ is a step function of the citing journal's impact factor.
2. Rating by H-indices

$$Rating_2(Article(i)) = 1 - 1.05^{-HScore_{Article(i)}}$$

where

$$HScore_{Article(i)} = \frac{\sum_{Article(j)\,\in\,citations(Article(i))} \operatorname{Average}_{Author(k)\,\in\,Article(j)}\bigl(H\text{-}Index_k\bigr)}{Age(Article(i))}$$
3. Final Rating

$$Rating(Article(i)) = \frac{\alpha_1\,Rating_1(Article(i)) + \alpha_2\,Rating_2(Article(i))}{\alpha_1 + \alpha_2},\qquad 0 \le \alpha_i \le 1$$
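A direct transcription of these formulas into Java; since the step function Map is not fully specified above, mapImpact is a hypothetical placeholder, and the figures in main are illustrative.

```java
public class RatingSketch {

    /** Rating by Journal Impact Factor: 1 - 2^(-0.6 * CitationScore). */
    static double rating1(double[] citingJournalImpactFactors, double ageYears) {
        double sum = 0;
        for (double impactFactor : citingJournalImpactFactors) {
            sum += mapImpact(impactFactor);
        }
        return 1 - Math.pow(2, -0.6 * (sum / ageYears));
    }

    /** Rating by H-indices: 1 - 1.05^(-HScore), where HScore sums the average
     *  author H-index per citing article and divides by the paper's age. */
    static double rating2(double[] avgHIndexPerCitingArticle, double ageYears) {
        double sum = 0;
        for (double avgHIndex : avgHIndexPerCitingArticle) {
            sum += avgHIndex;
        }
        return 1 - Math.pow(1.05, -(sum / ageYears));
    }

    /** Final rating: weighted combination with 0 <= alpha1, alpha2 <= 1. */
    static double finalRating(double r1, double r2, double alpha1, double alpha2) {
        return (alpha1 * r1 + alpha2 * r2) / (alpha1 + alpha2);
    }

    /** Hypothetical step function; the original case breakdown is not given. */
    static double mapImpact(double impactFactor) {
        if (impactFactor > 10) return 3;
        if (impactFactor > 3) return 2;
        return 1;
    }

    public static void main(String[] args) {
        double r1 = rating1(new double[] {2.5, 7.1, 12.0}, 3.0);
        double r2 = rating2(new double[] {14.0, 9.5, 21.0}, 3.0);
        System.out.println(finalRating(r1, r2, 0.5, 0.5));
    }
}
```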
64.3.6 NHECD Frontend
NHECD provides a free-access website that includes information retrieval functionalities to facilitate searching the NHECD repository.
It includes the following components:
1. An open-source content management system, implemented on Drupal, which stores and manages the entire frontend database (including user information and usage patterns).
2. The user interface component, which handles all input and requests from the user.
The frontend interacts with the backend repository, which is stored and managed on Documentum. Figure 64.14 shows the architecture design of the NHECD frontend.
Fig 64.14 Architecture.
User communities and characteristics – the NHECD frontend is designed to meet the different needs of three main communities and an additional group, the administrators:
1. Scientists – Users in this community will be scientists from academia and industry, the most expert users among the three communities. These users should have extensive prior knowledge in the domain of nanotoxicology. The system assumes that these users are proficient in information searches.
2. Regulators – Users working for (or on behalf of) government institutes and regulatory agencies are part of the NHECD regulatory community. This community aims at providing legislation and regulation on the health, safety or environmental concerns regarding the use of nano-particles. Usage patterns of this group often overlap with those of the other communities.
3. General public – This community is composed of individuals and NGOs who are active in a wide range of fields where information provided by NHECD may be relevant. We assume that most general public users are not able to read and evaluate the scientific material NHECD provides. Therefore, for this community the frontend mainly provides answers to queries on general information, light reviews, or news on the impact of exposure to nanoparticles.
4. Administrators – The administrator is in charge of managing the daily operation of the system. Administrators are responsible for managing user accounts, general settings and monitoring.
The NHECD frontend provides the following features:
1. Basic search
2. Advanced search
3. Intelligent search
4. Taxonomic navigation
5. Recommender results (i.e., recommendations based on the analysis of usage patterns of other users)
6. An option to resubmit queries, adding criteria for the refinement of results
7. Site registration
8. Personalization features
9. Display of a list of the most viewed papers
10. Links to other Nanotox-related sites
11. NHECD news, updates and FAQs
64.4 Conclusions
NHECD provides two important products:
1. An extensive and commented repository of scientific papers and other publications in the Nanotox area, searchable using taxonomies and full-text search. The scientific papers are rated according to published NHECD criteria, to help users better assess their findings. Such a repository significantly expands currently available repositories, because it goes beyond the mapping of existing research in Nanotox (as most current initiatives do): NHECD gives access to the results of the research papers, extracted from the sources using text mining algorithms. Access to scientific papers is granted to visitors subject to the copyright and restrictions imposed by publishers. This NHECD result is intended for Nanotox scientists, regulators and the general public.
2. A set of structured results extracted from the scientific papers populating the NHECD repository. Using these results, it will be possible to perform data mining, which will yield validated results and further knowledge discovery. This part of the NHECD results is targeted at Nanotox scientists and regulators.
Fig 64.15 NHECD 2.0 (process stages: rating, information extraction, table extraction, graph mining, annotation, metadata, taxonomies).
64.5 Further research
Graph and table mining
NHECD resorts to text mining algorithms, allowing for information extraction from textual data. However, scientific Nanotox papers (as in many other areas) often include other types of elements, such as graphs and tables. Moreover, the expressiveness of these elements is generally higher than that conveyed by text. Hence, expanding NHECD to include graph and table mining seems desirable. Preliminary research on these subjects by the NHECD team shows that, at least for some types of graphs and tables, the task is feasible. The concept of the future NHECD (dubbed NHECD 2.0) is shown in Figure 64.15.