Machine learning and network methods for biology and medicine

A useful resource on machine learning and network methods in biology and medicine. The book is a collection of papers on applications of machine learning in medicine, comprising 18 applications in areas such as genetics, oncology, molecular biology, and laboratory testing. Reading it requires basic knowledge of machine learning. It is essential reading for IT professionals working in healthcare.


Computational and Mathematical Methods in Medicine

Machine Learning and Network Methods for Biology and Medicine

Guest Editors: Lei Chen, Tao Huang, Chuan Lu, Lin Lu, and Dandan Li


[...] distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Editorial Board

Emil Alexov, USA

Elena Amato, Italy

Konstantin G Arbeev, USA

Georgios Archontis, Cyprus

Paolo Bagnaresi, Italy

Enrique Berjano, Spain

Elia Biganzoli, Italy

Konstantin Blyuss, UK

Hans A Braun, Germany

Thomas S Buchanan, USA

Zoran Bursac, USA

Thierry Busso, France

Xueyuan Cao, USA

Carlos Castillo-Chavez, USA

Prem Chapagain, USA

Hsiu-Hsi Chen, Taiwan

Ming-Huei Chen, USA

Phoebe Chen, Australia

Wai-Ki Ching, Hong Kong

Nadia A Chuzhanova, UK

Maria Cordeiro, Portugal

Irena Cosic, Australia

Fabien Crauste, France

William Crum, UK

Getachew Dagne, USA

Qi Dai, China

Chuangyin Dang, Hong Kong

Justin Dauwels, Singapore

Didier Delignières, France

Jun Deng, USA

Thomas Desaive, Belgium

David Diller, USA

Michel Dojat, France

Irini Doytchinova, Bulgaria

Esmaeil Ebrahimie, Australia

Georges El Fakhri, USA

Issam El Naqa, USA

Angelo Facchiano, Italy

Luca Faes, Italy

Giancarlo Ferrigno, Italy

Marc Thilo Figge, Germany

Alfonso T García-Sosa, Estonia

Amit Gefen, Israel

Humberto González-Díaz, Spain

Igor I Goryanin, Japan

Marko Gosak, Slovenia

Damien Hall, Australia

Stavros J Hamodrakas, Greece

Volkhard Helms, Germany

Akimasa Hirata, Japan

Roberto Hornero, Spain

Tingjun Hou, China

Seiya Imoto, Japan

Sebastien Incerti, France

Abdul Salam Jarrah, UAE

Hsueh-Fen Juan, Taiwan

Rafik Karaman, Palestine

Lev Klebanov, Czech Republic

Andrzej Kloczkowski, USA

Xiang-Yin Kong, China

Zuofeng Li, USA

Chung-Min Liao, Taiwan

Quan Long, UK

Ezequiel López-Rubio, Spain

Reinoud Maex, France

Valeri Makarov, Spain

Kostas Marias, Greece

Richard J Maude, Thailand

Panagiotis Mavroidis, USA

Georgia Melagraki, Greece

Michele Migliore, Italy

John Mitchell, UK

Chee M Ng, USA

Michele Nichelatti, Italy

Ernst Niebur, USA

Kazuhisa Nishizawa, Japan

Hugo Palmans, UK

Francesco Pappalardo, Italy

Matjaz Perc, Slovenia

Edward J Perkins, USA

Jesús Picó, Spain

Alberto Policriti, Italy

Giuseppe Pontrelli, Italy

Christopher Pretty, New Zealand

Mihai V Putz, Romania

Ravi Radhakrishnan, USA

David G Regan, Australia

José J Rieta, Spain

Jan Rychtar, USA

Moisés Santillán, Mexico

Vinod Scaria, India

Jörg Schaber, Germany

Xu Shen, China

Simon A Sherman, USA

Pengcheng Shi, USA

Tieliu Shi, China

Erik A Siegbahn, Sweden

Sivabal Sivaloganathan, Canada

Dong Song, USA

Xinyuan Song, Hong Kong

Emiliano Spezi, UK

Greg M Thurber, USA

Tianhai Tian, Australia

Jerzy Tiuryn, Poland

Nestor V Torres, Spain

Nelson J Trujillo-Barreto, UK

Anna Tsantili-Kakoulidou, Greece

Po-Hsiang Tsui, Taiwan

Gabriel Turinici, France

Edelmira Valero, Spain

Raoul van Loon, UK

Luigi Vitagliano, Italy

Liangjiang Wang, USA

Ruiqi Wang, China

Ruisheng Wang, USA

David A Winkler, Australia

Gabriel Wittum, Germany

Yu Xue, China

Yongqing Yang, China

Chen Yanover, Israel

Xiaojun Yao, China

Kaan Yetilmezsoy, Turkey

Hujun Yin, UK

Hiro Yoshida, USA

Henggui Zhang, UK

Yuhai Zhao, China

Xiaoqi Zheng, China

Yunping Zhu, China


Machine Learning and Network Methods for Biology and Medicine, Lei Chen, Tao Huang, Chuan Lu, Lin Lu, and Dandan Li
Volume 2015, Article ID 915124, 2 pages

Detection of Dendritic Spines Using Wavelet-Based Conditional Symmetric Analysis and Regularized Morphological Shared-Weight Neural Networks, Shuihua Wang, Mengmeng Chen, Yang Li, Yudong Zhang, Liangxiu Han, Jane Wu, and Sidan Du

An Overview of Biomolecular Event Extraction from Scientific Documents, Jorge A Vanegas, Sérgio Matos, Fabio González, and José L Oliveira

NMFBFS: A NMF-Based Feature Selection Method in Identifying Pivotal Clinical Symptoms of Hepatocellular Carcinoma, Zhiwei Ji, Guanmin Meng, Deshuang Huang, Xiaoqiang Yue, and Bing Wang
Volume 2015, Article ID 846942, 12 pages

Comparative Transcriptomes and EVO-DEVO Studies Depending on Next Generation Sequencing, Tiancheng Liu, Lin Yu, Lei Liu, Hong Li, and Yixue Li

ROC-Boosting: A Feature Selection Method for Health Identification Using Tongue Image, Yan Cui, Shizhong Liao, and Hongwu Wang

A Five-Gene Signature Predicts Prognosis in Patients with Kidney Renal Clear Cell Carcinoma, Yueping Zhan, Wenna Guo, Ying Zhang, Qiang Wang, Xin-jian Xu, and Liucun Zhu

Survey of Natural Language Processing Techniques in Bioinformatics, Zhiqiang Zeng, Hua Shi, Yun Wu, and Zhiling Hong

A Systematic Evaluation of Feature Selection and Classification Algorithms Using Simulated and Real miRNA Sequencing Data, Sheng Yang, Li Guo, Fang Shao, Yang Zhao, and Feng Chen

Identification of Chemical Toxicity Using Ontology Information of Chemicals, Zhanpeng Jiang, Rui Xu, and Changchun Dong

An Improved PID Algorithm Based on Insulin-on-Board Estimate for Blood Glucose Control with Type 1 Diabetes, Ruiqiang Hu and Chengwei Li

G2LC: Resources Autoscaling for Real Time Bioinformatics Applications in IaaS, Rongdong Hu, Guangming Liu, Jingfei Jiang, and Lixin Wang

Identifying New Candidate Genes and Chemicals Related to Prostate Cancer Using a Hybrid Network and Shortest Path Approach, Fei Yuan, You Zhou, Meng Wang, Jing Yang, Kai Wu, Changhong Lu, Xiangyin Kong, and Yu-Dong Cai

Identifying Novel Candidate Genes Related to Apoptosis from a Protein-Protein Interaction Network, Baoman Wang, Fei Yuan, Xiangyin Kong, Lan-Dian Hu, and Yu-Dong Cai

Cell Pluripotency Levels Associated with Imprinted Genes in Human, Liyun Yuan, Xiaoyan Tang, Binyan Zhang, and Guohui Ding

A Model of Regularization Parameter Determination in Low-Dose X-Ray CT Reconstruction Based on Dictionary Learning, Cheng Zhang, Tao Zhang, Jian Zheng, Ming Li, Yanfei Lu, Jiali You, and Yihui Guan
Volume 2015, Article ID 831790, 12 pages

Multivariate Radiological-Based Models for the Prediction of Future Knee Pain: Data from the OAI, Jorge I Galván-Tejada, José M Celaya-Padilla, Victor Treviño, and José G Tamez-Peña

Nonsynonymous Single-Nucleotide Variations on Some Posttranslational Modifications of Human Proteins and the Association with Diseases, Bo Sun, Menghuan Zhang, Peng Cui, Hong Li, Jia Jia, Yixue Li, and Lu Xie

KIR Genes and Patterns Given by the A Priori Algorithm: Immunity for Haematological Malignancies, J Gilberto Rodríguez-Escobedo, Christian A García-Sepúlveda, and Juan C Cuevas-Tello


Machine Learning and Network Methods for Biology and Medicine

1 College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China

2 Department of Genetics and Genomics Sciences, Mount Sinai School of Medicine, New York, NY 10029, USA

3 Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China

4 Department of Computer Science, Aberystwyth University, Aberystwyth, Ceredigion SY23 3DB, UK

5 Department of Radiology, Columbia University Medical Center, New York, NY 10032, USA

6 Gastrointestinal Medical Department, China-Japan Union Hospital of Jilin University, Changchun 130033, China

Correspondence should be addressed to Lei Chen; chen_lei1@163.com

Received 12 October 2015; Accepted 12 October 2015

Copyright © 2015 Lei Chen et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

In recent years, many computational methods have been proposed to tackle the problems that arise in analyzing various large-scale, high-dimensional data in biology and medicine. Useful techniques have been developed by the use of conventional statistical modeling and analysis and have helped to reveal many biological mechanisms. However, with the rapid development of high-throughput technologies, biological and medical data generated nowadays are becoming increasingly heterogeneous and complex. It is therefore necessary to develop more effective and efficient approaches to analyzing such data, requiring more powerful methods like advanced machine learning algorithms and network-based methods.

In this special issue, eighteen novel investigations are presented, including a number of newly proposed techniques for up-to-date data analysis and application systems for interesting biological and medical problems.

A computational method was proposed by B. Wang et al. to identify novel candidate genes related to apoptosis. This method first applied a shortest path algorithm in a large protein-protein interaction network to search for new candidate genes, and then the candidate genes were filtered by a permutation test. Twenty-six genes were obtained and analyzed regarding their likelihood of being novel apoptosis-related genes.
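The shortest-path search described above can be illustrated with a minimal sketch. It assumes an unweighted toy network and uses breadth-first search; the gene names, the `toy_network` graph, and the function names are hypothetical, the actual network and any edge weights used in the paper are not given here, and the permutation-test filter is not implemented.

```python
from collections import deque
from itertools import combinations


def shortest_path(graph, source, target):
    """Return one shortest path (list of nodes) in an unweighted graph, or None."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk back through the parent links to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in graph.get(node, ()):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


def candidate_genes(graph, known_genes):
    """Collect interior nodes of shortest paths between every pair of known genes."""
    known = set(known_genes)
    candidates = set()
    for a, b in combinations(sorted(known), 2):
        path = shortest_path(graph, a, b)
        if path:
            candidates.update(n for n in path[1:-1] if n not in known)
    return candidates


# Hypothetical toy network (undirected, stored as adjacency lists).
toy_network = {
    "G1": ["X"],
    "X": ["G1", "G2", "Y"],
    "G2": ["X"],
    "Y": ["X", "G3"],
    "G3": ["Y"],
}
print(sorted(candidate_genes(toy_network, ["G1", "G2", "G3"])))
```

In this toy case the nodes `X` and `Y` lie on shortest paths between the known genes and would be passed on to a later filtering step.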

F. Yuan et al. proposed a computational method to identify new candidate genes and chemicals based on currently known genes and chemicals related to prostate cancer by applying a shortest path approach in a hybrid network, which was constructed according to information concerning chemical-chemical interactions, chemical-protein interactions, and protein-protein interactions.

B. Sun et al. designed an analysis pipeline to study the relationships between eight types of damaging protein posttranslational modifications (PTM) and a few human inherited diseases and cancers. The results suggested that some human inherited diseases or cancers might be related to the interactions of damaging PTMs.

Y. Zhan et al. identified a five-gene signature that predicts prognosis in patients with kidney renal clear cell carcinoma (KIRC). The RNA expression data from RNA-sequencing and clinical information of 523 KIRC patients were analyzed. The AUC (area under ROC curve) of the five-gene signature was 0.783, which showed high sensitivity and specificity.

Z. Ji et al. developed a Nonnegative Matrix Factorization (NMF) based feature selection approach (NMFBFS) to identify potential clinical symptoms for HCC patient stratification. The results on 407 HCC patient samples with 57 symptoms showed the effectiveness of the NMFBFS approach in identifying important clinical features, which will be very helpful for HCC diagnosis.

http://dx.doi.org/10.1155/2015/915124


C. Zhang et al. proposed adaptive weight regularized ADSIR for low-dose CT reconstruction. Three numerical experiments are carried out for evaluation, and comparisons are made with other algorithms.

J. I. Galván-Tejada et al. presented the potential of X-ray based multivariate prognostic models to predict the onset of chronic knee pain. Using quantitative X-ray image assessments, multivariate models may be used to predict subjects that are at risk of developing knee pain by osteoarthritis.

Y. Cui et al. developed a method called ROC-Boosting to select significant Haar-like features extracted from tongue images for health identification. They analyzed the images of 1,322 tongue cases and selected features focused on the root, top, and side areas of the tongue which can classify the healthy and ill cases.

S. Wang et al. proposed a novel automatic approach for dendritic spine identification in neuron images. The method integrated wavelet-based conditional symmetric analysis and regularized morphological shared-weight neural networks. Its good performance and the comparison with existing methods suggest the utility of the method.

S. Yang et al. proposed the use of a combination of edgeR and DESeq to analyze miRNA sequencing data with a large sample size.

R. Hu et al. proposed an automated resource provisioning method, G2LC, for bioinformatics applications in IaaS. It guaranteed application performance and improved resource utilization. Evaluated on real sequence searching data of BLAST, G2LC saved up to 20.14% of resources.

R. Hu and C. Li proposed an improved PID algorithm based on an insulin-on-board estimate using a combinational mathematical model of the dynamics of blood glucose-insulin regulation in the blood system. The simulation results demonstrated that the improved PID algorithm can perform well under different carbohydrate ingestion and different insulin sensitivity situations. Compared with the traditional PID algorithm, the control performance was clearly improved and hypoglycemia can be avoided.

J. G. Rodriguez-Escobedo et al. described the use of the “a priori” algorithm in resolving KIR gene patterns associated with haematological malignancies, previously unrevealed through traditional statistical approaches.

Z. Jiang et al. built a new method to predict chemical toxicities based on ontology information of chemicals. This method was more effective than the previous method and provided new insights for studying chemical toxicity and other attributes of chemicals.

L. Yuan et al. explored the hidden relationship between miRNAs and imprinted genes in cell pluripotency. They found that the neighbors of imprinted genes on the molecular network were enriched in modules such as cancer, cell death and survival, and tumor morphology. The imprinted region may provide a new perspective for those who are interested in cell pluripotency of hiPSCs and hESCs.

T. Liu et al. reviewed the recent discoveries and advances in the field of evolutionary developmental biology in light of the development of large-scale omics studies.

J. A. Vanegas et al. presented a survey on the state-of-the-art text mining approaches to extraction of biomolecular events, which are useful for understanding the underlying biological mechanisms. The popular natural language processing and machine learning methods and tools have been analyzed for this task, whose phases vary from feature extraction and trigger/edge detection to postprocessing.

Z. Zeng et al. surveyed natural language processing techniques in bioinformatics. First, they searched for knowledge on biology and retrieved references using text mining methods and reconstructed databases. Then, they analyzed the applications of text mining and natural language processing techniques in bioinformatics. Finally, numerous methods and applications are discussed for future use by text mining and natural language processing researchers.

In summary, this special issue collects a number of innovative studies that address various challenging issues in analyzing data in biology and medicine. We hope that this publication will become a landmark in the international development of the relevant literature and will also help encourage more researchers and practitioners to engage in this increasingly important field.

Lei Chen
Tao Huang
Chuan Lu
Lin Lu
Dandan Li


Research Article

Detection of Dendritic Spines Using Wavelet-Based Conditional Symmetric Analysis and Regularized Morphological Shared-Weight Neural Networks

1 Department of Electronic Engineering, Nanjing University, Nanjing 210024, China

2 School of Computer Science and Technology, Nanjing Normal University, Nanjing 210023, China

3 State Key Laboratory of Brain and Cognitive Science, Institute of Biophysics, Chinese Academy of Sciences, Beijing 100101, China

4 Department of Neurology, Lurie Cancer Center, Center for Genetic Medicine, Northwestern University School of Medicine,

Chicago, IL 60611, USA

5 University of Chinese Academy of Sciences, Beijing 100101, China

6 Translational Imaging Division, Columbia University, New York, NY 10032, USA

7 School of Computing, Mathematics and Digital Technology, Manchester Metropolitan University, Manchester M1 5GD, UK

Correspondence should be addressed to Sidan Du; coff128@nju.edu.cn

Received 17 June 2015; Revised 2 September 2015; Accepted 27 September 2015

Academic Editor: Valeri Makarov

Copyright © 2015 Shuihua Wang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Identification and detection of dendritic spines in neuron images are of high interest in the diagnosis and treatment of neurological and psychiatric disorders (e.g., Alzheimer’s disease, Parkinson’s disease, and autism). In this paper, we have proposed a novel automatic approach using wavelet-based conditional symmetric analysis and regularized morphological shared-weight neural networks (RMSNN) for dendritic spine identification involving the following steps: backbone extraction, localization of dendritic spines, and classification. First, a new algorithm based on wavelet transform and conditional symmetric analysis has been developed to extract the backbone and locate the dendrite boundary. Then, the RMSNN has been proposed to classify the spines into three predefined categories (mushroom, thin, and stubby). We have compared our proposed approach against the existing methods. The experimental results demonstrate that the proposed approach can accurately locate the dendrite and classify the spines into three categories with accuracies of 99.1% for “mushroom” spines, 97.6% for “stubby” spines, and 98.6% for “thin” spines.

1. Introduction

Dendritic spines are small “doorknob”-shaped extensions from a neuron’s dendrites, which can number in the thousands on a single neuron. Spines are typically classified into three types based on shape: mushroom, stubby, and thin. A “mushroom” spine has a bulbous head with a thin neck; a “stubby” spine has only a bulbous head; a “thin” spine has a long thin neck with a small head. Research has shown that changes in the shape, length, and size of dendritic spines are closely linked with neurological and psychiatric disorders, such as attention-deficit hyperactivity disorder (ADHD), autism, intellectual disability, Alzheimer’s disease, and Parkinson’s disease [1–5]. Therefore, morphological analysis and identification of the structure of dendritic spines are critical for diagnosis and further treatment of these diseases [6, 7].

http://dx.doi.org/10.1155/2015/454076

Traditional manual detection of dendritic spines is costly, time consuming, and prone to error due to human subjectivity. With the recent advances in biomedical imaging, computer-aided semiautomatic or automatic approaches to detecting dendritic spines based on image analysis have proven effective. The SynD method proposed by Schmitz et al. [8] is a semiautomatic image analysis routine to analyze dendrite and synapse characteristics in immunofluorescence images. For fluorescence imaging, the neurite and soma were captured in separate imaging channels. In that case, soma and synapse were detected without interference from the neurite [9–11] based on the channel information. However, this method cannot be extended to images in which the information is captured in the same channel. Therefore, many other methods were proposed to solve this problem, for instance, ImageJ [12], NeuronStudio [13], NeuronJ [14], and NeuronIQ [15].

However, these methods have some limitations. For example, NeuronIQ was designed for confocal multiphoton laser scanning. NeuronJ was used to trace dendrite growth on the condition that the dendrite was manually marked first. Koh et al. detected spines from stacks of image data obtained by laser scanning microscopy [16]. The algorithm first extracted the dendrite backbone, defined as the medial axis, and then geometric information was employed to detect the attached and detached spines according to the shape of each candidate spine region. Features including spine length, volume, density, and shape for static and time-lapse images of hippocampal pyramidal neurons were used as key points for the detection. The disadvantage of this method is that it might lose many spines during the detection because of the thresholding method used. To overcome this problem, Xu et al. proposed a new detection algorithm for the attached spines from the dendrites using two grassfire steps [17]: a global threshold was chosen to segment the image and then the medial axis transform (MAT) was applied to find the centerlines of the dendrites. Then some large spines (noncenterlines) were removed from the centerlines. After the backbone was extracted, two grassfire procedures were applied to separate the spine and dendrite. The results of the proposed method were similar to the results of the manual method.

Cheng et al. proposed a method using an adaptive threshold based on the local contrast to determine the foreground, containing the spine and dendrite, and detect attached and detached spines [18]. Fan et al. used the curvilinear structure detector to find the medial axis of the dendrite backbone and spines attached to the backbone [19]. To locate the boundary of the dendrite, an adaptive local binary fitting (aLBF) energy level set model was proposed for localization. Zhang et al. extracted the boundaries and the centerlines of the dendrite by estimating the second-order directional derivatives for both the dendritic backbones and spines [20]. Then a classifier based on Linear Discriminant Analysis (LDA) was built to classify the attached spines into true and false types. The accuracy of the algorithm was calculated according to the backbone length, spine number, spine length, and spine density. Janoos et al. used the medial geodesic to extract the centerlines of the dendritic backbone [21]. He et al. proposed a method based on NDE to classify the dendrite and spines [22]. The principle of their method was that the spine and dendrite had different shrink rates. Shi et al. proposed a wavelet-based supervised method for classifying 3D dendritic spines from neuron images.

(1) A new extraction model for the dendrite backbone and its boundary localization using wavelet-based conditional symmetric analysis and pixel intensity difference, which can allow accurate extraction of the backbone, the first important step for dendritic spines.

(2) A new way for spine detection based on regularized morphological shared-weight neural networks (RMSNN) to efficiently detect spines and classify them into the right categories, that is, mushroom, thin, and stubby.

The rest of this paper is organized as follows. Section 2 describes the proposed methods, including wavelet-based conditional symmetry analysis and pixel intensity difference for the dendrite detection and localization and regularized shared-weight neural networks for the spine detection. In Section 3, we have conducted experimental evaluation and demonstrated the effectiveness of the proposed algorithm. Section 4 discusses the results. Section 5 concludes the proposed approach and highlights the future work.

2. Methods

Figure 1 shows the steps of our proposed approach to dendritic spines. In the image acquisition phase, we demonstrated the process for the neuron culture, labeling, and imaging. In the second step, we preprocessed the images by reducing the noise and smoothing the background [24, 25]. Then, we extracted the dendrite backbone based on the conditional symmetric analysis and located the dendrite boundary based on the difference of the pixel intensity. Afterwards, the spines were detected, classified, and characterized by RMSNN.

Figure 1: Flowchart of the proposed detection method of the dendritic spines. Image acquisition phase: embryonic (E18) rat; primary cultured cortical neurons; transfection (22nd day) by Lipofectamine 2000; imaging (24th day) by Leica SP5 (CLSM) at 63x. Dendrite location phase: noise reduction and background smoothing; backbone extraction; boundary location. Spine analysis phase: spine extraction; spine classification; spine characterization.

2.1. Image Acquisition. The neurons used for imaging in this paper were cortical neurons, primary cultured from embryonic day 18 (E18) rat and then cultured until the 22nd day in vitro. Then, the neurons were transfected by Lipofectamine 2000 and imaged on the 24th day by Leica SP5 confocal laser scanning microscopy (CLSM) at 63x. The size of the image is 1024 × 1024, and the resolution is 0.24 μm/pixel at the confocal layer. The images used for the morphology analysis were obtained by the maximum intensity projection (MIP) of the original 3D image stack. As the images were captured as Z-stack series, we projected the 3D image stack onto the xy, yz, and zx planes, respectively. Since the slices along the optical direction (z) provided very limited information and the computation time based on the 3D image stacks is greatly increased, only the 2D projection onto the xy plane was considered. The 2D image used for analysis was a maximum intensity projection of the original 3D stack. It was obtained by projecting onto the xy plane the voxels with maximum intensity values that fall in the way of parallel rays traced from the viewpoint to the plane of projection.
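As an illustration of the maximum intensity projection described above, the following minimal sketch projects a small stack onto the xy plane by taking, for each (x, y) position, the largest value along z. The function name, the `stack[z][y][x]` indexing, and the toy data are assumptions made for this example; plain Python lists are used only to keep it dependency-free.

```python
def max_intensity_projection(stack):
    """Project a 3D stack indexed as stack[z][y][x] onto the xy plane."""
    depth = len(stack)
    height = len(stack[0])
    width = len(stack[0][0])
    # For every (y, x) position keep the brightest voxel along the z axis.
    return [
        [max(stack[z][y][x] for z in range(depth)) for x in range(width)]
        for y in range(height)
    ]


# Hypothetical 2-slice stack of 2 x 2 images.
toy_stack = [
    [[0, 5], [2, 1]],
    [[7, 3], [0, 4]],
]
projection = max_intensity_projection(toy_stack)
print(projection)
```

With an array library the same operation is a single maximum reduction over the z axis.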

We randomly selected 15 different images acquired with the Leica SP5 confocal laser scanning microscope to form the spine library used to test our algorithm. All images contain distinct spines of the mushroom, stubby, and thin types. The typical size of an image is 1024 × 1024. Most spines in the images fit within a rectangle of 20 × 20 pixels, but a “thin” spine fits within a rectangle of about 5 × 20 pixels. The spines have variable gray-level intensities. Spines collected from the image library were employed to build an image base library. Spine subimages in the library were taken as samples to test the classification accuracy of RMSNN. In order to cover as many cases as possible, the image base library contains spines of distinct sizes and different orientations.

In order to build the gold-standard spine library, five experts in the neuroscience field were employed to manually mark the spines in the collected images and classify the spines into three predefined categories: “mushroom,” “stubby,” and “thin.” Where the manual markings conflicted, the majority label was adopted. Then, according to the marked spines, we computed the maximum width, length, area, and the center point. The randomly selected image base library contains about 2700 subimage samples, 900 for each type of spine. Figure 2 shows some image samples in our image base library. As we can see from the image samples, spines of the “mushroom” type contain a thin neck and a head, the stubby type connects directly with the dendrite without a neck, and the thin type is the smallest, with only a thin neck and no head.
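The majority rule used to reconcile the five experts' labels can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name is hypothetical, and returning `None` when the top two labels tie is an assumption, since the text does not say how ties were handled.

```python
from collections import Counter


def majority_label(expert_labels):
    """Return the most frequent label, or None if the top two labels tie."""
    if not expert_labels:
        raise ValueError("at least one label is required")
    ranked = Counter(expert_labels).most_common()
    # A tie for first place means there is no clear majority (assumed policy).
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


print(majority_label(["mushroom", "mushroom", "thin", "stubby", "mushroom"]))
print(majority_label(["thin", "thin", "stubby", "stubby", "mushroom"]))
```

With five experts and three categories, a 2-2-1 split is possible, which is why the sketch makes the tie case explicit.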

Figure 2: Samples of the subimages used in the image library: (a) mushroom; (b) stubby; (c) thin.

2.2. Image Preprocessing. Considering the limitations of the imaging technique, we employed a 2D median filter to deal with the noise introduced by the imaging mechanism of the photomultiplier tubes (PMT) and then used the partial differential equation (PDE) proposed by Wang et al. [26] to enhance the image. Figure 3 shows an example of the original image and the preprocessed result.
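A 2D median filter of the kind mentioned above can be sketched in a few lines. The 3 × 3 window and the edge handling (replicating border pixels) are assumptions chosen for illustration; the text does not state the window size, and the PDE-based enhancement step is not shown.

```python
from statistics import median


def median_filter_3x3(image):
    """Apply a 3 x 3 median filter to a 2D list, replicating border pixels."""
    height, width = len(image), len(image[0])

    def pixel(y, x):
        # Clamp coordinates so that windows at the border reuse edge pixels.
        y = min(max(y, 0), height - 1)
        x = min(max(x, 0), width - 1)
        return image[y][x]

    return [
        [
            median(pixel(y + dy, x + dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1))
            for x in range(width)
        ]
        for y in range(height)
    ]


# Hypothetical image: uniform background with one bright noise pixel.
noisy = [
    [10, 10, 10],
    [10, 255, 10],
    [10, 10, 10],
]
print(median_filter_3x3(noisy))
```

The single outlier is replaced by the background value, which is the behavior that makes a median filter suitable for impulse-like detector noise.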

2.3. Backbone Extraction Using the Wavelet Transformation Based Conditional Symmetric Analysis. Considering the attached spines, it is necessary to first locate the dendrites in order to segment the spines from the dendrite. The backbone extraction and boundary localization are critical for dendritic spine classification and analysis, which include the following steps.

Step 1. Remove the noise and small isolated point-set.

Step 2. Locate the backbone of the dendrite.

Step 3. Locate the boundary of the dendrite.

Figure 3: An example of a preprocessed image: (a) original image; (b) preprocessed image.

The backbone is defined as the thinning of the dendrite. Due to the variance in the width of the dendrite and the presence of attached and detached spines, it is a challenging task to locate the boundary of the dendrite directly from the preprocessed images. Therefore, we have developed a new extraction model utilizing wavelet transform based conditional symmetric analysis. The essence of this model is to conduct a local conditional symmetry analysis of the contour of the region of interest (ROI) and then compute the center points to produce the backbone of the dendrite.

Due to the complexity of the distribution of dendrites and dendritic spines, we have employed a morphological operation to remove the small isolated point-set for the dendrite in the binary image obtained by local Otsu thresholding [27–29] via (1), which could decrease the disconnection rate of the dendrite. In (1), n is the threshold on the number of positive pixels. The value of n could be determined by trial and error: a pixel belongs to the major line if there are more than n positive pixels in its 3 × 3 neighborhood window; otherwise, the value of the pixel is forced to 0 and it is treated as part of the small isolated point-set. The determination of the centerline of the dendrite is based on the conditional symmetric analysis.
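The neighborhood rule described above can be sketched as follows. This is a hedged illustration under stated assumptions rather than the authors' equation (1): the function name is hypothetical, the count is assumed to include the center pixel, only pixels that are already positive are kept, and the threshold `n` is chosen by hand.

```python
def remove_isolated_points(binary, n):
    """Zero out positive pixels with at most n positive pixels in their 3 x 3 window."""
    height, width = len(binary), len(binary[0])
    cleaned = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Count positive pixels in the window, clipped at the image border.
            count = sum(
                binary[yy][xx]
                for yy in range(max(y - 1, 0), min(y + 2, height))
                for xx in range(max(x - 1, 0), min(x + 2, width))
            )
            if binary[y][x] and count > n:
                cleaned[y][x] = 1
    return cleaned


# Hypothetical binary image: one horizontal line plus an isolated pixel.
toy_binary = [
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
for row in remove_isolated_points(toy_binary, n=1):
    print(row)
```

With `n=1` the isolated pixel in the top-right corner is removed while the connected line is preserved.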

The symmetric analysis was accomplished via the wavelet transform. We have applied the wavelet transform to detect a pair of contour curves, in which $x$ and $y$ stand for the coordinates of the contour curve, $\varphi_x(x, y)$ and $\varphi_y(x, y)$ denote the partial derivatives with respect to $x$ and $y$, respectively, and $\theta(x, y)$ is a low-pass filter.

For $\varphi_x(x, y)$ and $\varphi_y(x, y)$, the scale wavelet transform (WT) could be written as the following equations:

We selected (7) as the basis function. We set $\varphi_-(x) = -\varphi_+(-x)$ and had $\varphi(x) = \varphi_+(x) + \varphi_-(x)$ as the wavelet function, which had the following properties: gray invariant, slope invariant, width invariant, and symmetric [29, 30]. The advantage is that it allows the extraction of a pair of contours with accurate protrusions. Consider

$$\cdots\; 2x\left(\sqrt{1 - 16x^2} - 3\sqrt{9 - 16x^2} + 8\sqrt{1 - x^2}\right)\Bigr), \quad x \in \left(0, \tfrac{1}{4}\right), \;\cdots \tag{7}$$

The distance between two symmetric points is equal to the scale of the wavelet transform. If the distance between two symmetric points is larger than or equal to the width of the regular region, the center point of the symmetric pair can potentially be located outside of the dendrite. The regular region is defined as the region where the dendrite is smooth, that is, where the function has a stable variation along the axis. Thus, we defined the stable symmetry as follows.

If the scale of the wavelet transform is larger than or equal to the width of the regular region, the modulus maxima points generate two new parallel contours inside the periphery of the dendrite. All the symmetric pairs of the wavelet transforms that do not have a counterpart are defined as the unstable symmetry. In this case, we have considered the width as the constraint condition. In the direction perpendicular to the gradient direction, we selected the width nearest to the regular region.

The center of every symmetric pair is located on the centerline of the original regular region of the stroke point. Finally, the backbone of the regular region was defined by the curve of all connected symmetric points.

2.4 Boundary Location Based on the Pixel Intensity Difference.

The morphological operation used for noise removal blurred the boundary. Therefore, after localization of the backbone, the boundary of the dendrite was detected from the variation of the pixel intensity in the preprocessed image from Section 2.2. We can observe that the pixel intensity of the line pixels changes abruptly at the boundary locations. The boundary location was performed in two steps. In the first step, we searched the image along the two directions perpendicular to the local line direction until the pixel intensity of the line pixel changed sharply. We set a threshold for each pixel. The local line direction is determined as

$$A_s f(x, y) = \arctan\left(\frac{W_{y,s} f(x, y)}{W_{x,s} f(x, y)}\right). \tag{8}$$

Each pixel is described by the pair $(\alpha, I(p))$, in which $I(p)$ is the pixel intensity of point $p$ in the original image and $\alpha$ is a predefined pixel intensity value, that is,

$$\begin{cases} I(p) \geq \alpha, & p \text{ belongs to the line pixels}, \\ I(p) < \alpha, & p \text{ does not belong to the line pixels}. \end{cases} \tag{9}$$
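As a rough illustration of (8) and (9), the sketch below computes a local direction from two response maps and labels line pixels by thresholding. The arrays `wx` and `wy` merely stand in for the two wavelet responses, and the function names, the toy data, and the use of `numpy.arctan2` (chosen because it tolerates a zero denominator and therefore returns angles over a wider range than a plain arctangent) are assumptions made for illustration rather than the authors' implementation.

```python
import numpy as np

def local_line_direction(wx, wy):
    """Angle of the response vector at each pixel, in radians (cf. (8)).

    arctan2 is used instead of arctan(wy / wx) so that wx == 0 is handled;
    this is an illustrative choice, not taken from the paper.
    """
    return np.arctan2(wy, wx)

def line_pixel_mask(image, alpha):
    """True where I(p) >= alpha, i.e. p is treated as a line pixel (cf. (9))."""
    return np.asarray(image) >= alpha

# Toy data: hypothetical responses and a hypothetical 3x3 intensity patch.
wx = np.array([[1.0, 0.0], [1.0, -1.0]])
wy = np.array([[0.0, 1.0], [1.0, 0.0]])
angles = local_line_direction(wx, wy)

patch = np.array([[10, 200, 12], [11, 210, 9], [8, 190, 14]])
mask = line_pixel_mask(patch, alpha=100)
```

In this toy patch only the bright middle column passes the threshold, which mimics a vertical line segment.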

In the second step, some boundary points that were not on the searching path could be missed. The missed boundary points were detected from the neighboring boundary points. Given two known boundary points, if they are adjacent, there are no other boundary points between them; otherwise, the method proposed by Tang and You [31] was used to find the missed points, which links the two points into a discrete line with one point as the starting point and the other as the ending point.

There are several advantages of our proposed algorithms for backbone detection and boundary location. (1) The first are computing efficiency and noise reduction: our approach uses less computing time than the method based on the derivatives of the Gaussian kernel and is more robust to noise. (2) It also reduces the error rate for misclassifying spine pixels as dendrite pixels and sharply reduces the disconnection rate, which means our approach is more robust to disturbance information than other methods, such as NDE proposed by He et al. [22].

2.5 Spine Detection Based on Regularized Morphological Shared-Weight Neural Network (RMSNN).

Considering the dendritic spine’s structure, we employed regularized morphological shared-weight neural networks for the detection and classification of spines. The RMSNN consists of two heterogeneous neural network phases in series, as shown in Figure 4: the first phase performs feature extraction and the second phase performs classification. Feature extraction is accomplished via the gray-scale Hit-Miss transform. The feature extraction phase has multiple feature extraction layers. Each layer is composed of one or more feature maps. Each feature map is generated by the Hit-Miss transform with a pair of structure elements (SEs) from the previous layer and is accompanied by a new pair of SEs, of which one is for the erosion and the other is for the dilation. The classification stage is a fully connected Feedforward Neural Network (FNN) [32–34]. The input of the FNN is the direct output of the feature extraction stage. The output of the classification stage is a three-node layer, in which each node stands for one type of spine. Figure 4 shows the structure of the morphological shared-weight neural network (MSNN) [35]. The MSNN has been widely applied to imagery including laser radar (LADAR), forward-looking infrared (FLIR), synthetic aperture radar, and visual spectrum images. Existing research demonstrates that the MSNN is robust for detection under rotation, image intensity translation, and occlusion [36]. In this paper, we propose to apply the regularized morphological shared-weight neural network to spine classification.

Dilation is defined as

$$A \oplus B = \{x \mid (\hat{B})_x \cap A \neq \emptyset\}, \tag{10}$$

in which $A$ and $B$ are sets in $Z^2$, $\hat{B}$ is the reflection of $B$, and $\emptyset$ is the empty set. Equation (10) is termed the dilation of $A$ by the SE $B$. Dilation reflects $B$ about its origin and then translates it by $x$; the result is the set of all $x$ for which $\hat{B}$ intersects $A$ in at least one element.

Erosion is defined as (11) or, by the duality of the erosion-dilation relationship, as (12):

$$A \ominus B = \{x \mid (B)_x \subseteq A\}, \tag{11}$$

$$A \ominus B = (A^c \oplus \hat{B})^c, \tag{12}$$

in which $A^c$ is defined as the complement of $A$.
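The set definitions (10) and (11) and the duality (12) can be checked on a small example. The following is a minimal sketch that represents binary images as Python sets of integer coordinates; the helper names, the toy square, and the one-pixel padding used to form complements on a finite grid are assumptions made for illustration.

```python
from itertools import product

def translate(B, x):
    """The set B shifted by the vector x."""
    return {(x[0] + b[0], x[1] + b[1]) for b in B}

def reflect(B):
    """Reflection of B about its origin."""
    return {(-b[0], -b[1]) for b in B}

def dilate(A, B, domain):
    """Eq. (10): all x in domain where the reflected, translated SE hits A."""
    Bhat = reflect(B)
    return {x for x in domain if translate(Bhat, x) & A}

def erode(A, B, domain):
    """Eq. (11): all x in domain where the translated SE lies inside A."""
    return {x for x in domain if translate(B, x) <= A}

def grid(lo, hi):
    """All integer points (r, c) with lo <= r, c < hi."""
    return set(product(range(lo, hi), repeat=2))

domain = grid(0, 7)
padded = grid(-1, 8)  # one-pixel margin so complements cover every translated SE
A = {(r, c) for r in range(2, 5) for c in range(2, 5)}   # 3x3 square
B = {(0, 0), (0, 1), (1, 0)}                             # small asymmetric SE

eroded = erode(A, B, domain)
dual = domain - dilate(padded - A, reflect(B), domain)   # right-hand side of (12)
```

On this example `eroded` and `dual` coincide, as (12) requires. The padding is needed only because the complement of a finite image is taken relative to a finite window.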

The Hit-Miss transform is defined as an operation that detects a given pattern in a binary image based on a pair of disjoint structure elements, one for Hit and the other for Miss. The result of the Hit-Miss transform is the set of positions where the first SE fits in the foreground of the input image and the second SE misses it completely:

$$A \otimes B = (A \ominus X) \cap (A^c \ominus (W - X)), \tag{13}$$

in which $X$ is an SE formed from the set $B$, $W$ is an enclosing window of $X$, and $(W - X)$ is the local background of $X$. By taking $X$ as $H$, the Hit SE, and $(W - X)$ as $M$, the Miss SE, we can get

$$U(f) = \{(x, y, z) \mid (x, y) \in D_f,\ z \leq f(x, y)\}, \tag{16}$$

where we take $D_f$ as the domain of $f$. Then the gray-scale dilation can be defined as

$$(f \oplus b)(s, t) = \max\{f(s - x, t - y) + b(x, y) \mid (s - x), (t - y) \in D_f;\ (x, y) \in D_b\}. \tag{17}$$

Meanwhile, erosion is defined as

$$(f \ominus b)(s, t) = \min\{f(s + x, t + y) - b(x, y) \mid (s + x), (t + y) \in D_f;\ (x, y) \in D_b\}. \tag{18}$$

The gray-scale erosion measures the minimum gap between the image values $f$ and the translated SE values over the domain of $x$. The gray-scale dilation is the dual of the erosion and indirectly measures how well the SEs fit above $f$. The Hit-Miss transform measures how a shape $h$ fits under $f$ using erosion and how a shape $m$ fits above $f$ via dilation. A high value of the Hit-Miss transform means a good fit. The gray-scale Hit-Miss transform is independent of shifting in gray scale.
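A direct, unoptimized transcription of (17) and (18) may help to make the index conventions concrete. In the sketch below an SE is stored as a dictionary from offsets to values, which is a representation chosen here for brevity. The `hit_miss` function uses one common formulation, the erosion by $h$ minus the dilation by $m^*$ with $m^*(x) = -m(-x)$; that form is an assumption, since the corresponding equations are not reproduced in this text.

```python
import numpy as np

def gray_dilate(f, b):
    """Eq. (17): max of f(s - x, t - y) + b(x, y) over offsets (x, y) in D_b."""
    rows, cols = f.shape
    out = np.full(f.shape, -np.inf)
    for s in range(rows):
        for t in range(cols):
            for (x, y), bv in b.items():
                u, v = s - x, t - y
                if 0 <= u < rows and 0 <= v < cols:  # keep (u, v) inside D_f
                    out[s, t] = max(out[s, t], f[u, v] + bv)
    return out

def gray_erode(f, b):
    """Eq. (18): min of f(s + x, t + y) - b(x, y) over offsets (x, y) in D_b."""
    rows, cols = f.shape
    out = np.full(f.shape, np.inf)
    for s in range(rows):
        for t in range(cols):
            for (x, y), bv in b.items():
                u, v = s + x, t + y
                if 0 <= u < rows and 0 <= v < cols:  # keep (u, v) inside D_f
                    out[s, t] = min(out[s, t], f[u, v] - bv)
    return out

def hit_miss(f, h, m):
    """Assumed form: (f eroded by h) minus (f dilated by m*), m*(x) = -m(-x)."""
    m_star = {(-x, -y): -v for (x, y), v in m.items()}
    return gray_erode(f, h) - gray_dilate(f, m_star)

# Toy image and SEs; each SE contains the offset (0, 0) so that every
# pixel has at least one valid term inside the image domain.
f = np.array([[1.0, 5.0, 2.0],
              [4.0, 0.0, 6.0],
              [7.0, 8.0, 3.0]])
b = {(0, 0): 0.0, (0, 1): 1.0}
h = {(0, 0): 0.0}
m = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 0.0}

response = hit_miss(f, h, m)
shifted = hit_miss(f + 5.0, h, m)  # same image with a constant gray-level offset
```

Because both terms of `hit_miss` move by the same constant when the image is shifted in gray level, `response` and `shifted` agree, which matches the shift independence stated above.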

2.5.1 The Feature Extraction Phase.

There are four elements associated with each layer of the feature extraction phase: feature maps, input, and two structure elements. In the first layer, the subimage is used as input, and the last layer’s output is the input of the classification stage. In each feature extraction layer, a pair of Hit-Miss SEs is shared among all the feature maps. These SEs serve as input weights for the feature map nodes in the feature extraction layer. Table 1 shows the input and output parameters related to the feature extraction phase.

According to the above parameters, we can define the Hit-Miss transform as follows:

$$\text{net}^h_y = \min_{x \in D_{t_y}} \{a(x) - t^h_y(x)\}, \qquad \text{net}^m_y = \max$$


Table 1: Parameters of the feature extraction phase.

Parameter    Definition
Input
𝑎(𝑥)         The input to node 𝑦 from node 𝑥
𝑡𝑦(𝑥)        Connections associating node 𝑦 with node 𝑥
𝑡ℎ𝑦(𝑥)       Hit SE associating node 𝑦 with node 𝑥
𝑦(𝑥)         Weight for Hit SE associating node 𝑦 with 𝑥

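for the Hit and Miss SE is derived based on gradient descent as

$$\Delta t^h_y = \eta \delta_y \frac{\partial\, \text{net}^h_y}{\partial t^h(x)}, \qquad \Delta t^m_y = -\eta \delta_y \frac{\partial\, \text{net}^m_y}{\partial t^m(x)}.$$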

Equation (21) is for the top-level or final extraction layer. $\delta_y$ for the lower layers of multiple-layer feature extraction is expressed as

$$\delta_y = \delta(y) = \sum_k \delta_k \left(\frac{\partial\, \text{net}^h_y}{\partial a(y)} - \frac{\partial\, \text{net}^m_y}{\partial a(y)}\right), \tag{22}$$

in which $k$ is a node in the layer next to node $y$.

Based on the back-propagation of error from the classification stage with these learning rules, the MSNN learns the optimized SE to extract the features by each set of Hit-Miss

$$w_{ji} O_i + \Delta_j, \tag{24}$$

in which $w_{ji}$ is the connection weight strength to node $j$ from node $i$ and $\Delta_j$ is the bias output for node $j$. $w_{ji}$ is typically learned by the back-propagation of error. The update rule of the connecting weight for each connection is expressed as

$$\delta_j = f'(\text{net}_j) \sum_k \delta_k w_{ji}. \tag{27}$$

2.5.2 The Classification Phase.

The classification phase takes the output directly from the last feature extraction layer as its input. The parameters used for the classification phase are predefined in the feature extraction phase. There are three output nodes for the classification stage of our algorithm, indicating which type of spine the subimage contains.

2.5.3 Acceleration of the MSNN Based on the Regularization.

In order to accelerate the learning rate and decrease the learning epochs, we employed the regularization factor. Regularization is used to reduce near-zero connection weight values to zero, therefore reducing the complexity of the network. It is defined as

For the training procedure, the RMSNN takes the subimage as the input and produces one output value for each image. For the testing procedure, our proposed algorithm scans the whole ROI and generates an image named the detection plane, which is based on the outputs from the target class nodes.

3 Experimental Evaluation

3.1 Experiment Design.

We trained the neural networks with the back-propagation algorithm. The subimages were submitted to the input nodes of the neural network. The error of the output was propagated through all the connections. The process repeated until the network converged to a stable state with the required MSE: when the MSE approached a preset value or the maximum epoch was reached, the training stopped. During the training, the RMSNN took each subimage as the input and produced one output value for each of the three categories. Figure 2(a) shows samples of subimages containing mushroom type spines, Figure 2(b) shows samples of subimages containing the stubby type, and Figure 2(c) shows samples of thin type subimages.

In the training step, the subimage samples were input to the network sequentially. The mean-squared error (MSE) was employed to measure the training effectiveness. For each subimage, the RMSNN produced one output value, which indicated the type of spine in the subimage. Then, we scanned the entire microscopy image and finally generated a detection plane according to the output nodes of the RMSNN.
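The scanning step can be pictured as a sliding window that records, at each position, the score of one target class. The sketch below is only a schematic of that idea: `toy_classifier` is a placeholder that scores a window by its mean brightness and is not an RMSNN, and the function names, the stride parameter, and the toy ROI are assumptions introduced here.

```python
import numpy as np

def detection_plane(image, classify, window=20, target=0, stride=1):
    """Slide a window over the image and record the target-class score.

    `classify` is any callable mapping a (window, window) array to a
    length-3 score vector; it stands in for a trained network.
    """
    rows, cols = image.shape
    out_r = (rows - window) // stride + 1
    out_c = (cols - window) // stride + 1
    plane = np.zeros((out_r, out_c))
    for i in range(out_r):
        for j in range(out_c):
            r, c = i * stride, j * stride
            sub = image[r:r + window, c:c + window]
            plane[i, j] = classify(sub)[target]
    return plane

def toy_classifier(sub):
    """Placeholder: scores depend only on mean brightness (not a real RMSNN)."""
    m = float(sub.mean())
    return np.array([m, 1.0 - m, 0.0])

# Hypothetical 24x24 ROI containing one bright 20x20 block.
roi = np.zeros((24, 24))
roi[2:22, 2:22] = 1.0
plane = detection_plane(roi, toy_classifier, window=20)
peak = np.unravel_index(np.argmax(plane), plane.shape)
```

With a 20 by 20 window, the plane peaks at the offset where the window exactly covers the bright block. A real detector would replace `toy_classifier` with the trained network's forward pass.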

In order to test the classification accuracy, we randomly selected 900 samples for each type of spine. Following common convention and for ease of stratified cross validation, 10 × 10-fold stratified cross validation (CV) was used on the dataset to perform an unbiased statistical analysis. The RMSNN was constructed with two feature extraction layers, one hidden layer with ten hidden neurons, and one output layer with three neurons. The input subimage size was 20 by 20 pixels, and the structure elements had a radius of 4 pixels. The initial weights were in the range [−1.0, 1.0]. The learning rate was set to 0.0015. The maximum training epoch was predefined as 15000. The expected output values for mushroom, stubby, and thin type spines were [1 0 0], [0 1 0], and [0 0 1], respectively.
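The evaluation protocol can be sketched as follows, reading "10 × 10-fold" as ten repetitions of 10-fold stratified CV, which is the usual meaning but is an interpretation here. The label coding (0 for mushroom, 1 for stubby, 2 for thin), the helper names, and the random seed are assumptions; only the class sizes and the one-hot targets come from the text.

```python
import numpy as np

def one_hot(labels, n_classes=3):
    """Map integer labels to target rows such as [1 0 0], [0 1 0], [0 0 1]."""
    return np.eye(n_classes)[labels]

def stratified_folds(labels, k, rng):
    """Return k index arrays; each class is spread evenly over the folds."""
    folds = [[] for _ in range(k)]
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        for f, chunk in enumerate(np.array_split(idx, k)):
            folds[f].extend(chunk.tolist())
    return [np.array(sorted(f)) for f in folds]

def repeated_stratified_cv(labels, k=10, repeats=10, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV."""
    rng = np.random.default_rng(seed)
    for rep in range(repeats):
        folds = stratified_folds(labels, k, rng)
        for i, test_idx in enumerate(folds):
            train_idx = np.concatenate(
                [f for j, f in enumerate(folds) if j != i])
            yield rep, i, train_idx, test_idx

labels = np.repeat([0, 1, 2], 900)   # 900 samples per spine type
targets = one_hot(labels)
splits = list(repeated_stratified_cv(labels))
```

Each of the 100 splits holds out 270 samples, 90 per class, and trains on the remaining 2430. Accuracies would then be averaged over all splits.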

3.2 Experiment Results

3.2.1 Backbone Extraction.

The extraction result is shown in Figure 5. Figure 5(a) shows the original image. Figure 5(b) shows the extracted backbone, whose width is merely one pixel.

3.2.2 Boundary Location.

Figure 6(a) shows the located backbone of the dendrite marked on the original image, and Figure 6(b) shows the marked boundary of the dendrite after the backbone is extracted. Figure 6(c) shows the marked dendrite, which determines the starting point of the spine.

3.2.3 Spine Analysis.

Figure 7(a) shows a ROI of our sample image, and Figure 7(b) shows the detection result of the spines. The backbone is marked in purple and the boundary is marked in red. The spines are marked by their periphery in blue.

Figure 8(a) shows the original image with the marked region of interest. Figure 8(b) shows the classification result based on the features extracted in the first phase. The corresponding SE captures the respective features around each pixel, but it is not apparent to readers which features are obtained. The detected spines contain 8 mushroom types, 8 stubby types, and 4 thin types. The average classification accuracy of the RMSNN, based on the 2700 samples in total, is shown in Table 2. We find that the detection of the mushroom and thin types performs better than that of the stubby type. This is because the stubby type appears connected with the major lines, and the neck of the spine is blurred. Figures 8(c), 8(d), and 8(e) show some geometric attributes of the spines, including the area, perimeter, and width. We found that the areas of the spines of the ROI ranged within [10, 23] and the perimeters ranged within [8, 88].

Table 2: Average of the classification accuracy on a 10-by-10 CV

3.3 Optimal Parameter in SE.

According to [36], unsuitable SEs will degrade the performance of the RMSNN; hence, it is critical to choose proper SEs. Given the average spine size of 20 by 20 pixels, we selected SEs with different sizes and shapes to test the performance. The comparison of classification accuracies based on the 2700 samples is shown in Table 3. We find that the disk with a radius of 4 pixels reaches the best performance. Therefore, we finally defined the SEs as a disk with a radius of 4 pixels.

3.4 Algorithm Comparison.

To further validate the efficacy of our proposed approach, we compared the proposed algorithm with Cheng et al.’s method [18] and the manual method. In Cheng et al.’s paper, the authors employed the adaptive threshold to segment the image and Chen and Molloi’s algorithm [37] to extract the backbone and then used the local SNR for the detection of detached spines and the local spine morphology for the detection of attached spines. The comparison results based on ROI1 in Figure 8 and 15 images collected in our database are shown in Table 4. It is found from Figure 9 that Cheng et al.’s method missed some small protrusions whose number of pixels is more than 5. The number of detected spines is 19 via our algorithm, 13 by Cheng et al.’s method, and 20 via the manual method, as shown in Table 4. Cheng et al.’s method is robust at dealing with spines detached from the dendrite but weak at spines attached to the dendrite. However, the spines detached from the dendrite are caused by the deconvolution used to denoise the image. Our proposed algorithm overcomes the problem of detecting attached spines.

4 Discussion

In this paper, we have proposed new algorithms using conditional symmetric analysis and a regularized morphological shared-weight neural network to detect and analyze the dendrite and dendritic spines.

Figure 5 shows the backbone extraction result based on the conditional symmetry analysis. Compared to the second-order directional derivatives method in [14], our proposed algorithms reduced the computation time of linking the breaking points of the backbone.

Figure 6 shows the result of the marked backbone and the boundary of the dendrite, which is used to determine the starting point of the spines.

Table 2 shows the classification result of the different types of spines. The rows in Table 2 stand for the actual class and the columns stand for the predicted class. The “mushroom” type has an obvious head and thin neck. The “stubby” type lacks an obvious neck, and the “thin” type lacks an obvious head. In Table 2, the detection accuracy of

Table 3: Classification accuracy by different SEs (unit is in pixel, bold denotes the best, 𝑟 is radius, and 𝑤 is width).

Figure 5: Backbone extraction result. (a) Original image. (b) Extracted backbone.

Figure 6: Dendrite location results. (a) Centerline of the dendrite. (b) Boundary of the dendrite. (c) Dendrite.

Figure 7: (a) ROI of the original image. (b) Detection result of the spines.

Table 4: Detection result of ROI1 in Figure 8 and 15 images in our

Figure 8: Experiment result with corresponding parameters for characterization. (a) Original image. (b) Detection plane. (d) Histogram of the area distribution. (e) Histogram of the perimeter distribution.

that our algorithm has better performance than the other two methods for the images obtained by confocal laser scanning microscopy.

5 Conclusion

In this paper, we proposed a new automatic approach to accurately identify dendritic spines with different shapes. The novelty of this approach includes (1) a new model using wavelet-based conditional symmetry analysis for dendrite backbone extraction and localization, which is the first step towards identification of dendritic spines; (2) a new algorithm based on regularized morphological shared-weight neural networks for classification of spines into the right classes (i.e., mushroom, stubby, and thin), entitled “RMSNN.” This research was based on our collected microscopy images. We have applied our approach to an image library containing around 2700 subimage samples, 900 for each type of spine, and have compared the proposed method with the existing methods. The experimental results demonstrate that our algorithm outperforms existing methods with a significant improvement in accuracy in terms of classifying spines into the different spine categories. The classification accuracy is 99.1% for mushroom spines, 97.6% for stubby spines, and 98.6% for thin spines.

Figure 9: Detection result based on ALS and SRMSNN. (a) ALS [18]. (b) SRMSNN.

Future work will focus on further validating the robustness of the algorithms by collecting more samples and testing on different datasets. A user-friendly interface will also be built to improve usability. Meanwhile, we will focus on reducing the computation time while improving the classification accuracy based on the 3D image stacks. Other feature extraction tools (such as wavelet packet analysis [38], wavelet entropy [39], and 3D-DWT [40]) and other advanced classification tools [41, 42] will be tested. In addition, swarm intelligence methods will be used to find optimal parameters [43].

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This work was financially supported by the National Natural Science Foundation of China (no. 61271231).

References

[1] J. L. Krichmar, S. J. Nasuto, R. Scorcioni, S. D. Washington, and G. A. Ascoli, “Effects of dendritic morphology on CA3 pyramidal cell electrophysiology: a simulation study,” Brain Research, vol. 941, no. 1-2, pp. 11–28, 2002.

[2] D. Johnston and S. M.-S. Wu, Foundations of Cellular Neurophysiology, MIT Press, Cambridge, Mass, USA, 1995.

[3] Z. F. Mainen and T. J. Sejnowski, “Influence of dendritic structure on firing pattern in model neocortical neurons,” Nature, vol. 382, no. 6589, pp. 363–366, 1996.

[4] N. Keren, N. Peled, and A. Korngreen, “Constraining compartmental models using multiple voltage recordings and genetic algorithms,” Journal of Neurophysiology, vol. 94, no. 6, pp. 3730–

[6] K. M. Stiefel and T. J. Sejnowski, “Mapping function onto neuronal morphology,” Journal of Neurophysiology, vol. 98, no.

[9] T. M. Liu, G. Li, J. X. Nie et al., “An automated method for cell detection in zebrafish,” Neuroinformatics, vol. 6, no. 1, pp. 5–21, 2008.

[10] W. Yu, H. K. Lee, S. Hariharan, W. Bu, and S. Ahmed, “Evolving generalized voronoi diagrams for accurate cellular image segmentation,” Cytometry Part A, vol. 77, no. 4, pp. 379–386, 2010.

[11] M. K. Bashar, K. Komatsu, T. Fujimori, and T. J. Kobayashi, “Automatic extraction of nuclei centroids of mouse embryonic cells from fluorescence microscopy images,” PLoS ONE, vol. 7, no. 5, Article ID e35550, 2012.

[12] J. L. Martiel, A. Leal, L. Kurzawa et al., “Measurement of cell traction forces with ImageJ,” in Methods in Cell Biology, E. K. Paluch, Ed., vol. 125, chapter 15, pp. 269–287, Academic Press, 2015.

[13] D. L. Dickstein, A. Rodriguez, A. B. Rocher et al., “NeuronStudio: an automated quantitative software to assess changes in spine pathology in Alzheimer models,” Alzheimer’s & Dementia, vol. 6, no. 4, article S410, 2010.

[14] E. Meijering, M. Jacob, J.-C. F. Sarria, P. Steiner, H. Hirling, and M. Unser, “Design and validation of a tool for neurite tracing and analysis in fluorescence microscopy images,” Cytometry Part A, vol. 58, no. 2, pp. 167–176, 2004.

[15] J. Cheng, X. B. Zhou, B. L. Sabatini, and S. T. C. Wong, “NeuronIQ: a novel computational approach for automatic dendrite spines detection and analysis,” in Proceedings of the IEEE/NIH Life Science Systems and Applications Workshop (LISA ’07), pp. 168–171, IEEE, Bethesda, Md, USA, November 2007.


[16] I. Y. Y. Koh, W. B. Lindquist, K. Zito, E. A. Nimchinsky, and K. Svoboda, “An image analysis algorithm for dendritic spines,” Neural Computation, vol. 14, no. 6, pp. 1283–1310, 2002.

[17] X. Y. Xu, J. Cheng, R. M. Witt, B. L. Sabatini, and S. T. C. Wong, “A shape analysis method to detect dendritic spine in 3D optical microscopy image,” in Proceedings of the 3rd IEEE International Symposium on Biomedical Imaging: From Nano to Macro, pp. 554–557, Arlington, Va, USA, April 2006.

[18] J. Cheng, X. Zhou, E. Miller et al., “A novel computational approach for automatic dendrite spines detection in two-photon laser scan microscopy,” Journal of Neuroscience Methods, vol. 165, no. 1, pp. 122–134, 2007.

[19] J. Fan, X. Zhou, J. G. Dy, Y. Zhang, and S. T. C. Wong, “An automated pipeline for dendrite spine detection and tracking of 3D optical microscopy neuron images of in vivo mouse models,” Neuroinformatics, vol. 7, no. 2, pp. 113–130, 2009.

[20] Y. Zhang, X. B. Zhou, R. M. Witt, B. L. Sabatini, D. Adjeroh, and S. T. C. Wong, “Dendritic spine detection using curvilinear structure detector and LDA classifier,” NeuroImage, vol. 36, no. 2, pp. 346–360, 2007.

[21] F. Janoos, K. Mosaliganti, X. Xu, R. Machiraju, K. Huang, and S. T. C. Wong, “Robust 3D reconstruction and identification of dendritic spines from optical microscopy imaging,” Medical Image Analysis, vol. 13, no. 1, pp. 167–179, 2009.

[22] T. He, Z. Xue, and S. T. C. Wong, “A novel approach for three dimensional dendrite spine segmentation and classification,” in Medical Imaging 2012: Image Processing, vol. 8314 of Proceedings of SPIE, San Diego, Calif, USA, February 2012.

[23] P. Shi, Y. Huang, and J. Hong, “Automated three-dimensional reconstruction and morphological analysis of dendritic spines based on semi-supervised learning,” Biomedical Optics Express, vol. 5, no. 5, pp. 1541–1553, 2014.

[24] S. Reid, C. Lu, I. Casikar et al., “Prediction of pouch of Douglas obliteration in women with suspected endometriosis using a new real-time dynamic transvaginal ultrasound technique: the sliding sign,” Ultrasound in Obstetrics & Gynecology, vol. 41, no. 6, pp. 685–691, 2013.

[25] S. Reid, C. Lu, I. Casikar et al., “The prediction of pouch of Douglas obliteration using offline analysis of the transvaginal ultrasound ‘sliding sign’ technique: inter- and intra-observer reproducibility,” Human Reproduction, vol. 28, no. 5, pp. 1237–1246, 2013.

[26] Y.-H. Wang, W.-N. Liu, A.-H. Chen, and Y. Wang, “Nonlinear dim target enhancement algorithm based on partial differential equation,” Journal of Dalian Maritime University, vol. 34, no. 2, pp. 57–60, 2008.

[27] L. Chen, J. H. Zhang, S. Y. Chen, Y. Lin, C. Y. Yao, and J. W. Zhang, “Hierarchical mergence approach to cell detection in phase contrast microscopy images,” Computational and Mathematical Methods in Medicine, vol. 2014, Article ID 758587, 10 pages, 2014.

[28] N. Otsu, “A threshold selection method from gray-level histograms,” IEEE Transactions on Systems, Man and Cybernetics, vol. 9, no. 1, pp. 62–66, 1979.

[29] P.-S. Liao, T.-S. Chen, and P.-C. Chung, “A fast algorithm for multilevel thresholding,” Journal of Information Science and Engineering, vol. 17, no. 5, pp. 713–727, 2001.

[30] L. H. Yang, X. You, R. M. Haralick, I. T. Phillips, and Y. Y. Tang, “Characterization of Dirac edge with new wavelet transform,” in Proceedings of the 2nd International Conference on Wavelets and Applications, vol. 1, pp. 872–878, Hong Kong, December 2001.

[31] Y. Y. Tang and X. G. You, “Skeletonization of ribbon-like shapes based on a new wavelet function,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 9, pp. 1118–1133, 2003.

[32] Y. D. Zhang, S. H. Wang, G. L. Ji, and P. Phillips, “Fruit classification using computer vision and feedforward neural network,” Journal of Food Engineering, vol. 143, pp. 167–177, 2014.

[33] S. Wang, Y. Zhang, Z. Dong et al., “Feed-forward neural network optimized by hybridization of PSO and ABC for abnormal brain detection,” International Journal of Imaging Systems and Technology, vol. 25, no. 2, pp. 153–164, 2015.

[34] G. Yang, Y. Zhang, J. Yang et al., “Automated classification of brain images using wavelet-energy and biogeography-based optimization,” Multimedia Tools and Applications, 2015.

[35] D. Guo, Y. Zhang, Q. Xiang, and Z. Li, “Improved radiofrequency identification indoor localization method via radial basis function neural network,” Mathematical Problems in Engineering, vol. 2014, Article ID 420482, 9 pages, 2014.

[36] X. Jin and C. H. Davis, “Vehicle detection from high-resolution satellite imagery using morphological shared-weight neural networks,” Image and Vision Computing, vol. 25, no. 9, pp. 1422–1431, 2007.

[37] Z. Chen and S. Molloi, “Automatic 3D vascular tree construction in CT angiography,” Computerized Medical Imaging and Graphics, vol. 27, no. 6, pp. 469–479, 2003.

[38] Y. Zhang, Z. Dong, S. Wang, G. Ji, and J. Yang, “Preclinical diagnosis of magnetic resonance (MR) brain images via discrete wavelet packet transform with tsallis entropy and generalized eigenvalue proximate support vector machine (GEPSVM),” Entropy, vol. 17, no. 4, pp. 1795–1813, 2015.

[39] Y. Zhang, S. Wang, P. Sun et al., “Pathological brain detection based on wavelet entropy and Hu moment invariants,” Bio-Medical Materials and Engineering, vol. 26, supplement 1, pp. S1283–S1290, 2015.

[40] Y. Zhang, S. Wang, P. Phillips, Z. Dong, G. Ji, and J. Yang, “Detection of Alzheimer’s disease and mild cognitive impairment based on structural volumetric MR images using 3D-DWT and WTA-KSVM trained by PSOTVAC,” Biomedical Signal Processing and Control, vol. 21, pp. 58–73, 2015.

[41] S. Wang, Y. Zhang, G. Ji, J. Yang, J. Wu, and L. Wei, “Fruit classification by wavelet-entropy and feedforward neural network trained by fitness-scaled chaotic ABC and biogeography-based optimization,” Entropy, vol. 17, no. 8, pp. 5711–5728, 2015.

[42] Y. Zhang, Z. Dong, P. Phillips et al., “Detection of subjects and brain regions related to Alzheimer’s disease using 3D MRI scans based on eigenbrain and machine learning,” Frontiers in Computational Neuroscience, vol. 9, article 66, 15 pages, 2015.

[43] S. Wang, X. Yang, Y. Zhang, P. Phillips, J. Yang, and T.-F. Yuan, “Identification of green, oolong and black teas in China via wavelet packet entropy and fuzzy support vector machine,” Entropy, vol. 17, no. 10, pp. 6663–6682, 2015.


Review Article

An Overview of Biomolecular Event Extraction from Scientific Documents

1 MindLab Research Laboratory, Universidad Nacional de Colombia, Bogotá, Colombia

2 DETI/IEETA, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal

Correspondence should be addressed to Sérgio Matos; aleixomatos@ua.pt

Received 13 May 2015; Revised 10 August 2015; Accepted 18 August 2015

Academic Editor: Chuan Lu

Copyright © 2015 Jorge A. Vanegas et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This paper presents a review of state-of-the-art approaches to automatic extraction of biomolecular events from scientific texts. Events involving biomolecules such as genes, transcription factors, or enzymes, for example, have a central role in biological processes and functions and provide valuable information for describing physiological and pathogenesis mechanisms. Event extraction from biomedical literature has a broad range of applications, including support for information retrieval, knowledge summarization, and information extraction and discovery. However, automatic event extraction is a challenging task due to the ambiguity and diversity of natural language and higher-level linguistic phenomena, such as speculations and negations, which occur in biological texts and can lead to misunderstanding or incorrect interpretation. Many strategies have been proposed in the last decade, originating from different research areas such as natural language processing, machine learning, and statistics. This review summarizes the most representative approaches in biomolecular event extraction and presents an analysis of the current state of the art and of commonly used methods, features, and tools. Finally, current research trends and future perspectives are also discussed.

1 Introduction

The scientific literature is the most important medium for disseminating new knowledge in the biomedical domain. Thanks to advances in computational and biological methods, the scale of research in this domain has changed remarkably, reflected in an exponential increase in the number of scientific publications [1]. This has made it harder than ever for scientists to find, manage, and exploit all relevant studies and results related to their research field [1]. Because of this, there is growing awareness that automated exploitation tools for this kind of literature are needed [2]. To address this need, natural language processing (NLP) and text mining (TM) techniques are rapidly becoming indispensable tools to support and facilitate biological analyses and the curation of biological databases. Furthermore, the development of this kind of tools has enabled the creation of a variety of applications, including domain-specific semantic search engines and tools to support the creation and annotation of pathways or for automatic population and enrichment of databases [3–5].

Initial efforts in biomedical TM focused on the fundamental tasks of detecting mentions of entities of interest and linking these entities to specific identifiers in reference knowledge bases [6, 7]. Although entity normalization remains an active research challenge, due to the high level of ambiguity in entity names, some existing tools offer performance levels that are sufficient for many information extraction applications [6]. In recent years there has been increased interest in the identification of interactions between biologically relevant entities, including, for instance, drug-drug [8] or protein-protein interactions (PPIs) [9]. Amongst these, the identification of PPIs mentioned in the literature has received most attention, encouraged by their importance in systems biology and by the necessity to accelerate the population of numerous PPI databases.

Following the advances achieved in PPI extraction, it became relevant to automatically extract more detailed descriptions of protein related events that depict protein characteristics and behavior under certain conditions. Such events, including expression, transcription, localization, binding, or regulation, among others, play a central role in the understanding of biological processes and functions and provide insight into physiological and pathogenesis mechanisms. Automatically creating structured representations of these textual descriptions allows their use in information retrieval and question answering systems, for constructing biological networks composed of such events [2], or for inferring new associations through knowledge discovery. Unfortunately, extraction of this kind of biological information is a challenging task due to several factors: firstly, the biological processes described are generally complex, involving multiple participants which may be individual entities such as genes or proteins, groups, or families, or even other biological processes; secondly, sentences describing these processes are long and in many cases have long-range dependencies; and, finally, biological text is also rich in higher level linguistic phenomena, such as speculation and negation, which may cause misinterpretation of the text if not handled properly [1, 9].

Figure 1: Example of a complex biomolecular event extracted from a text fragment. A recursive structure, composed of two types of events, is presented: Positive Regulation and Expression.

http://dx.doi.org/10.1155/2015/571381
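To make the notion of a structured, recursive event representation more tangible, the following sketch models an event whose participants may be entities or other events, in the spirit of the Positive Regulation over Expression example of Figure 1. The class names, field names, role labels as plain strings, and the placeholder entities "protein A" and "gene B" are hypothetical and serve only as an illustration, not as the format of any particular corpus or tool.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union

@dataclass
class Entity:
    text: str   # surface form in the sentence
    kind: str   # entity type, e.g. "Protein"

@dataclass
class Event:
    kind: str       # event type, e.g. "Expression"
    trigger: str    # expression in the text that signals the event
    # Each argument is a (role, participant) pair; a participant may itself
    # be an Event, which is what makes the structure recursive.
    arguments: List[Tuple[str, Union["Entity", "Event"]]] = field(
        default_factory=list)

def depth(node):
    """Nesting depth: 0 for an entity, 1 for an event over entities only."""
    if isinstance(node, Entity):
        return 0
    return 1 + max((depth(arg) for _, arg in node.arguments), default=0)

# Hypothetical fragment: "protein A induces expression of gene B".
protein_a = Entity("protein A", "Protein")
gene_b = Entity("gene B", "Protein")
expression = Event("Expression", "expression", [("Theme", gene_b)])
regulation = Event("Positive_regulation", "induces",
                   [("Cause", protein_a), ("Theme", expression)])
```

Here the regulation event takes an entity as its Cause and a whole Expression event as its Theme, so its nesting depth is 2.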

This review summarizes the different approaches used to address the extraction and formalization of biomolecular events described in scientific texts. The downstream impact of these advances, namely, for network extraction, for pharmacogenomics studies, and in systems biology and functional genomics, has been highlighted in recent reviews [2, 4, 10], which have also described various end-user systems developed on top of these technologies. This review focuses on the methodological aspects, describing the available resources and tools as well as the features, algorithms, and pipelines used to address this information extraction task, specifically for protein related events, which have received the most attention. We present and discuss the most representative methods currently available, describing the advantages, disadvantages, and specific characteristics of each strategy. The most promising directions for future research in this area are also discussed.

The contents of this paper are organized as follows: we start by introducing biomolecular events and defining the event extraction task; we then describe the event extraction steps, present commonly used frameworks, text processing and NLP tools, and resources, and compare the different approaches used to address this task; in the following section we compare the performance of the proposed methods and systems, followed by a discussion regarding the most relevant aspects; finally, we present some concluding remarks in the last section.

2 Biomolecular Events

In the biomedical domain, an event refers to the change of state of one or more biomedical entities, such as proteins, cells, and chemicals [11]. In its textual description, an event is typically referenced through a trigger expression that specifies the event and indicates its type. These triggers are generally verbal forms (e.g., “stimulates”) or nominalizations of verbs (e.g., “expression”) and may occur as a single word or as a sequence of words. This textual description also includes the entities involved in the event, referred to as participants, and possibly additional information that further specifies the event, such as a particular cell type in which the described event was observed. Biomolecular events may describe the change of a single gene or protein, therefore having only one participant denoting the affected entity, or may have multiple participants, such as the biomolecules involved in a binding process. Additionally, an event may act as a participant in a more complex event, as in the case of regulation events, requiring the detection of recursive structures.

Extraction of event descriptions from scientific texts has attracted substantial attention in the last decade, namely, for those events involving proteins and other biomolecules. This task requires the determination of the semantic types of the events, identifying the event participants, which may be entities (e.g., proteins) or other events, their corresponding semantic role in the event, and finally the encoding of this information using a particular formalism. This structured definition of events is associated with an ontology that defines the types of events and entities, semantic roles, and also any other attributes that may be assigned to an event. Examples of ontologies for describing biomolecular events include the GENIA Event Ontology [11] and Gene Ontology [12].

Figure 1 presents an example of a complex event described in the text fragment “TNF-alpha is a rapid activator of IL-8 gene expression by ...” From this fragment we can construct a recursive structure composed of two events: a first event, of type Expression, denoted by the trigger word “expression,” that has a single argument (“IL-8”) with the role Theme (denoting that this is the participant affected by the event), and a second event of type Positive Regulation, defined by the trigger word “activator.” This second event has two participants: the protein “TNF-alpha” with the role Cause (defining that this protein is the cause of the event) and the first event with the role Theme.
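
The recursive structure described above can be sketched as a small data model. This is only an illustration under stated assumptions: the class names `Entity`, `Argument`, and `Event` and the helper `depth` are hypothetical and do not correspond to any formalism or tool cited in this review.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Entity:
    text: str   # surface form, e.g. "IL-8"
    type: str   # entity type, e.g. "Protein"


@dataclass
class Argument:
    role: str                        # semantic role: "Theme" or "Cause"
    filler: Union[Entity, "Event"]   # an entity or a nested event


@dataclass
class Event:
    type: str      # event type defined by the ontology
    trigger: str   # trigger expression found in the text
    arguments: List[Argument] = field(default_factory=list)


def depth(event: Event) -> int:
    """Nesting depth: 1 for a flat event, plus 1 per level of nested events."""
    nested = [depth(a.filler) for a in event.arguments if isinstance(a.filler, Event)]
    return 1 + max(nested, default=0)


# Fragment: "TNF-alpha is a rapid activator of IL-8 gene expression"
expression = Event(
    "Expression", "expression", [Argument("Theme", Entity("IL-8", "Protein"))]
)
regulation = Event(
    "Positive Regulation",
    "activator",
    [
        Argument("Cause", Entity("TNF-alpha", "Protein")),
        Argument("Theme", expression),  # an event acting as participant
    ],
)
```

Allowing an argument to hold either an entity or another event is what makes the structure recursive; a flat relation format (entity pairs only) could not represent the Positive Regulation event.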


Figure 2: Overall pipeline of a biomedical event extraction solution. Joint prediction methods merge steps 3 and 4 in a single step. The corresponding reference paper for each tool and method is also identified [13–50]. Labels recoverable from the figure: preprocessing and feature extraction; syntactic parsing (dependency parsing, phrase structure and deep parsing): Gdep parser [13], Charniak-Johnson/McClosky [14, 15], Bikel parser [16], Stanford parser [17], Enju-GENIA [18], ERG [19]; frameworks: NLTK [22], Stanford CoreNLP [23]; tools: ISimp [20], GENIA tagger [21]; Gimli [27]; Zhang et al. [43]; lexicons: BioLexicon [39], WordNet [49], UMLS [40]; edge detection (step 4): SVM-multiclass [45], LIBLINEAR [46]; postprocessing (step 5): Stanford CoreNLP [23], SVM-rank [50].

3 Event Extraction

Figure 2 illustrates a common event extraction pipeline, identifying the most popular tools, models, and resources used in each stage. The two initial stages are usually preprocessing and feature extraction, followed by the identification of named entities. The next step is to perform event detection. This step is frequently divided into two separate stages: trigger detection, which consists of the identification of event triggers and their type, and edge detection (or event construction), which is focused on associating event triggers with their arguments. Some authors, on the other hand, have addressed event detection in a single, joint prediction step. These approaches tackle the cascading errors that occur with the two-stage methods and have commonly shown improved performance. Finally, a postprocessing stage is usually present, to refine and complete the candidate event structures. Negation or speculation detection may also be included in this final step. This section describes each phase, presenting the most commonly used approaches.

3.1 Corpora for Event Extraction. The development and improvement of information extraction systems usually requires the existence of manually annotated text collections, or corpora. This is mostly true for supervised machine learning methods, but annotated data can also be exploited for inferring patterns to be used in rule-based approaches. In the case of biomedical event extraction, various corpora have been compiled, including corpora annotated with protein-protein interactions.

3.1.1 GENIA Event Corpus. The GENIA Event corpus contains human-curated annotations of complex, nested, and typed event relations [51, 52]. The GENIA corpus [53] is composed of 1,000 paper abstracts from Medline. It contains 9,372 sentences from which 36,114 events are identified. This corpus is provided by the organizers of the BioNLP shared task to participants as the main resource for training and evaluation and is publicly available online (http://www.nactem.ac.uk/aNT/genia.html).

3.1.2 BioInfer Corpus. BioInfer (Biomedical Information Extraction Resource) (http://www.it.utu.fi/BioInfer) [54] is a public resource providing a manually annotated corpus and related resources for information extraction in the biomedical domain.

The corpus is composed of 1100 sentences from abstracts of biomedical research articles, annotated for relationships, named entities, and syntactic dependencies. It is annotated with protein, gene, and RNA relationships and serves as a resource for the development of information extraction systems and their components, such as parsers and domain analyzers.

3.1.3 Gene Regulation Event Corpus. The Gene Regulation Event Corpus (GREC) (http://www.nactem.ac.uk/GREC/) [55] consists of 240 MEDLINE abstracts, in which events relating to gene regulation and expression have been annotated by biologists. This corpus has the particularity that not only the core relations between entities are annotated, but also a range of other important details about these relationships, for example, location, temporal, manner, and environmental conditions.

3.1.4 GeneReg Corpus. The GeneReg Corpus [56] consists of 314 MEDLINE abstracts containing 1770 pairwise relations denoting gene expression regulation events in the model organism E. coli. The corpus annotation is compatible with the GENIA event corpus and with in-domain and out-of-domain lexical resources.

3.1.5 PPI Corpora. Although not as richly annotated as event corpora, protein-protein interaction corpora may be considered for complementing the available training data. The most relevant PPI corpora are the LLL corpus [57], the AIMed corpus [58], and the BioCreative PPI corpus [7].

3.2 Preprocessing and Feature Extraction. Preprocessing is a required step in any text mining pipeline. This includes reading the data from its original format to an internal representation, and extracting features, which usually involves some level of text or language processing. In the specific case of event extraction, preprocessing may also involve resolving coreferences [59] or applying some form of sentence simplification [60], for example, by expanding conjunctions, in order to improve the extraction results.

3.2.1 Preprocessing Tools

Frameworks. In order to derive a feature representation from texts, it is necessary to perform text processing involving a set of common NLP tasks, going from sentence segmentation and tokenization, to part-of-speech tagging, chunking, and linguistic parsing. Various text processing frameworks exist that support these tasks, among which the following stand out: NLTK (http://www.nltk.org/), Apache OpenNLP (https://opennlp.apache.org/), and Stanford CoreNLP (http://nlp.stanford.edu/software/corenlp.shtml) (Figure 2).

Syntactic Parsers. A syntactic parser assigns a tree or graph structure to a free text sentence. These structures establish relations or dependencies between the organizing verb and its dependent arguments and have been useful for many applications, such as negation detection and disambiguation, among others. Syntactic parsers can be categorized in three groups: dependency parsers, phrase structure parsers, and deep parsers [61]. The aim of dependency parsers is to compute a tree structure of a sentence where nodes are words and edges represent the relations among words; phrase structure parsers focus on identifying phrases and their recursive structure; and deep parsers express deeper relations by computing theory-specific syntactic/semantic structures. For the task of event extraction, several implementations of each parser group have been used, as shown in Figure 2.

3.2.2 Features. One of the main requirements of a good event extraction system is a rich feature representation. Most event extraction systems present a complex set of features extracted from tokens, sentences, dependency parsing trees, and external resources. Table 1 summarizes the features commonly extracted in this processing stage and indicates their use in the event extraction process.

Table 1: Most common features used in the main event detection stages.

(i) Token-based features capture specific knowledge regarding each token, such as syntactic or linguistic features, namely, part-of-speech (POS) and the lemma of each token, and features based on orthographic (e.g., presence of capitalization, punctuation, and numeric or special characters) [42, 43, 62–68] and morphological information, namely, prefixes, suffixes, and character n-grams [42, 43, 64, 67, 69–72].

(ii) Contextual features provide general characteristics of the sentence or neighborhood where the target token is present. Features extracted from sentences include the number of tokens in the sentence [42], the number of named entities in the sentence, and bag-of-word counts of all words [43, 64]. Local context is usually encoded through windows or conjunctions of features, including POS tags, lemmas, and word n-grams, extracted from the words around the target token [42, 63, 65, 73].

(iii) Dependency parsing provides information about grammatical relationships involving two words, extracted from a graph representation of the dependency relations in a sentence. Commonly used features include the number or type of dependency hops between two tokens, and the sequence or n-grams of words, lemmas, or POS tags in the dependency path between two tokens [65, 68, 72, 74]. These features are usually extracted between two entities in a sentence [64, 75], or between a candidate trigger and an entity [75].

(iv) Finally, it is also common to encode domain knowledge as features using external resources such as lexicons of possible trigger words and of gene and protein names to indicate the presence of a candidate trigger or entity [27, 76–78]. Also, the token representation is often expanded with related words according to some semantic relations such as WordNet hypernyms [27, 77, 79].
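
As a rough illustration of the token-based, contextual, and lexicon features listed above, the sketch below builds a feature dictionary for one token using only the Python standard library. The function name `token_features`, the feature names, and the `<PAD>` marker are assumptions made for this example; POS tags, lemmas, and dependency-path features are left out because they require an external tagger or parser.

```python
import re
from typing import Dict, List, Set


def token_features(
    tokens: List[str], i: int, trigger_lexicon: Set[str], window: int = 1
) -> Dict[str, object]:
    """Build a simple feature dictionary for the token at position i."""
    tok = tokens[i]
    lower = tok.lower()
    feats: Dict[str, object] = {
        # orthographic features
        "has_capital": any(c.isupper() for c in tok),
        "has_digit": any(c.isdigit() for c in tok),
        "has_punct": bool(re.search(r"[^\w\s]", tok)),
        # morphological features
        "prefix3": lower[:3],
        "suffix3": lower[-3:],
        "char_3grams": [lower[k:k + 3] for k in range(len(lower) - 2)],
        # external resource: membership in a lexicon of trigger words
        "in_trigger_lexicon": lower in trigger_lexicon,
        # sentence-level context
        "sentence_length": len(tokens),
    }
    # local context: lowercased words in a window around the target token
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        feats[f"word[{offset:+d}]"] = tokens[j].lower() if 0 <= j < len(tokens) else "<PAD>"
    return feats


sentence = ["RFLAT-1", "activates", "RANTES", "gene", "expression"]
lexicon = {"activates", "expression"}  # toy lexicon, for illustration only
features = token_features(sentence, 1, lexicon)
```

Dictionaries of this shape can be turned into sparse vectors by most machine learning toolkits, which is one reason this representation is convenient for the classifiers discussed later.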

3.3 Entity Recognition. Entity recognition consists of the detection of references (or mentions) to entities, such as genes or proteins, in natural language text and labeling them with their location and type. Named-entity recognition in the biomedical domain is generally considered to be more difficult than in other domains, for several reasons: first, there are millions of entity names in use [71] and new ones are added constantly, implying that dictionaries cannot be sufficiently comprehensive; second, the biomedical field is evolving too quickly to allow reaching a consensus on the name to be used for a given entity [80] or even regarding the exact concept defined by the entity itself. As a result, the same name or acronym can be used for different concepts [81].

Several entity recognition systems for the biomedical domain have been developed in the last decade. Much of this work has focused on the recognition of gene and protein names and, more recently, chemical compounds [82]. In these cases, machine learning strategies using rich sets of features have provided the best results, with performances in the order of 85% F-measure [83].

The most popular entity recognition tools are shown in Figure 2, which also lists the biomedical lexicons that are commonly used, either in dictionary-matching approaches or as features for machine learning. Some of these tools, namely, BANNER [36] and Gimli [27], offer simple interfaces for training new models and have been applied to the recognition of various entity types such as chemical compounds and diseases.

3.4 Trigger Detection. Trigger word detection is the event extraction task that has attracted most research interest. It is a crucial task, since the effectiveness of the following tasks strongly depends on the information generated in this step. This task consists of identifying the chunk of text that triggers the event and serves as predicate. Although trigger words are not restricted to a particular set of part-of-speech tags, verbs (e.g., “activates”) and nouns (e.g., “expression”) are the most common. Furthermore, a trigger may consist of multiple consecutive words.

Figure 3: Trigger detection for two example sentences: (a) “RFLAT-1 activates RANTES gene expression” and (b) “Inhibition of LITAF mRNA expression in THP-1 cells resulted in a reduction of TNF-alpha transcripts.”

Table 2: Most relevant work addressing the problem of trigger detection. Studies are listed in chronological order and the different approaches are classified in three main groups: rule-based, dictionary-based, and ML-based strategies. L: linear kernel; R: radial basis function kernel; P: polynomial kernel; C: convolution tree kernel; CS: cosine similarity.

Figure 3 illustrates the expected results of the trigger detection process in two example sentences. As we can see in Figure 3, trigger detection involves the identification of event triggers and their type, as specified by the selected ontology. In sentence (a), two different kinds of events are identified: the trigger word “activates” defines an event of type Positive Regulation and the trigger word “expression” defines an event of type Gene Expression. Sentence (b) illustrates the difficulty of this task: it shows that short sentences can contain various related events; that triggers may be expressed in diverse ways (two events of type Negative Regulation are defined with different trigger words); and, finally, that the same trigger word (“expression”) may indicate different types of event, depending on the context.

The various approaches proposed for trigger detection can be roughly categorized in three types: rule-based, dictionary-based, and machine learning-based. These approaches are summarized in Table 2 and presented in the remainder of this section.

3.4.1 Patterns and Matching Rules for Trigger Detection. There are several strategies based on patterns [70, 93] and matching rules. Rule-based methods commonly follow some manually defined linguistic patterns, which are then augmented with additional constraints based on word forms and syntactic categories to improve matching precision. The main advantage of this kind of approach is that it usually requires little computational effort. Rule-based event extraction systems consist of a set of rules that are manually defined or generated from training data. For instance, Casillas et al. [88] present a strategy based on Kybots (Knowledge Yielding Robots), which are abstract patterns that detect actual concept instances and relations in a document. These patterns are defined in a declarative format, which allows definition of variables, relations, and events. Vlachos et al. [76] present a domain-independent approach based on the output of a syntactic parser and standard linguistic processing (namely, stemming, lemmatization, and part-of-speech (POS) tagging, among others), augmented by rules acquired from the development data in an unsupervised way, avoiding the need to use explicitly annotated training data.

In the dictionary-based approach, a dictionary containing trigger words with their corresponding classes (event types) is used to identify and assign event triggers. Van Landeghem et al. [74] proposed a strategy following this approach, using a set of manually cleaned dictionaries and a formula to calculate the importance of each trigger word for a particular event. This is required since the same word may be associated with events of different types [66]. For instance, in the BioNLP’09 Shared Task dataset [51], the token “overexpression” appears as trigger for the gene expression event in about 30% of its occurrences, while the other 70% of occurrences are triggers for positive or negative regulation events.

Many strategies combine both approaches. For instance, Le Minh et al. [70] present a strategy where rule-based and dictionary-based approaches are combined. First, they select tokens that have appropriate POS tags and occur near a protein mention and then apply heuristic rules extracted from a training corpus to identify candidate triggers. Finally, a dictionary built from the training corpus and containing trigger words and their corresponding classes is used to classify candidate triggers. For ambiguous trigger classes, the class with the highest rate of occurrence is selected. Kilicoglu and Bergler [93] also present a combined strategy based on a linguistically inspired rule-based and syntax-driven methodology, using a dictionary based on trigger expressions collected from the training corpus. Events are then fully specified through syntactic dependency based heuristics, starting from the triggers detected by the dictionary-matching step.

Pattern-based methods usually present low recall rates, since defining comprehensive patterns would require extensive efforts, and because the most common patterns are too rigid to capture semantic/syntactic paraphrases.
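
The dictionary-based idea can be sketched as follows: a dictionary is built from annotated training tokens, and an ambiguous trigger word is assigned the class it is most frequently annotated with. This is a minimal sketch and not the method of any cited system; the function names, the `min_trigger_rate` threshold, and the toy counts are assumptions introduced for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional, Tuple


def build_trigger_dictionary(
    annotated: Iterable[Tuple[str, Optional[str]]]
) -> Dict[str, Counter]:
    """Count how often each token occurs with each event type.

    A label of None means the occurrence was not annotated as a trigger.
    """
    counts: Dict[str, Counter] = defaultdict(Counter)
    for token, event_type in annotated:
        counts[token.lower()][event_type] += 1
    return counts


def classify_trigger(
    token: str, dictionary: Dict[str, Counter], min_trigger_rate: float = 0.5
) -> Optional[str]:
    """Return the most frequent event type for a token, or None."""
    counts = dictionary.get(token.lower())
    if not counts:
        return None  # token never seen in the training data
    total = sum(counts.values())
    as_trigger = total - counts[None]
    if as_trigger == 0 or as_trigger / total < min_trigger_rate:
        return None  # token is rarely (or never) a trigger
    typed = {t: c for t, c in counts.items() if t is not None}
    return max(typed, key=typed.get)


# Toy training annotations (invented counts, for illustration only).
toy_annotations = (
    [("overexpression", "Gene Expression")] * 3
    + [("overexpression", "Positive Regulation")] * 5
    + [("overexpression", "Negative Regulation")] * 2
    + [("binding", "Binding")] * 1
    + [("binding", None)] * 3
)
dictionary = build_trigger_dictionary(toy_annotations)
```

Picking the single most frequent class is cheap but discards context, which is the limitation that motivates the machine learning approaches discussed next.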

3.4.2 Machine Learning-Based Approach to Trigger Detection. The most recent and successful approaches to trigger word detection are based on machine learning methods [72], with most work defining this as a sequence-labeling problem. The definition of event types, on the other hand, is addressed as a multiclass task, where candidate event triggers are classified into one of the predefined types of biomedical events.

In order to address these problems, several probabilistic techniques have been proposed, using, for example, Hidden Markov Models (HMMs), Maximum Entropy Markov Models (MEMMs), Conditional Random Fields (CRFs) [94, 95], and Support Vector Machines (SVMs).

For instance, Zhou and He [89] proposed treating trigger identification as a sequence-labeling problem and used the Maximum Entropy Markov Model (MEMM) to detect trigger words. MEMM is based on the concept of a probabilistic finite state model such as HMM but consists of a discriminative model that assumes the unknown values to be learnt are connected in a Markov chain rather than being conditionally independent of each other. Similarly, various strategies based on Conditional Random Fields (CRFs) have been proposed [42, 73, 85, 86]. CRFs have become a popular method for sequence-labeling problems, justified mainly by the fact that CRFs avoid the label bias problem present in MEMMs [96] but preserve all the other advantages. Unlike Hidden Markov Models (HMMs), a CRF is a discriminative model: CRFs use conditional probability for inference, meaning that they maximize 𝑝(𝑦 | 𝑥) directly, where 𝑥 is the input sequence and 𝑦 is the sequence of output labels, whereas HMMs maximize the joint probability 𝑝(𝑥, 𝑦). This relaxes the strong independence assumptions required to learn the parameters of generative models.
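
For reference, the contrast between the two objectives can be written out using the standard textbook forms of a first-order HMM and a linear-chain CRF. The symbols $T$ (sequence length), $f_k$ (feature functions), $\lambda_k$ (feature weights), and $Z(x)$ (normalization term) are conventional notation assumed here and are not taken from the works reviewed.

```latex
% Generative model (first-order HMM): models the joint probability
p(x, y) = \prod_{t=1}^{T} p(y_t \mid y_{t-1}) \, p(x_t \mid y_t)

% Discriminative model (linear-chain CRF): models the conditional directly
p(y \mid x) = \frac{1}{Z(x)} \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \Big),
\qquad
Z(x) = \sum_{y'} \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \Big)
```

Because each $f_k$ may inspect the whole input sequence $x$, a CRF can use overlapping features such as those described in Section 3.2.2 without modeling how $x$ itself is generated. The normalization over complete label sequences in $Z(x)$, rather than per state as in MEMMs, is what avoids the label bias problem.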

The most recent proposals for trigger detection are based on Support Vector Machines (SVMs). SVMs do not follow a probabilistic approach but are instead maximum margin classifiers that try to find the maximal separation between classes. This classifier has presented very good results, showing a higher generalization performance than CRFs. However, training complex SVM models may require excessive computational time and memory overhead. Several strategies using different SVM implementations and kernels have been proposed.

The general approach is to classify initial candidate triggers as positive or not, based on a set of carefully selected features and a training set with annotated events. For instance, Björne et al. [80, 86, 97] proposed a solution based on the SVM-multiclass (http://www.cs.cornell.edu/people/tj/svm_light/svm_multiclass.html) implementation with a linear kernel, optimized by exploring in an exhaustive grid search the C-parameter that maximizes the F-score in trigger detection. In this study only linear kernels were used since the size and complexity of the training set, composed of over 30 thousand instances and nearly 300 thousand features, hinders the application of more computationally demanding alternatives, namely, radial basis function kernels.
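
The parameter search described above can be sketched generically as follows. This is not the implementation used in the cited studies: `train_and_predict` is a hypothetical callback that would wrap whichever SVM implementation is used (train with a given C, then return predictions for a development set), and the micro-averaged F-score over non-negative classes is one plausible reading of "F-score in trigger detection".

```python
from typing import Callable, Iterable, Optional, Sequence, Tuple


def f_score(gold: Sequence[str], predicted: Sequence[str], negative: str = "None") -> float:
    """Micro-averaged F1 over all classes except the negative class."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == p and g != negative)
    n_pred = sum(1 for p in predicted if p != negative)
    n_gold = sum(1 for g in gold if g != negative)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


def grid_search_c(
    c_values: Iterable[float],
    train_and_predict: Callable[[float], Sequence[str]],
    dev_gold: Sequence[str],
) -> Tuple[Optional[float], float]:
    """Try every C value and keep the one with the best development F-score."""
    best_c, best_f = None, -1.0
    for c in c_values:
        score = f_score(dev_gold, train_and_predict(c))
        if score > best_f:
            best_c, best_f = c, score
    return best_c, best_f


# Toy usage with canned predictions standing in for a real classifier.
dev_gold = ["Expression", "None", "Binding", "None"]
canned = {
    0.1: ["Expression", "Binding", "None", "None"],
    1.0: ["Expression", "None", "Binding", "None"],
}
best_c, best_f = grid_search_c([0.1, 1.0], lambda c: canned[c], dev_gold)
```

Selecting C on a development set rather than on the training data keeps the reported F-score from being optimistically biased.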

In addition to purely supervised learning, which depends on the amount and quality of annotated data, semisupervised approaches have also been proposed. Wang et al. [65] combined labeled data with large amounts of unlabeled data, using a rich representation based on semantic features (such as walk subsequence features and n-gram features, among others) and a new representation based on Event Feature Coupling Generalization (EFCG). EFCG is a strategy to produce higher-level features based on two kinds of original features: class-distinguishing features (CDFs), which have the ability to distinguish the different classes, and example-distinguishing features (EDFs), which are good at indicating the specific examples. EFCG generates a new set of features by combining these two kinds of features and taking into account a degree of relatedness between them.

A different strategy was followed by Martinez et al., who presented a solution based on word-sense disambiguation (WSD) using a combined CRF-VSM (Vector Space Model) classifier, where the output of VSM is incorporated as a feature into the CRF [73]. This approach significantly improved on the performance of each method used separately.

3.5 Edge Detection. Edge detection (also known as event theme construction or event argument identification) is the task of predicting arguments for an event, which may be named entities (i.e., genes and proteins) or another event, represented by another trigger word. Event arguments are graphically represented through directed edges from the trigger word for the event to the argument. These edges also express the semantic role that a participant (entity or event) plays in a given event. In Figure 4, sentence (a) illustrates a basic event defined by the trigger word “Phosphorylation” that denotes an event of type Phosphorylation. The directed edge between this trigger word and the entity TRAF2, denoting a relation of type Theme, indicates that this entity is the affected participant in this event. It is important to note that events can act as participants in other events, thus allowing the construction of complex conceptual structures. For example, consider sentence (c), where two events are mentioned: a first event of type Expression and a second event of type Positive Regulation. The directed edge from the trigger word “activator” to the trigger word “expression” denotes that the event Expression is affected directly by the event Positive Regulation. Similarly, the edge of type Cause between “activator” and the entity TNFalpha indicates that this is the causing participant for this event.

Different approaches have been suggested to tackle the edge detection task, including rule and dictionary-based strategies and machine learning-based methods. These are summarized in Table 3 and described in the following subsections.

3.5.1 Patterns and Matching Rules for Edge Detection. These strategies are based on the identification of edges according to a set of rules that can be manually defined or generated from training data. Among the most basic approaches, we find the strategy proposed by MacKinlay et al. [85], in which a specific set of hand-coded grammars, supported by specific domain knowledge like named entity annotations and lexicons, is defined for each type of event. In the case of basic events a simple distance criterion is applied, assigning the closest protein as the theme of the event, while extra criteria are required for more complex events. For instance, to assign the Theme arguments for binding events, the maximum distance away from the trigger word(s) and the maximum number of possible themes are estimated, and for regulation events, in addition to the maximum distance, some priority rules are used to define Cause or Theme arguments.
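
The distance criterion for basic events can be illustrated with a few lines of code. This is a simplified sketch rather than the grammar-based system of MacKinlay et al.: token indices stand in for text positions, and the function name `closest_protein` and the optional `max_distance` cutoff are assumptions made for the example.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, str]  # (token index, surface text)


def closest_protein(
    trigger_index: int, proteins: List[Span], max_distance: Optional[int] = None
) -> Optional[Span]:
    """Pick the protein mention closest to the trigger as its Theme."""
    candidates = [
        p for p in proteins
        if max_distance is None or abs(p[0] - trigger_index) <= max_distance
    ]
    if not candidates:
        return None  # no protein close enough: no event is produced
    return min(candidates, key=lambda p: abs(p[0] - trigger_index))


# Sentence (a) of Figure 3: "RFLAT-1 activates RANTES gene expression"
proteins = [(0, "RFLAT-1"), (2, "RANTES")]
theme = closest_protein(4, proteins)  # trigger "expression" is token 4
```

Here the heuristic assigns RANTES as the Theme of the Expression event, which matches the intended reading of the sentence. The same rule applied blindly to the trigger "activates" would have to choose between two equidistant proteins, which shows why more complex events need additional criteria.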

Kilicoglu and Bergler [93] present another rule-based approach, where identification of the event participants and corresponding roles (e.g., Theme or Cause) is primarily achieved based on a grammar created from dependency relations between event trigger expressions and event arguments in the training corpus. This strategy is based on the Stanford syntactic parser [98], which was applied to automatically extract dependency relation paths between event triggers and their corresponding event arguments. These paths were manually filtered, preserving only the correct and sufficiently general ones.

Le Minh et al. [70] follow a similar strategy by generating pattern lists from training data using the dependency graphs resulting from application of a deep syntactic parser.

Bui et al. [99] present one of the most recent studies based on dictionaries and patterns automatically generated from a training set. In this work, less than one minute was required to process a training set composed of about 950 abstracts on a computer with 4 gigabytes of memory, illustrating a main advantage of rule-based systems. Unfortunately, despite the low computational requirements, this kind of approach usually shows modest performance in terms of recall, due to the difficulty in modeling more complex relationships and in defining rules capable of generalizing.

3.5.2 Machine Learning-Based Approach to Edge Detection. In recent years, similarly to trigger detection, there has been a clear tendency to approach the edge detection task using machine learning methods. Most works agree on addressing this problem as a supervised multiclass classification problem by defining a limited number of edge classes.

As can be seen in Table 3, most approaches are based on SVMs. Miwa et al. [87] presented one such approach, dividing the task into two different classification problems: edge detection between two triggers and edge detection between a trigger and a protein. For this purpose a set of annotated instances is constructed from a training set, as follows: for each event found in the training set, a list of annotated edges is constructed using as label the combination of the corresponding event class and the edge type (e.g., Binding: Theme). Using these extracted annotated edges, an unbalanced classification problem is then solved using one-versus-rest linear SVMs. Björne et al. [64] and Wang et al. [65] followed similar approaches, using multiclass SVMs in which two kinds of edges are annotated: trigger-trigger and trigger-protein. Each example is classified as Theme, Cause, or Negative, denoting the absence of an edge between the two nodes. Each edge is predicted independently, so that the classification is not affected by positive or negative classification of other edges.
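
To make the classification setup concrete, the sketch below enumerates the trigger-protein and trigger-trigger candidate edges of a sentence and labels each one as Theme, Cause, or Negative from gold annotations. It only illustrates how training instances could be built; the function names are hypothetical, nodes are identified by their text alone (which would not distinguish repeated words in a real sentence), and the classifier itself is omitted.

```python
from typing import Dict, List, Tuple

Node = Tuple[str, str]   # (kind, text), where kind is "trigger" or "protein"
Edge = Tuple[Node, Node]  # directed: (source trigger, target node)


def candidate_edges(triggers: List[str], proteins: List[str]) -> List[Edge]:
    """Every edge starts at a trigger and ends at a protein or another trigger."""
    edges: List[Edge] = []
    for source in triggers:
        edges += [(("trigger", source), ("protein", p)) for p in proteins]
        edges += [(("trigger", source), ("trigger", t)) for t in triggers if t != source]
    return edges


def label_edges(candidates: List[Edge], gold: Dict[Edge, str]) -> List[str]:
    """Use the gold role when the edge is annotated, otherwise 'Negative'."""
    return [gold.get(edge, "Negative") for edge in candidates]


# Fragment of Figure 1: "TNF-alpha is a rapid activator of IL-8 gene expression"
candidates = candidate_edges(["activator", "expression"], ["TNF-alpha", "IL-8"])
gold = {
    (("trigger", "activator"), ("protein", "TNF-alpha")): "Cause",
    (("trigger", "activator"), ("trigger", "expression")): "Theme",
    (("trigger", "expression"), ("protein", "IL-8")): "Theme",
}
labels = label_edges(candidates, gold)
```

In this small example half of the six candidates are Negative; in longer sentences with many triggers and proteins the Negative class dominates, which is the class imbalance mentioned above.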

Roller and Stevenson [68] evaluated a similar strategy, using a polynomial kernel. The classification of the relations is carried out in three stages. The first consists of the identification of basic events by defining the trigger and a theme referring to a protein; the second stage seeks to identify regulation events by defining the trigger and a theme referring to a trigger from a previously identified basic event; and the final stage tries to identify additional arguments.

Table 3: Most relevant work addressing the problem of edge detection. Studies are listed in chronological order and the different approaches are classified in three main groups: rule-based, dictionary-based, and ML-based strategies.

Hakala et al. [91] proposed a reranking approach that uses the prediction scores of a first SVM classifier and information about the event structure as inputs for a new SVM model focused on optimizing the ranking of the predicted edges. For this new model, polynomial and radial basis kernels were evaluated, showing an improvement in the overall precision of the system.

A different strategy was used by Zhou and He [89], who proposed a method based on a Hidden Vector State model, called HVS-BioEvent. Although this method presented lower performance in basic events, compared to systems based on SVM classifiers, it achieved better performance in complex events due to the hierarchical hidden state structure. This structure is indeed more suitable for complex event extraction since it can naturally model embedded structural context in sentences.

Van Landeghem et al. [74] proposed an approach that processes each type of event in parallel using binary SVMs. All predictions are assembled in an integrated graph, on which heuristic postprocessing techniques are applied to ensure global consistency. Linear and radial basis function (RBF) kernels were evaluated by performing parameter tuning via 5-fold cross-validation. Van Landeghem et al. also explored feature selection: they applied fully automated feature selection techniques aimed at identifying a subset of the most relevant features from a large initial set of features. An analysis of the results showed that up to 50% of all features can be removed without losing more than one percentage point in F-score, while at the same time creating faster classification models.

3.5.3 Hybrid Approaches. In the literature, we can find many studies that combine ML-based with rule-based and dictionary-based strategies. This combination is often performed in two ways: (1) in an ensemble strategy, each method is performed independently and the final output is obtained by combining the results of each method, either through rules or by using some classification or regression model; and (2) in a stacked strategy, the output of one method is used as input for the following one, which performs a filtering and refining process to produce a more accurate final output.

As an example of the first kind of approach, Pham et al. [100] proposed a hybrid system that combines both rule-based and machine learning-based approaches. In this method, the final list of predicted events is given by the combination of the events extracted by rule-based methods based on syntactic and dependency graphs and those extracted via SVM classifiers. In the second kind of approach, several studies [68, 80, 97] have used a rule-based postprocessing step to refine the initial resulting graph generated by ML-based classifiers by eliminating duplicate nodes and separating their edges into valid combinations based on the syntax of the sentences and the conditions in argument type combinations, taking into account the characteristics and peculiarities of each kind of event.
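The difference between the ensemble and the stacked strategies can be made concrete with a small sketch. The code below is only an illustration under simplifying assumptions: events are represented as hypothetical `(event_type, trigger, theme)` tuples, the ensemble is merged with a plain union rule, and the stacked refinement uses a single made-up validity rule (the theme must be a known protein). None of this reproduces the systems cited above.

```python
from typing import Callable, Iterable, List, Set, Tuple

# Hypothetical event representation: (event_type, trigger_word, theme)
Event = Tuple[str, str, str]


def ensemble_union(rule_events: Iterable[Event], ml_events: Iterable[Event]) -> Set[Event]:
    """Ensemble strategy: both methods run independently and their
    outputs are merged afterwards (here with a simple union rule)."""
    return set(rule_events) | set(ml_events)


def stacked_refine(ml_events: Iterable[Event], is_valid: Callable[[Event], bool]) -> List[Event]:
    """Stacked strategy: the output of the first method is filtered and
    refined by a second, rule-based step (duplicates and invalid events removed)."""
    seen: Set[Event] = set()
    refined: List[Event] = []
    for event in ml_events:
        if event in seen:
            continue  # eliminate duplicates
        seen.add(event)
        if is_valid(event):
            refined.append(event)
    return refined


# Toy data (assumed, not taken from any corpus)
proteins = {"TRAF2", "CD40"}
rule_events = [("phosphorylation", "phosphorylation", "TRAF2")]
ml_events = [
    ("phosphorylation", "phosphorylation", "TRAF2"),
    ("binding", "binding", "CD40"),
    ("binding", "binding", "CD40"),            # duplicate prediction
    ("gene expression", "expression", "cells"),  # theme is not a protein
]

merged = ensemble_union(rule_events, ml_events)
refined = stacked_refine(ml_events, lambda e: e[2] in proteins)
```

In the ensemble case the rule-based and ML outputs are treated symmetrically, whereas in the stacked case the rules only ever see what the classifier has already proposed, so they can raise precision but cannot recover events the first stage missed.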

3.5.4 Structured Prediction and Joint Models. To address the potential cascading errors that originate from the two-stage approaches described above, some authors have proposed the joint prediction of triggers, event participants, and connecting edges. Riedel et al. [101] and Poon and Vanderwende [102] proposed two methods based on Markov logic. Markov logic is an extension to first-order logic in which a probabilistic weight is attached to each clause [103]. Instead of using the relational structures over event entities, as represented in Figure 4, Riedel et al. represent these as labeled links between tokens of the sentence and apply link prediction over token sequences. As stated by the authors, this link-based representation simplifies the design of the Markov Logic Network (MLN). Poon and Vanderwende, on the other hand, used Markov logic to model the dependency edges obtained with the Stanford dependency parser. The resulting MLN therefore jointly predicts if a token is a trigger word, the corresponding event type, and which of the token’s dependency edges connect to (Theme or Cause) event arguments. This allows using a simpler set of features in the MLN, which leads to a more computationally efficient solution without sacrificing the prediction performance. The authors used heuristics to fix two typical parsing errors, namely, prepositional phrase attachment and coordination, and showed that this had an important impact on the final results.
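For reference, the way weights attached to clauses induce a probability distribution in a Markov Logic Network is usually written as the log-linear model below. The notation is an assumption introduced here, not taken from the reviewed papers: $w_i$ is the weight of clause $i$, $n_i(x)$ is the number of true groundings of that clause in the possible world $x$, and $Z$ is the normalizing constant.

```latex
P(X = x) = \frac{1}{Z} \exp\left( \sum_{i} w_i \, n_i(x) \right),
\qquad
Z = \sum_{x'} \exp\left( \sum_{i} w_i \, n_i(x') \right)
```

A world that violates a clause is therefore not impossible, only less probable, and the larger the weight $w_i$, the stronger the penalty for violating clause $i$.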

Riedel and McCallum [104] proposed another approach in which the problem is decomposed into three submodels: one for extracting event triggers and outgoing edges, one for event triggers and incoming edges, and one for protein-protein bindings. The optimization methods for the three submodels are combined via dual decomposition [105], with three types of constraints enforced to achieve a joint prediction model. Links between tokens are represented through a set of binary variables, as in Riedel et al. [101].
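The general idea of combining submodels via dual decomposition can be illustrated on a toy problem. The sketch below is not the model of Riedel and McCallum: it assumes only two hypothetical submodels that each score the same binary variables (for example, "token i is a trigger"), and it uses subgradient updates with a 1/t step size to adjust penalties until both submodels agree.

```python
from typing import List, Tuple


def solve_submodel(scores: List[float]) -> List[int]:
    """Toy submodel: independently pick label 1 wherever the
    (penalty-adjusted) score is strictly positive."""
    return [1 if s > 0 else 0 for s in scores]


def dual_decomposition(
    scores_a: List[float],
    scores_b: List[float],
    max_iterations: int = 100,
) -> Tuple[List[int], bool]:
    """Find labels on which two submodels agree by iteratively tuning penalties.

    Returns the labels of submodel A and a flag telling whether agreement
    was reached within max_iterations.
    """
    penalties = [0.0] * len(scores_a)
    labels_a: List[int] = []
    for t in range(1, max_iterations + 1):
        # Each submodel is solved separately with the current penalties.
        labels_a = solve_submodel([a + u for a, u in zip(scores_a, penalties)])
        labels_b = solve_submodel([b - u for b, u in zip(scores_b, penalties)])
        if labels_a == labels_b:
            return labels_a, True  # all individual solutions are consistent
        # Subgradient step: push the submodels towards each other
        # on every variable where they disagree.
        step = 1.0 / t
        penalties = [
            u - step * (x - y) for u, x, y in zip(penalties, labels_a, labels_b)
        ]
    return labels_a, False


# Assumed toy scores for three candidate tokens
labels, agreed = dual_decomposition([2.0, -1.0, 0.5], [-0.5, 0.2, -2.0])
```

With these scores the two submodels initially disagree on every variable; after one penalty update they both select only the first token, which is also the assignment that maximizes the summed scores.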

McClosky et al. [98] proposed a different approach, in which event structures are converted into dependencies between event triggers and event participants. Various dependency parsers are trained using features from these dependency trees as well as features extracted from the original sentences. In the recognition phase, the parsing results are converted back to event structures and ranked by a maximum-entropy reranker component.

Vlachos and Craven [106] applied the search-based structured prediction framework (SEARN) to the problem of event extraction. This approach decomposes event extraction into jointly learning classifiers for a set of classification tasks, in which each model can incorporate features that represent the predictions made by the other ones. Moreover, the loss function incorporates all predictions, which means that the models are jointly learned and a structured prediction is achieved. For this specific task, models were trained to classify each token as a trigger or not and to classify each possible pair of trigger-theme and trigger-cause in a sentence.

3.6 Modality Detection. Modality detection refers to the crucial task of identifying negations and speculations [107]. The aim of this task is to avoid opposite meanings and to distinguish when a sentence can be interpreted as subjective or as a nonfactual statement. The detection of speculations (also referred to as hedging) in the biomedical literature has been the focus of several recent studies, since the ability to distinguish between factual and uncertain information is of vital importance for any information extraction task [108].

In many approaches, modality detection is addressed as an extra phase following the edge detection process. Most approaches address this problem in two steps: first, speculation/negation cues (which may be words such as “may,” “might,” “suggest,” “suspect,” and “seem”) are detected, and, next, the scope of the cues is analyzed. Most of the initial systems were rule-based and relied on lexical or syntactic information, but recent studies have looked at solving this problem using binary classifiers [64, 78, 85] trained with generated instances annotated as negation, speculation, or negative (see Table 4).
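The two-step, rule-based scheme (cue detection followed by scope analysis) can be sketched as follows. This is a deliberately crude illustration: the speculation cues come from the examples in the text plus two inflected forms, the negation cues are an assumption, and the scope rule (a cue affects any trigger that follows it in the same sentence) is a hypothetical simplification of what real systems do with syntactic information.

```python
from typing import List, Set

# Speculation cues from the examples above, plus assumed inflected forms
SPECULATION_CUES = {"may", "might", "suggest", "suggests", "suspect", "seem", "seems"}
# Negation cues are an assumption made for this sketch
NEGATION_CUES = {"not", "no"}


def detect_modality(tokens: List[str], trigger_index: int) -> Set[str]:
    """Return the modality labels that apply to the event trigger at trigger_index.

    Step 1 finds cue words; step 2 applies a naive scope rule:
    a cue only affects triggers that appear after it in the sentence.
    """
    labels: Set[str] = set()
    for position, token in enumerate(tokens):
        if position >= trigger_index:
            break  # cues after the trigger are outside the assumed scope
        word = token.lower()
        if word in SPECULATION_CUES:
            labels.add("speculation")
        elif word in NEGATION_CUES:
            labels.add("negation")
    return labels


sentence = "These results suggest that TRAF2 does not bind CD40".split()
labels = detect_modality(sentence, sentence.index("bind"))
```

For the example sentence, the binding event triggered by "bind" is marked as both speculated ("suggest") and negated ("not"), while a plain statement such as "TRAF2 binds CD40" receives no label.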

4 Comparison of Existing Methods

In this section we present a comparative analysis of the different approaches and systems described in this review. To achieve a consistent comparison, we use the results achieved by the different systems on the standard datasets from the BioNLP shared tasks on event extraction [51, 52, 109]. These datasets provide a direct point of comparison and are commonly used to validate and evaluate new approaches and developments, which endorses their use in this comparative analysis. The datasets are based on the GENIA corpus [53], consisting of a training set with 800 abstracts and a development set with 150 abstracts. The test data, composed of 260 abstracts, comes from an unpublished portion of the corpus. For the second edition of the challenge, this initial dataset was extended with 15 full-text articles, equally divided into training, development, and test portions. Evaluation is performed with standard recall, precision, and 𝐹-score metrics.
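As a reminder of how these metrics relate, the sketch below computes precision, recall, and the balanced 𝐹-score (their harmonic mean) from raw counts, and recomputes an 𝐹-score from a recall/precision pair. The function names are chosen for this illustration; the shared task's official evaluation additionally defines matching criteria for events that are not modeled here.

```python
from typing import Tuple


def precision_recall_f(true_pos: int, false_pos: int, false_neg: int) -> Tuple[float, float, float]:
    """Precision, recall and balanced F-score from raw counts."""
    predicted = true_pos + false_pos  # events returned by the system
    gold = true_pos + false_neg       # events in the gold annotation
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / gold if gold else 0.0
    return precision, recall, f_score(recall, precision)


def f_score(recall: float, precision: float) -> float:
    """Harmonic mean of recall and precision (works for fractions or percentages)."""
    total = recall + precision
    return 2 * recall * precision / total if total else 0.0


# A recall of 50.00% and a precision of 67.53% give an F-score of about 57.46%
example = round(f_score(50.00, 67.53), 2)
```

Because the harmonic mean is dominated by the smaller of the two values, a system with high precision but low recall still obtains a modest 𝐹-score.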


Table 4: Modality detection. Most relevant work addressing the problem of modality detection, classified into rule-based, dictionary-based, and ML-based strategies.

4.1 BioNLP Shared Task on Event Extraction. The BioNLP shared task series is the main community-wide effort to address the problem of event extraction, providing a standardized dataset and evaluation setting to compare and verify the evolution in performance of different methods. Since its initial organization in 2009, the BioNLP-ST series has defined a number of fine-grained information extraction (IE) tasks motivated by bioinformatics projects. In this analysis, we focus on the main task, GENIA Event Extraction (GE). This task focuses on the recognition of biomolecular events defined in the GENIA Event Ontology, from scientific abstracts or full papers. Since the first edition, three separate subtasks have been defined, each addressing event extraction with a different level of specificity.

Task 1 (core event extraction): identification of trigger words associated with nine event types related to protein biology. The annotation of protein occurrences in the text, used as arguments for event triggers, is provided in both the training and the test sets.

Task 2 (event enrichment): recognition of secondary arguments that further specify the events extracted in Task 1.

Task 3 (negation/speculation detection): detection of negations and speculation statements concerning extracted events.

4.1.1 Target Event Types. The shared task defined a subset of nine biomolecular events from the GENIA Event Ontology, classified in three kinds with different levels of complexity: basic events, binding events, and regulation events. Basic events are the simplest to fully resolve, because these only require the specification of a primary argument. Five types of events are categorized in this group: gene expression, transcription, protein catabolism, phosphorylation, and localization. Binding events, on the other hand, require the detection of at least two arguments. Finally, regulation events, including Negative and Positive Regulation, are the most difficult to fully specify, because these involve the definition of another argument, which may be an entity or another event, requiring identification of a recursive structure.
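The recursive nature of regulation events is easier to see in a data structure. The following sketch is an assumption-laden illustration, not the shared task's file format: the class names, the lowercase type labels, and the well-formedness rules simply mirror the description in the paragraph above.

```python
from dataclasses import dataclass
from typing import List, Optional, Union

BASIC_TYPES = {"gene expression", "transcription", "protein catabolism",
               "phosphorylation", "localization"}
REGULATION_TYPES = {"regulation", "positive regulation", "negative regulation"}


@dataclass
class Protein:
    name: str


@dataclass
class Event:
    event_type: str
    trigger: str
    themes: List[Union["Event", Protein]]
    cause: Optional[Union["Event", Protein]] = None


def is_well_formed(event: Event) -> bool:
    """Check the argument constraints described in the text (simplified)."""
    if event.event_type in BASIC_TYPES:
        # one primary argument, which must be a protein
        return len(event.themes) == 1 and isinstance(event.themes[0], Protein)
    if event.event_type == "binding":
        return len(event.themes) >= 2 and all(isinstance(t, Protein) for t in event.themes)
    if event.event_type in REGULATION_TYPES:
        # the theme (and optional cause) may be a protein or another event
        return len(event.themes) == 1
    return False


def nesting_depth(argument: Union[Event, Protein]) -> int:
    """0 for a protein, 1 for a flat event, more for events over events."""
    if isinstance(argument, Protein):
        return 0
    children = list(argument.themes)
    if argument.cause is not None:
        children.append(argument.cause)
    return 1 + max((nesting_depth(child) for child in children), default=0)


# "Phosphorylation of TRAF2 inhibits binding to CD40" (illustrative analysis)
traf2, cd40 = Protein("TRAF2"), Protein("CD40")
phosphorylation = Event("phosphorylation", "Phosphorylation", [traf2])
binding = Event("binding", "binding", [traf2, cd40])
regulation = Event("negative regulation", "inhibits", [binding], cause=phosphorylation)
```

Here the negative regulation event takes one event as its theme and another as its cause, which is exactly the kind of nested structure that makes regulation events harder to extract than basic or binding events.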

4.2 Comparative Analysis

4.2.1 Core Event Extraction. Table 5 summarizes the performance achieved by the most representative strategies addressing the core event extraction subtask (Task 1). The best results achieved during the first edition of the BioNLP-ST were obtained through machine learning techniques, formulating the problems of trigger and edge detection as different multiclass classification problems, solved by using linear SVM classifiers [86]. Using the same approach, Miwa et al. [87] reported improvements over these results by adding a set of shortest path features between triggers and proteins for the edge detection problem. As can be observed from the table, a considerable improvement was obtained for binding events, with an increase of over 12 percentage points in recall and 3 points in precision.
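To give an idea of what a shortest path feature between a trigger and a protein looks like, the sketch below runs a breadth-first search over a dependency graph treated as undirected. The dependency edges are hand-written assumptions for a toy sentence, not parser output, and the feature definition is an illustration rather than the exact feature set of Miwa et al.

```python
from collections import deque
from typing import Dict, List, Optional, Set, Tuple


def shortest_path(edges: List[Tuple[str, str]], source: str, target: str) -> Optional[List[str]]:
    """Breadth-first search for the shortest path between two tokens
    in a dependency graph, ignoring edge direction."""
    graph: Dict[str, Set[str]] = {}
    for head, dependent in edges:
        graph.setdefault(head, set()).add(dependent)
        graph.setdefault(dependent, set()).add(head)
    if source not in graph or target not in graph:
        return None
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for neighbour in sorted(graph[node]):  # sorted for deterministic output
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None


# Assumed (head, dependent) edges for "TRAF2 phosphorylation inhibits binding to CD40"
edges = [
    ("inhibits", "phosphorylation"),
    ("phosphorylation", "TRAF2"),
    ("inhibits", "binding"),
    ("binding", "CD40"),
]

path = shortest_path(edges, "binding", "TRAF2")
path_length = len(path) - 1 if path else None
```

Both the sequence of tokens on the path and its length can then be turned into features for the classifier that decides whether an edge between the trigger "binding" and the protein "TRAF2" should be created.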

In BioNLP-ST 2011, the datasets were extended to include full-text articles, but the abstract collection used for the first edition was maintained in order to measure the progress between the two editions. The best result in the second edition, an 𝐹-score of 57.46% when considering only the abstracts, was obtained by the FAUST system. This corresponds to a substantial increase of more than four percentage points over the previous best system, resulting from an improvement in the recognition of simple events but especially from a much better recognition of complex regulation events, with an increase of over 11 percentage points in precision and 3 points in recall.

The FAUST system consists of a stacked combination of two models: the Stanford event parser [98] was used for constructing dependency trees that were then used as additional input features for the second model, the UMass model [104]. The main distinction of the UMass model is that it performs joint prediction of triggers, arguments, and event structures, therefore overcoming the cascading errors that occur in the common pipeline approaches when, for example, a trigger is not correctly predicted in the first stage [111]. In this model, the problem of event extraction is divided into smaller, simpler subproblems that are solved individually, with each subproblem presenting a set of penalties that are added to an objective function. The final solution is found via an iterative tuning of the penalties until all individual solutions are consistent with each other. When used separately, the UMass model achieved the second best results in this edition and was the top performing system when considering just full-texts.

Table 5: Core event extraction performance comparison. BioNLP shared task comparison results in recall/precision/F-score (%) on the test set for Task 1 (core event extraction): (A) abstracts only and (F) full papers. Data extracted from BioNLP-ST 2009, BioNLP-ST 2011, and BioNLP-ST 2013 overviews [51, 52, 109].

Edition  System                               Data  Simple events      Binding events     Regulation events  Overall
2011     FAUST                                (A)   66.16/81.04/72.85  45.53/58.09/51.05  39.38/58.18/46.97  50.00/67.53/57.46
                                              (F)   75.58/78.23/76.88  40.97/44.70/42.75  34.99/48.24/40.56  47.92/58.47/52.67
2011     UMass, Riedel and McCallum [104]     (A)   64.21/80.74/71.54  43.52/60.89/50.76  38.78/55.07/45.51  48.74/65.94/56.05
                                              (F)   75.58/83.14/79.18  41.67/47.62/44.44  34.72/47.51/40.12  47.84/59.76/53.14
2013     EVEX, Hakala et al. [91]             (F)   73.83/79.56/76.59  41.14/44.77/42.88  32.41/47.16/38.41  45.44/58.03/50.97
2013     TEES-2.1, Björne and Salakoski [97]  (F)   74.19/79.64/76.82  42.34/44.34/43.32  33.08/44.78/38.05  46.17/56.32/50.74
2013     BioSEM, Bui et al. [99]              (F)   67.71/86.90/76.11  47.45/52.32/49.76  28.19/49.06/35.80  42.47/62.83/50.68

In its third edition, BioNLP-ST focused on simulating a more realistic scenario. For this reason, a new dataset was constructed using only recent full papers, so that the extracted information could represent up-to-date knowledge of the domain. Unfortunately, the collection of abstracts used in the first two editions (BioNLP-ST 2009 and BioNLP-ST 2011) was removed from the official evaluation and the full-text collection used in the 2011 edition corresponds only to a small part of the dataset used in this edition, making it difficult to compare against previous results and measure the progress of the community.

In this latest edition of the shared task, the best-performing systems were EVEX [91] and TEES [97]. TEES, an evolution of the UTurku system and also mainly based on SVM classifiers, introduces an automated annotation scheme learning system that derives task-specific event rules and constraints from the training data. In turn, EVEX is a combined system that takes the outputs predicted by TEES and tries to reduce false positives by applying a reranking step that assigns a numerical score to each event and removing all events that score below a defined threshold. For this reranking, SVMrank is used with a set of features based on confidence scores (e.g., maximum/minimum trigger confidence and maximum/minimum argument confidence, among others) and features describing the structure of the event (e.g., event type of the root trigger and paths in the event from root to arguments, among others). This reranking and filtering approach provided a small overall improvement, achieved through better precision on regulation events, which constitute a substantial fraction of the annotated data [105].
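The score-then-filter step can be sketched in a few lines. The code below is only an illustration of the general idea: a hand-set linear scorer over two confidence features stands in for the trained SVMrank model, and the field names, weights, and threshold are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical event record: trigger confidence plus one confidence per argument
EventRecord = Dict[str, object]


def confidence_score(event: EventRecord) -> float:
    """Stand-in for a learned ranking model: a fixed linear combination of
    the trigger confidence and the minimum argument confidence."""
    trigger_conf = float(event["trigger_confidence"])
    min_arg_conf = min(event["argument_confidences"])
    return 0.5 * trigger_conf + 0.5 * min_arg_conf


def rerank_and_filter(
    events: List[EventRecord],
    score: Callable[[EventRecord], float],
    threshold: float,
) -> List[EventRecord]:
    """Score every predicted event, drop those below the threshold,
    and return the rest ordered from highest to lowest score."""
    scored = [(score(event), event) for event in events]
    kept = [(s, event) for s, event in scored if s >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [event for _, event in kept]


predicted = [
    {"id": "E1", "trigger_confidence": 0.9, "argument_confidences": [0.8, 0.7]},
    {"id": "E2", "trigger_confidence": 0.4, "argument_confidences": [0.3]},
    {"id": "E3", "trigger_confidence": 0.6, "argument_confidences": [0.9]},
]

kept_ids = [event["id"] for event in rerank_and_filter(predicted, confidence_score, 0.5)]
```

Raising the threshold removes more low-confidence events, which trades recall for precision; this is consistent with the precision gain reported for the reranking approach.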

BioSEM [99], a rule-based system based on patterns automatically derived from annotated events, also achieved high performance results, with only marginal differences to the machine learning approaches described above. BioSEM learns patterns of relations between an event trigger and its arguments defined at three different levels: chunk, phrase, and clause. Notably, this system presents significantly greater precision than ML-based systems, especially for simple and binding events, with improvements of more than seven percentage points. While in the case of simple events this was accompanied by a decrease in recall, for binding events this rule-based system achieved the best results, with a difference of over six percentage points in 𝐹-score. These results indicate that although ML methods still produce the best generalization, rule-based systems can approximate those results with much better precision, which further suggests combining the two approaches.

4.2.2 Event Enrichment. Table 6 shows the results obtained in the BioNLP-ST Task 2, which consists of the recognition of secondary event arguments. These secondary arguments depend on the type of event and include Location arguments (i.e., AtLoc or ToLoc), which define the source or destination of an event, and Site arguments (i.e., Site or Csite), which indicate domains or regions to better specify the Theme or Cause of an event. The settings of this subtask changed between editions, not only in terms of the dataset used, but also in terms of the sites to be predicted as secondary arguments. This means that the results shown in the table are not directly comparable, particularly for the last edition of the challenge, in which sites for different protein modification and regulation events were also considered. Nevertheless, these results were included for reference.


Table 6: Event enrichment performance comparison. BioNLP shared task comparison results in recall/precision/F-score (%) on the test set for Task 2 (event enrichment): (A) abstracts only and (F) full papers. Data extracted from BioNLP-ST 2009, BioNLP-ST 2011, and BioNLP-ST 2013 overviews [51, 52, 109].

System                                Data  Recall/precision/F-score (%)
                                      (A)   43.51/71.25/54.03  36.92/77.42/50.00  41.33/72.97/52.77
                                      (F)   17.58/69.57/28.07  —                  17.39/66.67/27.59
UMass, Riedel and McCallum [104] (b)  (A)   42.75/70.00/53.08  36.92/77.42/50.00  40.82/72.07/52.12
                                      (F)   16.48/75.00/27.03  —                  16.30/75.00/26.79

a Only phosphorylation sites were considered.

b The results are for overall binding and phosphorylation sites.

c The task included the prediction of sites for other protein modification and regulation events.

Table 7: Negation and speculation detection performance comparison. BioNLP shared task comparison results in recall/precision/F-score (%) on the test set for Task 3 (negation/speculation detection): (A) abstracts only and (F) full papers only. Data extracted from BioNLP-ST 2009, BioNLP-ST 2011, and BioNLP-ST 2013 overviews [51, 52, 109].

System                                  Data  Negation           Speculation        Overall
UTurku [64, 77]                         (A)   22.03/49.02/30.40  19.23/38.46/25.64  20.69/43.69/28.08
                                        (F)   25.76/48.28/33.59  15.00/23.08/18.18  19.28/30.85/23.73
ConcordU11, Kilicoglu and Bergler [93]  (A)   18.06/46.59/26.03  23.08/40.00/29.27  20.46/42.79/27.68
                                        (F)   21.21/38.24/27.29  17.00/34.69/22.82  18.67/36.14/24.63

Considering the analysis of abstracts, Table 6 shows an evident improvement in the results achieved by the top performing systems from the first to the second edition. More interestingly, there is a considerable difference between the results achieved over full-texts and the results obtained on abstracts. This is an indication that, as expected, the language used for describing the events is much more complex in the main body of the articles, where events are specified in more detail, than in the abstracts. Moreover, while the events are predicted with acceptable levels of precision, the recall is much lower, especially in full-texts.

4.2.3 Negation and Speculation Detection. Table 7 shows the best-performing systems in Task 3, corresponding to the identification of negations and speculations. In the second edition only two teams participated in this task, both presenting an important improvement over the best result of 2009 (ConcordU09 [84]), with UTurku [64, 77] showing a better performance in extracting negated events, and ConcordU11 [93] showing a better performance in extracting speculated events and better overall results in terms of full-texts. As can be directly seen from the lower precision and recall rates achieved, this task is considerably more difficult than the extraction of secondary arguments. Although the dataset is different, preventing direct comparison, the results achieved for full-texts in the last edition of the task were similar to the previous results.

5 Discussion and Future Research Directions

Biomolecular event extraction consists of identifying alterations in the state of a biomolecule or interactions between two or more biomolecules, described in natural language text in the scientific literature. These events constitute the building blocks of biological processes and functions, and automatically mining their descriptions has the potential of providing insights for the understanding of physiological and pathogenesis mechanisms. Event extraction has been addressed through multiple approaches, ranging from basic pattern matching and parsing techniques to machine learning methods.

Despite the steady progress shown over the last decade, the current state-of-the-art performance clearly shows that extracting events from biomedical literature still presents various challenges. While performance results close to 80% in 𝐹-score have been achieved in the recognition of simpler events, the extraction of more complex events such as binding and regulation events is still limited. Although substantial efforts have been made for the recognition of these events, the best performance achieved remains 30%–40% lower than that for simple events.

5.1 Patterns and Matching Rules versus Machine Learning-Based Approaches. Biomedical event extraction has been moving from purely rule-based and dictionary-based approaches towards ML-based solutions, due to the difficulty in creating sufficiently rich rules that capture the variability and ambiguity of natural language, leading to limited generalization capability and lower recall. Nonetheless, the automatic extraction of rules from annotated data may help in obtaining richer rules. In the third edition of the BioNLP-ST, for instance, the rule-based BioSEM system presented significantly higher precision than the best ML approaches, although with a lower recall.

On the other hand, and despite showing the best performance results in a shared task setting, machine learning approaches present important drawbacks, namely, their dependence on sufficiently large and high-quality training datasets. Another important limitation is that even if such a dataset exists, as in the case of evaluation tasks, its focus may be too restricted, which could mean that a model trained on these data would be well tuned for extracting information from similar documents but could become unusable in a slightly different domain. Many recent advances in this task have come from the combination of different systems and approaches. For example, rule-based systems have been applied to derive constraints from the manually annotated data that are then used to correct or filter the results of the machine learning-based event extraction. Another option is to combine the results of rule-based and ML-based methods in an ensemble approach.

5.2 Feature Selection and Feature Reduction. The feature extraction process generates a wide range of features of different nature. In many studies, the generation of the final data representation consists of extracting as many features as possible and integrating them in a basic way. This produces a high dimensional space that does not take into account multiple aspects regarding the nature of the data, such as redundancy, noisy information, or the complexity of its representation space. Although some studies have tried to address this problem, this has mainly been from the point of view of reducing the dimensionality. Some works have shown that an analysis of the contribution of features and an appropriate selection of these can significantly reduce the computational requirements. For instance, Campos et al. [42] proposed a solution that chooses the features that better reflect the linguistic characteristics of the triggers for a particular event type; these features are automatically selected via an optimization problem. Also, Van Landeghem et al. [74] showed that a similar overall performance could be achieved using fewer than 50% of the originally extracted features. Another important consideration is that this reduction not only avoids extra processing time but also helps to avoid undesirable noise [92].
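A simple filter-style selection illustrates how half of a feature set can be discarded. The sketch below is not the method of Campos et al. or of Van Landeghem et al.: it assumes binary features stored as sets of hypothetical feature names, scores each feature by the absolute difference between its frequency in positive and negative instances, and keeps the top-ranked fraction.

```python
from typing import List, Set


def select_features(instances: List[Set[str]], labels: List[int], keep_fraction: float = 0.5) -> List[str]:
    """Rank binary features by how differently they occur in positive and
    negative instances, and keep the best keep_fraction of them."""
    positives = [x for x, y in zip(instances, labels) if y == 1]
    negatives = [x for x, y in zip(instances, labels) if y == 0]
    features = sorted(set().union(*instances))

    def rate(group: List[Set[str]], feature: str) -> float:
        # fraction of instances in the group where the feature is active
        return sum(feature in x for x in group) / len(group) if group else 0.0

    scores = {f: abs(rate(positives, f) - rate(negatives, f)) for f in features}
    # highest score first; ties broken alphabetically for deterministic output
    ranked = sorted(features, key=lambda f: (-scores[f], f))
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]


# Toy trigger candidates with assumed feature names; label 1 means "is a trigger"
instances = [
    {"stem=bind", "in_dictionary", "suffix=ing", "after_the"},
    {"stem=bind", "in_dictionary", "len>5"},
    {"capitalised", "suffix=ing", "len>5"},
    {"capitalised", "after_the"},
]
labels = [1, 1, 0, 0]

selected = select_features(instances, labels, keep_fraction=0.5)
```

In this toy data the three discarded features occur equally often in both classes, so removing them shrinks the representation without removing any discriminative signal, which is the effect the studies above report at a larger scale.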

5.3 Current Trends and Challenges. Most event extraction strategies split the problem into two main steps: a first step consisting of the identification of trigger words that indicate the events and a second step (edge detection) that fully specifies the events by adding the corresponding arguments. This makes trigger word detection a crucial task in event extraction, since the second step is commonly performed over the results of that process. In fact, some studies have shown that missing triggers cause about 70% of all errors in event detection [89]. To address these cascading errors, some authors have proposed the joint prediction of triggers and edges connecting these triggers to participants in the event [101, 102, 104, 106, 112]. As shown by the comparative results, this joint inference allowed the most significant advances in terms of prediction performance and constitutes the state-of-the-art approach for event detection. Structured prediction and jointly trained models have also been applied successfully in other biomedical information extraction tasks. Berant et al. [113], for example, used event extraction in order to improve fine-grained information extraction for question answering, applying the structured averaged perceptron algorithm to jointly extract the event triggers and arguments. Kordjamshidi et al. [114] applied structured prediction to the task of extracting information on bacteria and their locations (e.g., host organism) by jointly identifying mentions of entities, organisms, and habitats and the corresponding localization relationships. They used a set of local and contextual features for words and phrases and for pairs of phrases and trained structured SVMs for jointly extracting the information.

The use of postprocessing rules to filter and refine the results of model predictions has proved to be an essential step in event extraction. These rules are usually automatically obtained from annotated data and reflect restrictions or likelihoods for the creation of edges between triggers and participants in the construction of the events. On the other hand, the application of automatically extracted rules, on their own, has also shown positive results, as shown by the BioSEM system. The ensemble combination of this strategy with the results from ML models could provide a way of balancing the precision and recall of each approach.

While the initial efforts in this task focused on the analysis of abstracts, this greatly limits the amount of information that can be extracted and therefore the impact of these methods on downstream applications, such as question answering, network construction and curation, or knowledge discovery. The latest attempts have therefore focused on mining full-text documents but, as expected, the precision of event extraction using the full body is lower due to the more complex language used in the main text of the publications. Interestingly, the results obtained have shown that while the recognition of complex events becomes more difficult in full-texts, the recognition performance for simple events is higher.

Improving the extraction of complex events, particularly from full-text documents, either through rules, ML, or hybrid approaches, may depend on the amount and quality of the training data. However, the construction of a fully annotated large-scale dataset that covers the wide variety of linguistic patterns would be a very demanding and unfeasible task. To overcome this, repositories with large amounts of nonannotated data, such as PubMed, could be exploited by unsupervised and semisupervised machine learning methods to construct richer text representations that can better model complex relations between words. This is a very promising research direction due to the large amount of available data [1] but, unfortunately, very few studies try to take advantage of this unstructured information (i.e., raw text without annotations). Another interesting aspect that could also be further explored is the incorporation of domain information in resources such as dictionaries, thesauri, and ontologies. Related concepts and semantic relations obtained from these resources could be used to enrich the representation of textual instances or to aid in the generation of filtering and postprocessing rules.

Another major challenge for event extraction is related to coreferences and anaphoric expressions, which make the correct identification of event participants more difficult. This is a very active research field in computational linguistics and natural language processing and has also been widely studied in the specific case of biomedical text mining [75, 115, 116]. The second edition of the BioNLP-ST included coreference resolution as a supporting task, in which the best participants obtained results ranging from 55% to 73% in precision, for a recall varying between 19% and 22%. These results show that there is still much room for improvement in this area, which would also enhance the event extraction results.

In addition to the extraction of events, their respective types, and participants, a more complete specification of events requires the identification of additional arguments, such as specific binding sites, protein regions, or domains. This extraction of fine-grained information is inherently more difficult than the primary identification of events, as can be seen from the current state-of-the-art performance. However, this information is required if the automatically extracted events are to be used for constructing biological networks [2]. Similarly, the identification of negation and speculation, also addressed by various works and evaluated in the BioNLP-ST setting, still represents a very difficult challenge. Nonetheless, even if current limitations still hinder the direct extraction of reliable biological networks from scientific texts, the existing methods can serve as an efficient aid to accelerate the process of network extraction, when integrated in curation pipelines that allow simple and user-friendly revision, correction, and completion of the extracted information.

6 Conclusions

This paper presents a review of the state-of-the-art in biomolecular event extraction, which is a challenging task due to the ambiguity and variability of scientific documents, and the complexity of the biological processes described. Over the last decades a wide range of approaches have been proposed, ranging from basic pattern matching and parsing techniques to sophisticated machine learning methods. Current state-of-the-art methods use a stacked combination of models, in which the second model either uses rules to refine the initial predictions or applies reranking to select the best event structures. Additionally, the joint prediction of the full event structure, as opposed to a two- or three-stage approach, has been shown to produce improved results.

Important challenges still exist, namely, in the extraction of complex regulation events, in the resolution of coreferences, and in the identification of negation and speculation. Nonetheless, current methods can be used in text-mining-assisted curation pipelines, for network construction and population of knowledge bases.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[1] M. S. Simpson and D. Demner-Fushman, “Biomedical text mining: a survey of recent progress,” in Mining Text Data, pp. 465–517, Springer, New York, NY, USA, 2012.

[2] C. Li, M. Liakata, and D. Rebholz-Schuhmann, “Biological network extraction from scientific literature: state of the art and challenges,” Briefings in Bioinformatics, vol. 15, no. 5, pp. 856–

[4] S. Ananiadou, P. Thompson, R. Nawaz et al., “Event-based text mining for biology and functional genomics,” Briefings in Functional Genomics, vol. 14, no. 3, pp. 213–230, 2015.

[5] L. Hirschman, G. A. P. C. Burns, M. Krallinger et al., “Text mining for the biocuration workflow,” Database: The Journal of Biological Databases and Curation, vol. 2012, Article ID bas020, 2012.

[6] D. Campos, S. Matos, and J. L. Oliveira, “Current methodologies for biomedical named entity recognition,” in Biological Knowledge Discovery Handbook: Preprocessing, Mining, and Postprocessing of Biological Data, pp. 839–868, John Wiley & Sons, 2013.

[7] C. N. Arighi, Z. Lu, M. Krallinger et al., “Overview of the BioCreative III workshop,” BMC Bioinformatics, vol. 12, supplement 8, article S1, 2011.

[8] I. Segura-Bedmar, P. Martínez, and M. Herrero-Zazo, “Semeval-2013 task 9: extraction of drug-drug interactions from biomedical texts (DDIExtraction 2013),” in Proceedings of the 7th International Workshop on Semantic Evaluation (SemEval ’13), pp. 341–350, June 2013.

[9] S. Ananiadou, S. Pyysalo, J. Tsujii, and D. B. Kell, “Event extraction for systems biology by text mining the literature,” Trends in Biotechnology, vol. 28, no. 7, pp. 381–390, 2010.

[10] U. Hahn, K. B. Cohen, Y. Garten, and N. H. Shah, “Mining the pharmacogenomics literature—a survey of the state of the art,” Briefings in Bioinformatics, vol. 13, no. 4, pp. 460–494, 2012.

[11] J.-D. Kim, T. Ohta, and J. Tsujii, “Corpus annotation for mining biomedical events from literature,” BMC Bioinformatics, vol. 9, article 10, 2008.

Trang 37

[12] M Ashburner, C A Ball, J A Blake et al., “Gene ontology: tool

for the unification of biology,” Nature Genetics, vol 25, no 1, pp.

25–29, 2000

[13] K Sagae and J Tsujii, “Dependency parsing and domain

adapta-tion with LR models and parser ensembles,” in Proceedings of the

CoNLL Shared Task of EMNLP-CoNLL, pp 1044–1050, Prague,

Czech Republic, June 2007

[14] E Charniak and M Johnson, “Coarse-to-fine n-best parsing

and MaxEnt discriminative reranking,” in Proceedings of the

43rd Annual Meeting of the Association for Computational

Linguistics (ACL ’05), pp 173–180, June 2005.

[15] D McClosky, Any domain parsing: automatic domain

adapta-tion for natural language parsing [Ph.D thesis], Brown

Univer-sity, Providence, RI, USA, 2010

[16] D M Bikel, “Intricacies of collins’ parsing model,”

Computa-tional Linguistics, vol 30, no 4, pp 479–511, 2004.

[17] D Klein and C D Manning, “Accurate unlexicalized parsing,”

in Proceedings of the 41st Annual Meeting on Association for

Computational Linguistics (ACL ’03), vol 1, pp 423–430, ACM,

July 2003

[18] T Hara, Y Miyao, and J Tsujii, “Evaluating impact of

re-training a lexical disambiguation model on domain adaptation

of an HPSG parser,” in Proceedings of the 10th International

Conference on Parsing Technologies (IWPT ’07), pp 11–22,

Prague, Czech Republic, June 2007

[19] A A Copestake and D Flickinger, “An open source

gram-mar development environment and broad-coverage English

grammar using HPSG,” in Proceedings of the 2nd International

Conference on Language Resources and Evaluation (LREC ’00),

Athens, Greece, 2000

[20] Y Peng, C O Tudor, M Torii, C H Wu, and K Vijay-Shanker,

“iSimp in BioC standard format: enhancing the interoperability

of a sentence simplification system,” Database, vol 2014, Article

ID bau038, 2014

[21] Y Tsuruoka, Y Tateishi, J.-D Kim et al., “Developing a robust

part-of-speech tagger for biomedical text,” in Advances in

Informatics, vol 3746 of Lecture Notes in Computer Science, pp.

382–392, Springer, Berlin, Germany, 2005

[22] S Bird, E Klein, and E Loper, Natural Language Processing with

Python, O’Reilly Media, 2009.

[23] C D Manning, M Surdeanu, J Bauer, J Finkel, S Bethard,

and D McClosky, “The stanford corenlp natural language

processing toolkit,” in Proceedings of the 52nd Annual Meeting

of the Association for Computational Linguistics: System

Demon-strations, pp 55–60, Baltimore, Md, USA, June 2014.

[24] The opennlp project, 2005, http://opennlp.apache.org/index

[25] H Cunningham, V Tablan, A Roberts, and K Bontcheva,

“Getting more out of biomedical documents with GATE’s

full lifecycle open source text analytics,” PLoS Computational

Biology, vol 9, no 2, Article ID e1002854, 2013.

[26] Y Kano, W A Baumgartner, L McCrohon et al., “U-compare:

share and compare text mining tools with UIMA,”

Bioinformat-ics, vol 25, no 15, pp 1997–1998, 2009.

[27] D Campos, S Matos, and J L Oliveira, “Gimli: open source

and high-performance biomedical name recognition,” BMC

Bioinformatics, vol 14, article 54, 2013.

[28] NERsuite: A Named Entity Recognition toolkit, 2015, http://

nersuite.nlplab.org/

[29] C.-N Hsu, Y.-M Chang, C.-J Kuo, Y.-S Lin, H.-S Huang,

and I.-F Chung, “Integrating high dimensional bi-directional

parsing models for gene mention tagging,” Bioinformatics, vol.

24, no 13, pp i286–i294, 2008

[30] J Hakenberg, C Plake, R Leaman, M Schroeder, and G.Gonzalez, “Inter-species normalization of gene mentions with

GNAT,” Bioinformatics, vol 24, no 16, pp i126–i132, 2008.

[31] J Wermter, K Tomanek, and U Hahn, “High-performance gene

name normalization with GeNo,” Bioinformatics, vol 25, no 6,

pp 815–821, 2009

[32] R Klinger, C Kol´aˇrik, J Fluck, M Hofmann-Apitius, and C

M Friedrich, “Detection of IUPAC and IUPAC-like chemical

names,” Bioinformatics, vol 24, no 13, pp i268–i276, 2008.

[33] T Rockt¨aschel, M Weidlich, and U Leser, “Chemspot: a hybrid

system for chemical named entity recognition,” Bioinformatics,

[35] M Chowdhury and M Faisal, “Disease mention recognition

with specific features,” in Proceedings of the Workshop on

Biomedical Natural Language Processing, pp 83–90, Uppsala,

Sweden, July 2010

[36] R Leaman and G Gonzalez, “BANNER: an executable survey of

advances in biomedical named entity recognition,” in

Proceed-ings of the 13th Pacific Symposium on Biocomputing, pp 652–663,

January 2008

[37] B Settles, “ABNER: an open source tool for automatically

tagging genes, proteins and other entity names in text,”

Bioin-formatics, vol 21, no 14, pp 3191–3192, 2005.

[38] H Liu, Z.-Z Hu, J Zhang, and C Wu, “BioThesaurus: a

web-based thesaurus of protein and gene names,” Bioinformatics, vol.

22, no 1, pp 103–105, 2006

[39] Y Sasaki, S Montemagni, P Pezik, D Rebholz-Schuhmann, J.McNaught, and S Ananiadou, “BioLexicon: a lexical resource

for the biology domain,” in Proceedings of the 3rd International

Symposium on Semantic Mining in Biomedicine (SMBM ’08), pp.

109–116, September 2008

[40] O Bodenreider, “The unified medical language system (UMLS):

integrating biomedical terminology,” Nucleic Acids Research,

vol 32, pp D267–D270, 2004

[41] D Rebholz-Schuhmann, J.-H Kim, Y Yan et al., “Evaluationand cross-comparison of lexical entities of biological interest

(lexebi),” PLoS ONE, vol 8, no 10, Article ID e75185, 2013.

[42] D Campos, Q.-C Bui, S Matos, and J L Oliveira, “TrigNER:automatically optimized biomedical event trigger recognition

on scientific documents,” Source Code for Biology and Medicine,

vol 9, article 1, 2014

[43] Y Zhang, H Lin, Z Yang, J Wang, and Y Li, “Biomolecularevent trigger detection using neighborhood hash features,”

Journal of Theoretical Biology, vol 318, pp 22–28, 2013.

[44] C.-C Chang and C.-J Lin, “LIBSVM: a Library for support

vector machines,” ACM Transactions on Intelligent Systems and

Technology, vol 2, no 3, article 27, 2011.

[45] K Crammer and Y Singer, “On the algorithmic implementation

of multiclass kernel-based vector machines,” Journal of Machine

Learning Research, vol 2, pp 265–292, 2002.

[46] R.-E Fan, K.-W Chang, C.-J Hsieh, X.-R Wang, and C.-J

Lin, “LIBLINEAR: a library for large linear classification,” The

Journal of Machine Learning Research, vol 9, pp 1871–1874,

2008

[47] MALLET: A Machine Learning for Language Toolkit, 2002,http://mallet.cs.umass.edu

Trang 38

[48] T Kudo, “CRF++: Yet another CRF toolkit,” Software, 2005,

http://crfpp.sourceforge.net

[49] M M Stark and R F Riesenfeld, “Wordnet: an electronic lexical

database,” in Proceedings of the 11th Eurographics Workshop on

Rendering, p 21, Brno, Czech Republic, 1998.

[50] T Joachims, “Training linear SVMs in linear time,” in

Pro-ceedings of the 12th ACM SIGKDD International Conference on

Knowledge Discovery and Data Mining, pp 217–226, August

2006

[51] J D Kim, T Ohta, S Pyysalo et al., “Overview of BioNLP’09

shared task on event extraction,” in Proceedings of the Workshop

on Current Trends in Biomedical Natural Language Processing:

Shared Task (BioNLP ’09), pp 1–9, Association for

Computa-tional Linguistics, Boulder, Colo, USA, 2009

[52] J.-D Kim, S Pyysalo, T Ohta, R Bossy, N Nguyen, and J Tsujii,

“Overview of BioNLP shared task 2011,” in Proceedings of the

BioNLP Shared Task 2011 Workshop, pp 1–6, Association for

Computational Linguistics, Stroudsburg, Pa, USA, June 2011

[53] J.-D Kim, T Ohta, K Oda, and J.-I Tsujii, “From text to

pathway: corpus annotation for knowledge acquisition from

biomedical literature,” in Proceedings of the Asia-Pacific

Bioin-formatics Conference (APBC ’08), pp 165–176, Imperial College

Press, Kyoto, Japan, January 2008

[54] S Pyysalo, F Ginter, J Heimonen et al., “BioInfer: a corpus

for information extraction in the biomedical domain,” BMC

Bioinformatics, vol 8, article 50, 2007.

[55] P Thompson, S A Iqbal, J McNaught, and S Ananiadou,

“Construction of an annotated corpus to support biomedical

information extraction,” BMC Bioinformatics, vol 10, article

349, 2009

[56] E Buyko, E Beisswanger, and U Hahn, “The genereg corpus

for gene expression regulation events—an overview of the

corpus and its in-domain and out-of-domain interoperability,”

in Proceedings of the 7th International Conference on Language

Resources and Evaluation (LREC ’10), N Calzolari, K Choukri,

B Maegaard et al et al., Eds., p 1921, European Language

Resources Association (ELRA), Valletta, Malta, 2010

[57] The LLL corpus, 2015,

http://genome.jouy.inra.fr/texte/LLLchal-lenge/

[58] The AIMed corpus, 2015, ftp://ftp.cs.utexas.edu/pub/mooney/

bio-data/

[59] K Raghunathan, H Lee, S Rangarajan et al., “A multi-pass sieve

for coreference resolution,” in Proceedings of the Conference on

Empirical Methods in Natural Language Processing (EMNLP ’10),

pp 492–501, October 2010

[60] Y Peng, M Torii, C H Wu, and K Vijay-Shanker, “A

gener-alizable NLP framework for fast development of pattern-based

biomedical relation extraction systems,” BMC Bioinformatics,

vol 15, article 285, 2014

[61] R S T Y Miyao, K Sagae, T Matsuzaki, and J Tsujii,

“Task-oriented evaluation of syntactic parsers and their

represen-tations,” in Proceedings of the 46th Annual Meeting of the

Association for Computational Linguistics: Human Language

Technologies, Columbus, Ohio, USA, June 2008.

[62] S Pyysalo, T Ohta, M Miwa, H.-C Cho, J Tsujii, and S

Ana-niadou, “Event extraction across multiple levels of biological

organization,” Bioinformatics, vol 28, no 18, pp i575–i581, 2012.

[63] D Okanohara, Y Miyao, Y Tsuruoka, and J Tsujii, “Improving

the scalability of semi-Markov conditional random fields for

named entity recognition,” in Proceedings of the 21st

Interna-tional Conference on ComputaInterna-tional Linguistics and the 44th

Annual Meeting of the Association for Computational Linguistics,

pp 465–472, Association for Computational Linguistics, ney, Australia, 2006

Syd-[64] J Bj¨orne, F Ginter, and T Salakoski, “University of turku in the

bionlp’11 shared task,” BMC Bioinformatics, vol 13, supplement

11, article S4, 2012

[65] J Wang, Q Xu, H Lin, Z Yang, and Y Li, “Semi-supervised

method for biomedical event extraction,” Proteome Science, vol.

11, article S17, 2013

[66] S Riedel, R S˜atre, H.-W Chun, T Takagi, and J Tsujii,

“Bio-molecular event extraction with Markov logic,” Computational

Intelligence, vol 27, no 4, pp 558–582, 2011.

[67] L R McGrath, K Domico, C D Corley, and B.-J Robertson, “Complex biological event extraction from fulltext using signatures of linguistic and semantic features,” in

Webb-Proceedings of the BioNLP Shared Task 2011 Workshop, pp 130–

137, Association for Computational Linguistics, Portland, Ore,USA, June 2011

[68] R Roller and M Stevenson, “Identification of genia events using

multiple classifiers,” in Proceedings of the BioNLP Shared Task

2013 Workshop, pp 125–129, Association for Computational

Linguistics, Sofia, Bulgaria, August 2013

[69] D Campos, S Matos, and J L Oliveira, “A modular framework

for biomedical concept recognition,” BMC Bioinformatics, vol.

14, article 281, 2013

[70] Q Le Minh, S N Truong, and Q H Bao, “A pattern approach

for biomedical event annotation,” in Proceedings of the BioNLP

Shared Task 2011 Workshop, pp 149–150, Association for

Com-putational Linguistics, Stroudsburg, Pa, USA, 2011

[71] L Tanabe, N Xie, L H Thom, W Matten, and W J Wilbur,

“GENETAG: a tagged corpus for gene/protein named entity

recognition,” BMC Bioinformatics, vol 6, supplement 1, article

S3, 2005

[72] X Liu, A Bordes, and Y Grandvalet, “Biomedical event tion by multi-class classification of pairs of text entities,” in

extrac-Proceedings of the BioNLP Shared Task 2013 Workshop, pp 45–

49, Association for Computational Linguistics, Sofia, Bulgaria,August 2013

[73] D Martinez and T Baldwin, “Word sense disambiguation for

event trigger word detection in biomedicine,” BMC

Bioinfor-matics, vol 12, supplement 1, article S4, 2011.

[74] S Van Landeghem, B De Baets, Y de Peer, and Y Saeys,

“High-precision bio-molecular event extraction from text using

parallel binary classifiers,” Computational Intelligence, vol 27,

[76] A Vlachos, P Buttery, D ´O S´eaghdha, and T Briscoe,

“Biomed-ical event extraction without training data,” in Proceedings of the

Workshop on Current Trends in Biomedical Natural Language Processing: Shared Task, pp 37–40, Boulder, Colo, USA, 2009.

[77] J Bj¨orne and T Salakoski, “Generalizing biomedical event

extraction,” in Proceedings of the BioNLP Shared Task 2011

Workshop, pp 183–191, ACM, Portland, Ore, USA, June 2011.

[78] M Miwa, S Pyysalo, T Ohta, and S Ananiadou, “Widecoverage biomedical event extraction using multiple partially

overlapping corpora,” BMC Bioinformatics, vol 14, no 1, article

175, 2013

Trang 39

[79] H Kilicoglu and S Bergler, “Effective bio-event extraction using

trigger words and syntactic dependencies,” Computational

Intel-ligence, vol 27, no 4, pp 583–609, 2011.

[80] J Bj¨orne, F Ginter, S Pyysalo, J Tsujii, and T Salakoski,

“Complex event extraction at pubmed scale,” Bioinformatics,

vol 26, no 12, pp i382–i390, 2010

[81] G Zhou, J Zhang, J Su, D Shen, and C Tan, “Recognizing

names in biomedical texts: a machine learning approach,”

Bioinformatics, vol 20, no 7, pp 1178–1190, 2004.

[82] M Krallinger, O Rabal, F Leitner et al., “The CHEMDNER

corpus of chemicals and drugs and its annotation principles,”

Journal of Cheminformatics, vol 7, supplement 1, article S2, 2015.

[83] D Campos, S Matos, and J L Oliveira, “Biomedical named

entity recognition: a survey of machine-learning tools,” in

Theory and Applications for Advanced Text Mining, chapter 8,

pp 175–195, InTech, Rijeka, Croatia, 2012

[84] H Kilicoglu and S Bergler, “Syntactic dependency based

heuristics for biological event extraction,” in Proceedings of the

Workshop on Current Trends in Biomedical Natural Language

Processing: Shared Task, pp 119–127, Association for

Computa-tional Linguistics, Boulder, Colo, USA, 2009

[85] A MacKinlay, D Martinez, and T Baldwin, “Biomedical event

annotation with CRFs and precision grammars,” in Proceedings

of the Workshop on Current Trends in Biomedical Natural

Language Processing: Shared Task, pp 77–85, Boulder, Colo,

USA, June 2009

[86] J Bj¨orne, J Heimonen, F Ginter et al., “Extracting complex

biological events with rich graph-based feature sets,” in

Proceed-ings of the Workshop on Current Trends in Biomedical Natural

Language Processing: Shared Task, pp 10–18, 2009.

[87] M Miwa, R Sætre, J.-D Kim, and J Tsujii, “Event extraction

with complex event classification using rich features,” Journal of

Bioinformatics and Computational Biology, vol 8, no 1, pp 131–

146, 2010

[88] A Casillas, A D de Ilarraza, K Gojenola, M Oronoz, and G

Rigau, “Using kybots for extracting events in biomedical texts,”

in Proceedings of the BioNLP Shared Task 2011 Workshop, pp.

138–142, Portland, Ore, USA, June 2011

[89] D Zhou and Y He, “Biomedical events extraction using the

hidden vector state model,” Artificial Intelligence in Medicine,

vol 53, no 3, pp 205–213, 2011

[90] L Qian and G Zhou, “Tree kernel-based protein-protein

interaction extraction from biomedical literature,” Journal of

Biomedical Informatics, vol 45, no 3, pp 535–543, 2012.

[91] K Hakala, S Van Landeghem, T Salakoski et al., “EVEX in

ST’13: application of a large-scale text mining resource to event

extraction and network construction,” in Proceedings of the

Computational Linguistics, Sofia, Bulgaria, August 2013

[92] J Xia, A C Fang, and X Zhang, “A novel feature selection

strategy for enhanced biomedical event extraction using the

Turku system,” BioMed Research International, vol 2014, Article

ID 205239, 12 pages, 2014

[93] H Kilicoglu and S Bergler, “Adapting a general semantic

interpretation approach to biological event extraction,” in

Pro-ceedings of the BioNLP Shared Task 2011 Workshop, pp 173–182,

Association for Computational Linguistics, Portland, Ore, USA,

June 2011

[94] J D Lafferty, A McCallum, and F C N Pereira,

“Condi-tional random fields: probabilistic models for segmenting and

labeling sequence data,” in Proceedings of the 18th International

Conference on Machine Learning (ICML ’01), pp 282–289,

Williamstown, Mass, USA, June-July 2001

[95] H M Wallach, “Conditional random fields: an introduction,”CIS Technical Report MS-CIS-04-21, 2004

[96] P Le-Hong, X H Phan, and T T Tran, “On the effect of the

label bias problem in part-of-speech tagging,” in Proceedings

of the IEEE RIVF International Conference on Computing and Communication Technologies, Research, Innovation, and Vision for the Future (RIVF ’13), pp 103–108, Hanoi, Vietnam, 2013.

[97] J Bj¨orne and T Salakoski, “TEES 2.1: automated annotation

scheme learning in the bionlp 2013 shared task,” in Proceedings

of the Bionlp Shared Task 2013 Workshop, pp 16–25, Association

for Computational Linguistics, Sofia, Bulgaria, August 2013.[98] D McClosky, M Surdeanu, and C D Manning, “Event

extraction as dependency parsing,” in Proceedings of the 49th

Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (HLT ’11), vol 1, pp 1626–1635,

Association for Computational Linguistics, Portland, Ore, USA,2011

[99] Q.-C Bui, D Campos, E van Mulligen, and J Kors, “Afast rule-based approach for biomedical event extraction,” in

Proceedings of the BioNLP Shared Task 2013 Workshop, pp 104–

108, Association for Computational Linguistics, Sofia, Bulgaria,August 2013

[100] X Q Pham, M Q Le, and B Q Ho, “A hybrid approach

for biomedical event extraction,” in Proceedings of the BioNLP

Shared Task 2013 Workshop, pp 121–124, Association for

Com-putational Linguistics, Sofia, Bulgaria, August 2013

[101] S Riedel, H.-W Chun, T Takagi, and J Tsujii, “A Markov logic

approach to bio-molecular event extraction,” in Proceedings

of the Workshop on Current Trends in Biomedical Natural Language Processing: Shared Task (BioNLP ’09), pp 41–49,

Stroudsburg, Pa, USA, 2009

[102] H Poon and L Vanderwende, “Joint inference for knowledge

extraction from biomedical literature,” in Proceedings of the

Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT ’10), pp 813–821, Association for Computa-

tional Linguistics, 2010

[103] M Richardson and P Domingos, “Markov logic networks,”

Machine Learning, vol 62, no 1-2, pp 107–136, 2006.

[104] S Riedel and A McCallum, “Robust biomedical event tion with dual decomposition and minimal domain adaptation,”

extrac-in Proceedextrac-ings of the BioNLP Shared Task 2011 Workshop, pp 46–

50, Association for Computational Linguistics, Stroudsburg, Pa,USA, June 2011

[105] N Komodakis, N Paragios, and G Tziritas, “MRF optimization

via dual decomposition: message-passing revisited,” in

Proceed-ings of the 11th IEEE International Conference on Computer Vision (ICCV ’07), pp 1–8, IEEE, Rio de Janeiro, Brazil, October

“Annotat-in Proceed“Annotat-ings of the 2nd Student Research Workshop Associated

with RANLP (RANLPStud ’11), pp 139–144, Hissar, Bulgaria,

September 2011

Trang 40

[108] R Morante and C Sporleder, “Modality and negation: an

introduction to the special issue,” Computational Linguistics,

vol 38, no 2, pp 223–260, 2012

[109] J D Kim, Y Wang, and Y Yasunori, “The genia event extraction

shared task, 2013 edition—overview,” in Proceedings of the

Computational Linguistics, Sofia, Bulgaria, August 2013

[110] S Van Landeghem, J Bj¨orne, C.-H Wei et al., “Large-scale event

extraction from literature with multi-level gene normalization,”

PLoS ONE, vol 8, no 4, Article ID e55814, 2013.

[111] S Riedel, D McClosky, M Surdeanu, A McCallum, and C D

Manning, “Model combination for event extraction in BioNLP

2011,” in Proceedings of the BioNLP Shared Task 2011 Workshop,

pp 51–55, Association for Computational Linguistics, Portland,

Ore, USA, June 2011

[112] H Liu, L Hunter, V Keˇselj, and K Verspoor, “Approximate

subgraph matching-based literature mining for biomedical

events and relations,” PLoS ONE, vol 8, no 4, Article ID e60954,

2013

[113] J Berant, V Srikumar, P.-C Chen et al., “Modeling biological

processes for reading comprehension,” in Proceedings of the

Empirical Methods in Natural Language Processing (EMNLP ’14),

October 2014

[114] P Kordjamshidi, D Roth, and M.-F Moens, “Structured

learn-ing for spatial information extraction from biomedical text:

bacteria biotopes,” BMC Bioinformatics, vol 16, article 129, 2015.

[115] N Nguyen, J.-D Kim, M Miwa, T Matsuzaki, and J Tsujii,

“Improving protein coreference resolution by simple semantic

classification,” BMC Bioinformatics, vol 13, article 304, 2012.

[116] K Yoshikawa, S Riedel, T Hirao et al., “Coreference based

event-argument relation extraction on biomedical text,” Journal

of Biomedical Semantics, vol 2, article S6, 2011.
