
A hybrid approach to finding phenotype candidates in genetic text


VIETNAM NATIONAL UNIVERSITY, HANOI

UNIVERSITY OF ENGINEERING AND TECHNOLOGY


A hybrid approach to finding phenotype candidates in genetic texts

Le Hoang Quynh

Faculty of Information Technology University of Engineering and Technology Vietnam National University, Hanoi

Supervised by Associate Professor Ha Quang Thuy

A thesis submitted in fulfillment of the requirements for the degree of Master of Science in Computer Science

November 2012


ORIGINALITY STATEMENT

‘I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at University of Engineering and Technology (UET/Coltech) or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at University of Engineering and Technology and National Institute of Informatics (Tokyo, Japan) or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project’s design and conception or in style, presentation and linguistic expression is acknowledged.’

Hanoi, November 10th, 2012

Signed

Le Hoang Quynh


Named entity recognition (NER) has been extensively studied for the names of genes and gene products but there are few proposed solutions for phenotypes. Phenotype terms are expected to play a key role in inferring gene function in complex heritable diseases but are intrinsically difficult to analyse due to their complex semantics and scale. In contrast to previous approaches we evaluate state-of-the-art techniques involving the fusion of machine learning on a rich feature set with evidence from extant domain knowledge-sources. The techniques are validated on two gold standard collections including a novel annotated collection of 112 abstracts derived from a systematic search of the Online Mendelian Inheritance in Man database for auto-immune diseases. Encouragingly the hybrid model outperforms an HMM, a CRF and a pure knowledge-based method to achieve an F1 of 75.37 for BF and a micro average F1 of 84.01 for the whole system.

Publications:

• Mai-Vu Tran, Tien-Tung Nguyen, Thanh-Son Nguyen, Hoang-Quynh Le. Automatic Named Entity Set Expansion Using Semantic Rules and Wrappers for Unary Relations. In International Conference on Asian Language Processing 2010, pages 170-173. Harbin, China; December 28-30, 2010. DOI: http://doi.ieeecomputersociety.org/10.1109/IALP.2010.73

• Hoang-Quynh Le, Mai-Vu Tran, Nhat-Nam Bui, Nguyen-Cuong Phan and Quang-Thuy Ha. An Integrated Approach Using Conditional Random Fields for Named Entity Recognition and Person Property Extraction in Vietnamese Text. In Proceedings of International Conference on Asian Language Processing 2011, pages 115-118. DOI: http://doi.ieeecomputersociety.org/10.1109/IALP.2011.37

• Nigel Collier, Mai-Vu Tran, Hoang-Quynh Le, Anika Oellrich, Ai Kawazoe, Martin Hall-May and Dietrich Rebholz-Schuhmann. A hybrid approach to finding phenotype candidates in genetic text. In The 24th conference on Computational Linguistics (COLING 2012). Accepted as long paper.


First and foremost, I would like to express my deep gratitude to my supervisor, Assoc. Prof. Ha Quang Thuy, for his patient guidance and continuous support throughout the years. He was always there when I needed help, and responded to queries so helpfully and promptly.

I would like to express my gratitude to the National Institute of Informatics (NII, Tokyo, Japan) for giving me a great chance to work at NII in the NII International Internship program. I also give my sincere thanks and appreciation to Assoc. Prof. Nigel H. Collier, my internship supervisor at NII, for his great support.

I would like to thank all my teachers at the University of Engineering and Technology (VNU), who gave me much knowledge and experience.

I also want to thank my colleagues at the Knowledge and Technology laboratory (UET, VNU) and my classmates for their enthusiasm and prompt help.

I sincerely acknowledge the Vietnam National University, NAFOSTED and the QG.10.38 project for financially supporting my master's study.

And thanks to all my friends, who are always by my side and cheer me on.

Finally, this thesis would not have been possible without the support and love of my family. Thank you, mother and father. Thanks to my brother and sister, and thanks to my nephew. And thank you, my beloved husband. Again, thank you; I love all of you so much ♥

Table of Contents

1 Introduction
1.1 Motivation and problem definition
1.2 Phenotype definition
1.3 The challenges of phenotype entity recognition
2 Related works
2.1 Useful resources
2.1.1 GENIA and JNLPBA corpora
2.1.2 The Online Mendelian Inheritance in Man
2.1.3 The Human Phenotype Ontology
2.1.4 The Mammalian Phenotype Ontology
2.1.5 The Unified Medical Language System
2.1.6 KMR corpus
2.2 Related researches
2.2.1 Baseline method: Khordad et al. (2011)
3 Methods
3.1 Schema
3.2 Annotated data sources
3.3 Proposed model
3.3.1 Pre-processing
3.3.2 Machine learning labeler
3.3.3 Knowledge-based labeler
3.3.4 Merge results
4 Experimental results and evaluation
4.1 Metrics
4.2 Experiments on the KMR corpus
4.3 Experiments on the Phenominer corpus
4.4 Discussion
4.4.1 Discussion on corpora
4.4.2 Discussion on results

List of Figures

2.1 A visual example of HPO hierarchical structure
2.2 A visual example of MP hierarchical structure
2.3 Khordad et al. (2011)’s system block diagram
3.1 An informal overview of bodily feature entity
3.2 Phenotype tagging architecture
3.3 Brat rapid annotation tool example
4.1 Column chart showing the experimental results on the KMR corpus
4.2 Column chart showing the experimental results of BF entities on the Phenominer corpus
4.3 Column chart showing the experimental results of GGP entities on the Phenominer corpus

List of Tables

3.1 Referential semantics and scoping of mentions by entity type
3.2 List of auto-immune diseases used to collect the Phenominer corpus
3.3 Feature sets used in the machine learning labeler
3.4 Features exploited by the two learner models
4.1 Results for BF entity on the KMR corpus using models with partial matching
4.2 Results for each entity on the Phenominer corpus using models with partial matching
4.3 Sources of error by the Hybrid system on the KMR corpus
4.4 Sources of error by Khordad et al.’s system on the Phenominer corpus
4.5 Sources of error by the Hybrid system on the Phenominer corpus

List of Abbreviations

BF Bodily feature

CRF Conditional Random Field

GGP Gene and gene product

HMM Hidden Markov Model

HPO the Human Phenotype Ontology


Chapter 1

Introduction

1.1 Motivation and problem definition

During the last decade biomedicine has developed tremendously. Every day many biomedical papers are published and a great amount of information is produced. Due to the rapidly increasing amount of biomedical literature available on the Web, biomedical information extraction is becoming more and more important.

Biomedical named entity recognition (NER) is a subtask of biomedical information extraction; it is a fundamental step and can affect the results of other tasks. Biomedical NER is a computational technique used to identify and classify strings of text (mentions) that designate important concepts in biomedicine. As the first stage in the integrated semantic linking of knowledge between literature and structured databases it is critically important to maximize the effectiveness of this step.

This thesis focuses on the analysis and identification of a new class of entity: phenotypes. Following Hoehndorf et al. (2010), phenotypes are important for the analysis of the molecular mechanisms underlying disease; they are also expected to play a key role in inferring gene function in complex heritable diseases. Two thoughts motivate our work: (1) the database curation community has expressed a wish for full text entity indexing and the inclusion of phenotypes (Dowell et al., 2009; Hirschman et al., 2012), and (2) biomedicine is rapidly moving towards full-scale integration of data, opening up the possibility to understand complex heritable diseases caused by genes. Association studies involving phenotypes are considered important to making progress (Lage et al., 2007; Wu et al., 2008). The ultimate goal of the work we present here is to allow relations mined from sentences such as the one we annotated below to feed into novel hypothesis generation procedures. From Ex. 1, the reader can easily infer a relation between ‘IgG1 disorder’ and three genes/gene products marked as GGP.

Ex. 1: Among [patients]ORGANISM with [systemic lupus erythematosus]DISEASE ([SLE]DISEASE), those with the [IgG1 disorder]PHENOTYPE have a higher prevalence of high titre [rheumatoid factor]GGP and [antinuclear antibody]GGP, but a lower prevalence of [anti-double-stranded DNA (anti-dsDNA) antibodies]GGP above 30 U/ml (Source PMCID: PMC1003566).
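To make the intended use of such annotations concrete, the following is a minimal sketch that reads the bracketed example sentence and lists phenotype/GGP pairs that co-occur in it. The inline "[mention]LABEL" encoding, the function names and the simple co-occurrence pairing are assumptions made for illustration only; they are not the annotation format or the relation mining procedure of this thesis.

```python
import re

# Assumed inline encoding: "[mention text]LABEL", LABEL in capital letters.
ANNOTATION = re.compile(r"\[([^\[\]]+)\]([A-Z]+)")

EXAMPLE = (
    "Among [patients]ORGANISM with [systemic lupus erythematosus]DISEASE "
    "([SLE]DISEASE), those with the [IgG1 disorder]PHENOTYPE have a higher "
    "prevalence of high titre [rheumatoid factor]GGP and "
    "[antinuclear antibody]GGP, but a lower prevalence of "
    "[anti-double-stranded DNA (anti-dsDNA) antibodies]GGP above 30 U/ml"
)


def extract_mentions(sentence):
    """Return (mention, label) pairs in order of appearance."""
    return [(m.group(1), m.group(2)) for m in ANNOTATION.finditer(sentence)]


def candidate_relations(mentions, left="PHENOTYPE", right="GGP"):
    """Pair every `left` mention with every `right` mention in the sentence.

    Plain co-occurrence is only a starting point for hypothesis generation;
    it says nothing about the direction or polarity of the relation.
    """
    lefts = [text for text, label in mentions if label == left]
    rights = [text for text, label in mentions if label == right]
    return [(l, r) for l in lefts for r in rights]


mentions = extract_mentions(EXAMPLE)
pairs = candidate_relations(mentions)
for phenotype, ggp in pairs:
    print(phenotype, "<->", ggp)
```

Each printed pair is a candidate relation between the phenotype mention and one of the three GGP mentions in the sentence.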

1.2 Phenotype definition

Unlike genes or anatomic structures, phenotypes and their traits are complex concepts and do not constitute a homogeneous class of objects (i.e. a natural kind). Traits such as ‘eye colour’, ‘blood group’, ‘hemoglobin concentration’ or ‘facial grimacing’ describe morphological structures, physiological processes and behaviours. When qualities or quantities of traits are used to describe a specific organism then we have phenotypic descriptions, e.g. ‘blue eyes’, ‘blood group AB’, ‘not having between 13 and 18 gm/dl hemoglobin concentration’.

Until recently, there has been little effort to provide data integration standards for phenotypes. This means that phenotypic descriptions tend to be author/study specific and biological results may go undiscovered if the terms used lie outside an author’s immediate research area (Bard and Rhee, 2004). In some studies, it is simply called ‘phenotypic information’ and the authors do not give any specific definition for it (Hoehndorf et al., 2010). In the CSI-OMIM system (Cohen et al., 2011), phenotypes are considered as genetic terms including clinical signs and symptoms.

Freimer and Sabatti (2003) describe phenotypes as referring to ‘any morphologic, biochemical, physiological or behavioral characteristic of an organism. All phenotypic characteristics represent the expression of particular genotypes combined with the effects of specific environmental influences’. Khordad et al. (2011) define phenotypes as ‘genetically-determined observable characteristics of a cell or organism, including the result of any test that is not a direct test of the genotype. A phenotype of an organism is determined by the interaction of its genetic constitution and the environment’.


Our definition of phenotype is taken from the formal analysis in Scheuermann et al. (2009).

Definition: A phenotype entity is a (combination of) bodily feature(s) of an organism determined by the interaction of its genetic make-up and environment.

Examples include: [lack of kidney], [abnormal cell migration], [absent ankle reflexes] as well as more complex cases such as [no abnormality in his heart], [unfavorable serum lipid levels] and [susceptibility to ulcerative colitis].

But Scheuermann et al. (2009) also define a symptom as ‘a bodily feature of a patient that is observed by the patient or clinician and suspected of being caused by a disease’. Causality (or context) therefore creates an ambiguity: a term may be a symptom in some contexts but refer to a phenotype in others, and many symptoms may be phenotypes. Thus, it is important to recognize that this phenotype definition requires us to know the underlying cause. Since causality is often difficult to establish using narrow contextual evidence of the sort used in NER, it seems reasonable that we focus here on identifying bodily features themselves, i.e. phenotype candidates, and then determine causality in another stage of processing.

Definition: A bodily feature (BF) entity is a mention of a bodily quality in an organism. It is considered a phenotype candidate.

Our definition of bodily features requires two caveats: (1) in contrast to Khordad et al. (2011) we did not apply a granular cut-off at the level of the cell, and (2) because of the diversity of bodily features across organisms we took a decision to focus our definition of this entity on mouse as a model organism and human as the most important species.

1.3 The challenges of phenotype entity recognition

Unlike NER in the newswire domain, NER in the biomedical domain remains a perplexing challenge. Biomedical NEs in general do not follow any nomenclature, and can comprise long compound words or short abbreviations. Some even contain various symbols or spelling variations. We summarize some challenges for BF NER below (some of them are difficulties of NER in the biomedical domain mentioned by Lin et al. (2004)).


• Unknown word identification: Unknown words are used extremely often. They can be acronyms, abbreviations, or words containing hyphens, digits, letters, and Greek letters. Moreover, the use of numerous synonyms and homonyms makes recognition more difficult.

• Named entity boundary identification: The boundary of an NE can be a regular English word, unknown word, Roman numeral, or digit. A BF can apply at all levels of anatomical granularity from chemical structures to cells and organs, making it difficult to know where to draw a boundary. Additionally, nested NEs (an NE embedded in another NE) further complicate this problem: a BF can contain a GGP, a disease and even an organism.

• Named entity classification: Once an NE is identified, it is then classified into a category such as GGP, anatomy, BF, and so on. Ambiguity and inconsistency are often encountered at this stage. NEs with the same orthographical features may fall into different categories (for example, there is considerable ambiguity between BF and disease).

In addition, BF entities are intrinsically more difficult to analyze due to their complex semantics, scale and structure:

• Semantically, a BF can be an abnormal (in a disordered disposition) or normal (in an ordered disposition) feature of humans or mice; it may or may not be a clinically relevant characteristic of a human/mouse disease.

• The lack of a standard nomenclature, together with extensive and growing nomenclatures and the lack of naming agreement before a standard name is accepted, makes BF recognition more difficult.

• BFs can be found with complex structure in various forms, and sometimes even biologists do not agree on the boundary of a BF. A BF may contain modifiers (for example, quantifications that are either specific (e.g. 18 gm/dl) or relative (e.g. ‘normal’ or ‘increased’)); negations can be used to indicate lack of an anatomy/GGP or normal/abnormal qualities of an anatomy/GGP (for example: [not having kidney], [not having between 13 and 18 gm/dl hemoglobin concentration]) but they can also show that a human or mouse does not have a BF (for example: there is [no abnormality in his heart], she has a [fever] but doesn’t have a [cough]); conjoined cases happen when two or more BFs share one head.
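The unknown-word problem described above is commonly approached with orthographic features that summarize what a token looks like (capitals, digits, hyphens, Greek letters) rather than what it is. The sketch below illustrates the idea only; the feature names and the word-shape encoding are assumptions chosen for this example and are not the feature set used by the machine learning labeler in this thesis.

```python
import re
import unicodedata


def has_greek(token):
    """True if any character is a Greek letter according to its Unicode name."""
    return any("GREEK" in unicodedata.name(ch, "") for ch in token)


def word_shape(token):
    """Collapse a token to a short shape, e.g. 'IgG1' -> 'AaA0'."""
    mapped = []
    for ch in token:
        if ch.isupper():
            mapped.append("A")
        elif ch.islower():
            mapped.append("a")
        elif ch.isdigit():
            mapped.append("0")
        else:
            mapped.append(ch)  # keep hyphens, slashes and other symbols
    # Squeeze runs of the same shape character into one.
    return re.sub(r"(.)\1+", r"\1", "".join(mapped))


def token_features(token):
    """Illustrative orthographic features for one token."""
    return {
        "shape": word_shape(token),
        "has_digit": any(ch.isdigit() for ch in token),
        "has_hyphen": "-" in token,
        "has_greek": has_greek(token),
        "all_caps": token.isupper(),
    }


for tok in ["IgG1", "anti-dsDNA", "SLE", "TNF-α"]:
    print(tok, token_features(tok))
```

Shapes of this kind let a statistical tagger generalize from tokens seen in training to unseen acronyms and identifiers with a similar form.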


Given the motivation and challenges of phenotype recognition, the key contributions of this thesis are: (1) to provide an operational semantics for identifying phenotype candidates in text, (2) to introduce a set of guidelines and an annotated corpus based on a selection of 19 clinically significant auto-immune diseases from the Online Mendelian Inheritance in Man (OMIM) (Hamosh et al., 2005), one of the most widely used gene-disease databases, and (3) to propose a new named entity solution that uses statistical inference and external manually crafted resources, in order to mitigate linguistic variation whilst still meeting the conceptual expectations of biologists.

The remainder of this thesis is organized as follows. The second chapter presents related research and useful resources. The third chapter describes our Phenominer corpus version 1.0 and the proposed method for phenotype candidate recognition. The fourth chapter reports experimental results, evaluation and discussion. The final chapter gives the conclusions.

Chapter 2

Related works

[...] as our baseline method for BF.

2.1 Useful resources

Using available resources helps us not only to take advantage of knowledge from other research but also to reduce effort. Many resources are now used in bio-informatics. Among these, linguistically annotated corpora such as GENIA (Tateisi et al., 2000; Kim et al., 2003) and OMIM (Hamosh et al., 2005) have proven to be central to the NER solution. However, due to the size of the vocabularies involved, annotated corpora by themselves do not provide a complete solution. Researchers have therefore also looked at the rich availability of formally structured biomedical knowledge (ontologies) such as the Unified Medical Language System (UMLS) (Bodenreider et al., 2002), the Human Phenotype Ontology (Robinson and Mundlos, 2010), the Mammalian Phenotype Ontology (Smith and Eppig, 2009), the Gene Ontology (Gene Ontology Consortium, 2000), etc.


2.1.1 GENIA and JNLPBA corpora

GENIA corpus version 3.0 (Kim et al., 2003) was formed from a controlled search on MEDLINE using the MeSH terms ‘human’, ‘blood cells’ and ‘transcription factors’. From this search, 2000 abstracts (20,546 sentences, more than 400,000 words) were selected. This corpus has been released with linguistically rich annotations including sentence boundaries, term boundaries, term classifications, semi-structured coordinated clauses, recovered ellipsis in terms, etc. Entities are hand annotated into 36 classes of DNA, RNA, cell line, cell type and protein (almost 100,000 annotations).

The JNLPBA data set came from the GENIA version 3.02 corpus. It is the training set for the Bio-Entity recognition task at JNLPBA (Kim et al., 2004). In this shared task, the 36 classes of the GENIA corpus were simplified and only the classes protein, DNA, RNA, cell line and cell type were used.

The GENIA and JNLPBA corpora are important for two major reasons: first, they provide the largest single source of annotated training data for the NE task in molecular biology, and second, the breadth of their classification. According to Kim et al. (2004), although the number of classes in the GENIA/JNLPBA corpora is a fraction of the classes contained in major taxonomies, it is still the largest class set that has been attempted so far for the named entity recognition task. Moreover, the GENIA corpus can also be used for other biomedical tasks, such as POS tagging.

2.1.2 The Online Mendelian Inheritance in Man

The Online Mendelian Inheritance in Man (OMIM) (Hamosh et al., 2005) is a continuously updated catalog of human genes and genetic disorders and traits, with particular focus on the molecular relationship between genetic variation and phenotypic expression (genotype and phenotype). The full text and referenced overviews in OMIM contain information on many mendelian disorders and over 12,000 genes.

Derived from the biomedical literature, OMIM is written and edited at Johns Hopkins University with input from scientists and physicians around the world. Each OMIM entry has a full text summary of a genetically determined phenotype and/or gene and has numerous links to other genetic databases such as DNA and protein sequence, PubMed references, general and locus-specific mutation databases, HUGO nomenclature, MapViewer, GeneTests, patient support groups and many others. Within an OMIM entry, there is a field called ‘Clinical Synopsis’ which lists the clinical features of the disorder that appear in this entry or in its references. There are over 4500 clinical synopses in OMIM; they are an important resource for research on phenotypes.

OMIM is an easy and straightforward portal to the burgeoning information in human genetics, and it is now distributed electronically by the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/omim/). Over five decades OMIM has achieved great success; it is one of the most important information sources about human genes and genetic phenotypes (Cohen et al., 2011; Robinson and Mundlos, 2010).

Nonetheless, OMIM does not use a controlled vocabulary to describe the phenotypic features in its clinical synopsis section, which makes it inappropriate for data mining. In section 2.1.3, we introduce the HPO, which was constructed using OMIM.

2.1.3 The Human Phenotype Ontology

The Human Phenotype Ontology (HPO) is a standardized, controlled vocabulary that allows phenotypic information to be described in an unambiguous fashion in medical publications and databases (Robinson and Mundlos, 2010).

The HPO was originally constructed using data from OMIM by merging synonyms and creating the hierarchical structure between terms according to their semantics. The hierarchical structure in the HPO represents the subclass relationship; figure 2.1 illustrates the hierarchical structure of the HPO using the example of ‘atrioventricular septal defect’ [HP:0010439] (the example comes from Robinson and Mundlos (2010)). As of 2012, the HPO contains over 9500 unique terms (more than 15000 synonyms) describing human phenotypic features.

Nevertheless, according to Khordad et al. (2011), the HPO is not complete and they had several problems finding phenotype names in it:

(1) some acronyms and abbreviations are not available in the HPO;

(2) although the HPO contains synonyms of phenotypes, there are still some synonyms that are not included in the HPO;

(3) in some cases adjectives and other modifiers are added to phenotype names, making it difficult to find these phenotype names in the ontology;

(4) new phenotypes are being continuously introduced to the biomedicine world; the HPO is being constantly refined, corrected, and expanded manually, but this process is not fast enough nor can the inclusion of new phenotypes be guaranteed.

Thus, although the HPO is a very useful resource, it is not sufficient on its own for phenotype recognition; we should use it only as an additional resource.
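The subclass hierarchy described in this section can be pictured as a directed acyclic graph in which each term points to its parent terms. The following is a minimal sketch of walking such a graph. Only the identifier HP:0010439 comes from the text; every `TOY:` identifier and the shape of the graph are invented placeholders, not real HPO content, and a real system would load the ontology from its distributed file instead.

```python
from collections import deque

# Toy is-a graph: child term -> list of parent terms.
# Only HP:0010439 is taken from the text; the TOY:* terms are hypothetical.
PARENTS = {
    "HP:0010439": ["TOY:0003", "TOY:0004"],  # atrioventricular septal defect
    "TOY:0003": ["TOY:0002"],
    "TOY:0004": ["TOY:0002"],
    "TOY:0002": ["TOY:0001"],
    "TOY:0001": [],  # root of the toy hierarchy
}


def ancestors(term, parents):
    """Return every term reachable from `term` by following is-a links."""
    seen = set()
    queue = deque(parents.get(term, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # a term can be reached through several parents
        seen.add(current)
        queue.extend(parents.get(current, []))
    return seen


def is_subclass_of(term, candidate, parents):
    """True if `term` equals `candidate` or lies below it in the hierarchy."""
    return term == candidate or candidate in ancestors(term, parents)


print(sorted(ancestors("HP:0010439", PARENTS)))
```

A term may have more than one parent, which is why the traversal keeps a `seen` set instead of assuming a tree.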

2.1.4 The Mammalian Phenotype Ontology

The Mammalian Phenotype Ontology (MP) (Smith and Eppig, 2009) has been applied to mouse phenotype descriptions in MGI (Mouse Genome Informatics Database, http://www.informatics.jax.org/), RGD (Rat Genome Database, http://rgd.mcw.edu), OMIA (Online Mendelian Inheritance in Animals, http://omia.angis.org.au/) and elsewhere. Use of this ontology allows comparisons of data from diverse sources, can facilitate comparisons across mammalian species, assists in identifying appropriate experimental disease models, and aids in the discovery of candidate disease genes and molecular signaling pathways.

Like the HPO, the MP is a standardized, hierarchically structured vocabulary. The highest level terms describe physiological systems, survival, and behavior. The physiological systems branch into morphological and physiological phenotype terms at the next node level. An example of the hierarchical tree for the term ‘opisthotonus’ [MP:0002880] is shown in figure 2.2 (the example comes from Smith and Eppig (2009)).

As of 2012, MP has about 9000 unique terms (about 24000 synonyms) describing abnormal mouse phenotypes.

2.1.5 The Unified Medical Language System

The Unified Medical Language System (UMLS) (Bodenreider et al., 2002) is a set of files and software that brings together many health and biomedical vocabularies and standards. The UMLS has three tools, which are called the Knowledge Sources: the Metathesaurus, the Semantic Network, and the SPECIALIST Lexicon and Lexical Tools.

• The Metathesaurus is a very large, multi-purpose, and multi-lingual vocabulary database that contains information about biomedical and health related concepts, their various names, and the relationships among them. It contains more than 1.8 million concepts that come from more than 100 source vocabularies.

• The Metathesaurus is linked to the Semantic Network: all concepts in the Metathesaurus are assigned to at least one semantic type from the Semantic Network.

• MetaMap is a well-known tool in the UMLS SPECIALIST Lexicon and Lexical Tools. It is a highly configurable application that maps biomedical text to the UMLS Metathesaurus: MetaMap tokenizes the input text and chunks it into phrases, maps each phrase to a set of candidate UMLS concepts, and then applies a word sense disambiguation step to choose the best candidate with respect to the surrounding text.

However, the UMLS Semantic Network does not contain Phenotype as a semantic type, so it alone is not adequate to distinguish between phenotypes and other objects in text. In addition, some phenotype names do not exist in the UMLS Metathesaurus at all. Even so, UMLS and its knowledge sources may still be useful for phenotype recognition in some ways.

2.1.6 KMR corpus

We refer to the manually annotated corpus in Khordad et al. (2011) as the ‘KMR corpus’. It is a collection of 3784 tokens (120 sentences) with 110 annotated phenotype mentions. Sentences in the KMR corpus were taken from 4 PubMed papers from the year 2009 in the area of human genetics. Annotation was conducted with reference to the HPO, so that a term was tagged as a phenotype if it was in the HPO, or if it was not in the HPO but its definition showed that it was caused by a genotype.

It is not a well-known corpus and has only been used in Khordad et al. (2011)’s research. However, annotated corpora for phenotypes are currently scarce, so it is still a valuable choice. We will use this corpus for testing and analyzing our proposed model.

Above, we have introduced only some of the most typical useful resources for our research. In addition to them, there are many other resources for bio-informatics that can be used, such as Medical Subject Headings (MeSH, http://www.nlm.nih.gov/mesh/meshhome.html), a gene list containing more than 9 million genes (created by the National Center for Biotechnology Information, U.S. National Library of Medicine), etc.

2.2 Related researches

Named Entity Recognition in the biomedical domain has been extensively studied and, as a consequence, many methods have been proposed. Some methods, like MetaMap, are generic and find many kinds of entities in text. Others are specialized to recognize particular types of entities. However, these techniques tend to emphasize finding the names of genes, gene products, cells, diseases and chemicals (Fukuda et al., 1998; Rindflesch et al., 1999; Collier et al., 2000; Kazama et al., 2002; Zhou et al., 2003; Settles, 2004; Kim et al., 2004; Leaman and Gonzalez, 2008). So far, only a small number of studies have addressed phenotypes, and they are often based primarily on an available resource or a rule-based method. Whilst other authors have tried similar approaches for other entity types, none have tried both machine learning and external resource lookup for a class as rich and semantically complex as phenotypes.

In this section, we describe the method proposed by Khordad et al. (2011), which is used as our baseline method for comparison in the experiments.

2.2.1 Baseline method: Khordad et al. (2011)

The system built in Khordad et al. (2011) is based on MetaMap and makes use of the UMLS Metathesaurus and the Human Phenotype Ontology. Starting from an initial basic system that uses only these pre-existing tools, five rules that capture stylistic and linguistic properties of this type of literature are proposed to enhance the performance of their NER tool. A block diagram of Khordad et al. (2011)’s system is shown in figure 2.3. The system performs the following steps:

• (1) MetaMap chunks the input text into phrases and assigns the UMLS semantic types associated with each noun phrase.

• (2) The Disorder Recognizer analyzes the MetaMap output to find phenotypes and phenotype candidates. This is the most important part of the method; it is based primarily on the idea that a phenotype must belong to certain UMLS semantic types. The UMLS Semantic Network contains 133 Semantic Types, which are categorized into 15 more general Semantic Groups. Among them, the Semantic Group Disorders contains 12 semantic types that are close to the meaning of phenotype: Acquired Abnormality, Anatomical Abnormality, Cell or Molecular Dysfunction, Congenital Abnormality, Disease or Syndrome, Experimental Model of Disease, Finding, Injury or Poisoning, Mental or Behavioral Dysfunction, Neoplastic Process, Pathologic Function, and Sign or Symptom. In this step, phrases that do not belong to this semantic group are rejected.

However, a number of semantic types in this semantic group may include concepts that are not phenotypes. The 7 problematic semantic types are: Finding, Disease or Syndrome, Experimental Model of Disease, Injury or Poisoning, Sign or Symptom, Pathologic Function, and Cell or Molecular Dysfunction. Therefore, if a phrase is assigned to one of these semantic types, it is considered a phenotype candidate and is confirmed as a phenotype or not in step (3); otherwise, it is a phenotype.

• (3) Phenotype candidates from the previous step are searched for in the HPO using OBO-Edit. Phenotype candidates that are found in the HPO are recognized as phenotypes.

• (4) The Result Merger merges the phenotypes found by the Disorder Recognizer and OBO-Edit and produces the output, which is the final list of phenotypes found in the input text.
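The decision logic of steps (2) to (4) can be sketched in a few lines. This is only an illustration of the filtering described above, not Khordad et al.'s implementation: MetaMap and OBO-Edit are not called, the input phrases with their semantic types are hand-written stand-ins for MetaMap output, `hpo_terms` is a stand-in for an HPO lookup, the five additional rules are omitted, and treating a phrase with several semantic types as a candidate whenever any of them is problematic is an assumption.

```python
# The 12 semantic types of the UMLS Semantic Group "Disorders" listed above.
DISORDER_TYPES = {
    "Acquired Abnormality", "Anatomical Abnormality",
    "Cell or Molecular Dysfunction", "Congenital Abnormality",
    "Disease or Syndrome", "Experimental Model of Disease", "Finding",
    "Injury or Poisoning", "Mental or Behavioral Dysfunction",
    "Neoplastic Process", "Pathologic Function", "Sign or Symptom",
}

# The 7 types that may also cover non-phenotype concepts.
AMBIGUOUS_TYPES = {
    "Finding", "Disease or Syndrome", "Experimental Model of Disease",
    "Injury or Poisoning", "Sign or Symptom", "Pathologic Function",
    "Cell or Molecular Dysfunction",
}


def classify_phrase(semantic_types):
    """Step (2): return 'reject', 'candidate' or 'phenotype'."""
    disorder = set(semantic_types) & DISORDER_TYPES
    if not disorder:
        return "reject"
    if disorder & AMBIGUOUS_TYPES:  # assumption: any ambiguous type wins
        return "candidate"
    return "phenotype"


def find_phenotypes(phrases, hpo_terms):
    """Steps (3)-(4): confirm candidates against HPO names and merge."""
    found = []
    for text, semantic_types in phrases:
        decision = classify_phrase(semantic_types)
        confirmed = decision == "candidate" and text.lower() in hpo_terms
        if decision == "phenotype" or confirmed:
            found.append(text)
    return found


# Hand-written stand-ins; the type assignments are illustrative only.
phrases = [
    ("cleft palate", ["Congenital Abnormality"]),
    ("fever", ["Sign or Symptom"]),
    ("accident", ["Injury or Poisoning"]),
    ("aspirin", ["Pharmacologic Substance"]),
]
hpo_terms = {"fever"}  # hypothetical lookup table of lowercase HPO names

print(find_phenotypes(phrases, hpo_terms))
```

In this toy run the unambiguous phrase is accepted directly, the candidate found in the lookup table is confirmed, and the other two phrases are dropped.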

This model was tested on the small KMR corpus (described in section 2.1.6), annotated by its authors. The reported results are a precision of 97.58, a recall of 88.32 and an F1 of 92.71.
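As a quick consistency check on the figures reported above, the snippet below recomputes F1 from the precision and recall, assuming the standard definition of F1 as the harmonic mean of the two. The function name is ours, and the source does not state how the reported value was rounded.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported for the baseline on the KMR corpus.
f1 = f1_score(97.58, 88.32)
print(round(f1, 2))
```

The recomputed value agrees with the reported 92.71 to within about 0.01, a gap that is consistent with rounding or truncation of the published figures.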

Figure 2.1: A visual example of HPO hierarchical structure (term HP:0010439)

Figure 2.2: A visual example of MP hierarchical structure (term MP:0002880)

Figure 2.3: Khordad et al. (2011)’s system block diagram

Chapter 3

Methods

In this chapter, we first analyze in detail the two entities employed in this study: gene/gene product (GGP) and bodily feature (BF) (section 3.1). Then, in section 3.2, we introduce our Phenominer corpus version 1.0, which is built from 19 auto-immune diseases; this corpus can be used in phenotype recognition as well as other biomedical problems. Finally, section 3.3 describes our proposed hybrid model for BF and GGP entity recognition; the model consists of three main parts: a machine learning labeler, a knowledge-based labeler and a merge results module.

3.1 Schema

[...] Rebholz-Schuhmann et al. (2010). Because of space limitations we will not provide a rigidly formal definition or a taxonomic analysis (Beisswanger et al., 2008). Future work will explore the relationships between these and other entity types.

In line with BioTop (Beisswanger et al., 2008), GGP is relatively straightforward to define by the conjunction of (BioTop ID Nucleic Acid Structure) and (BioTop ID Peptide Structure).

Definition: A gene/gene product (GGP) entity is a mention of one of three major macro-molecules: DNA, RNA or protein. DNA and RNA are nucleic acid sequences containing the genetic instructions used in the development and function of an organism. Proteins are polypeptide sequences, or parts of polypeptide sequences, folded into structures that facilitate biological function.

Examples include: [cryoglobulins], [anticardiolipin antibodies], [AFM044xg3], [chromosome 17q], [CC16 protein].

As mentioned in chapter 1, in this thesis we treat a bodily feature (BF) as a phenotype candidate.

Definition: A bodily feature (BF) entity is a mention of a bodily quality in an organism.

Examples include: [lack of kidney], [abnormal cell migration], [absent ankle reflexes] as well as more complex cases such as [no abnormality in his heart], [unfavorable serum lipid levels] and [susceptibility to ulcerative colitis].

Figure 3.1 gives an informal overview of the bodily feature entity. It visually describes some forms of BF obtained from surveying the data, comprising structural attributes, qualitative attributes, functional attributes and process attributes.

Figure 3.1: An informal overview of bodily feature entity


For example: [black hair], [not having between 13 and 18 gm/dl hemoglobin concentration], [adult female height 130-157 cm], [conjoined fingers].

• Functional attributes are related to the functions and dispositions of anatomy (Hoehndorf et al., 2010). Intuitively, the functions of an anatomy establish the reason (or cause) that the anatomy exists, while its dispositions determine its capabilities and potentials. For example, the endocrine pancreatic cells have a function to produce insulin, and normally have a disposition to produce insulin. In general, a functional attribute shows the lack or abnormality of an anatomy's function.

For example: [facial grimacing], [sleepy facial expression], [reading disability], [hypotension], [deaf].

• Process attributes represent characteristics of the processes themselves. They include characteristics of physiological processes, metabolic processes, biological pathways, chemical reactions, gene-related processes, gene expression, etc. The expression of a process attribute sometimes has a complex structure, but following the discussion of phenotypes as processes in physiology (Hoehndorf et al., 2012) we include some mentions of processes within the scope of our annotation schema.

For example: [defective DNA repair after ultraviolet radiation damage], [abnormality of metabolism], [proliferation of BAF-32 cells].
