learning scenario 5 in Section 2.5). Collecting data relevant for the existing ontology can also be used in some other phases of the semi-automatic ontology construction process, such as ontology evaluation or ontology refinement (phases 5 and 6, Section 2.4), for instance, via associating new instances with the existing ontology in a process called ontology grounding (Jakulin and Mladenic, 2005). In the case of topic ontologies (see Chapter 7), where the concepts correspond to topics and documents are linked to these topics through an appropriate relation such as hasSubject (Grobelnik and Mladenic, 2005a), one can use the Web to collect documents on a predefined topic. In Knowledge Discovery, the approaches dealing with collecting documents based on Web data are referred to in the literature as Focused Crawling (Chakrabarti, 2002; Novak, 2004b). The main idea of these approaches is to use the initial 'seed' information given by the user to find similar documents by exploiting (1) background knowledge (ontologies, existing document taxonomies, etc.), (2) web topology (following hyperlinks from the relevant pages), and (3) document repositories (through search engines). The general assumption behind most focused crawling methods is that pages with more closely related content are more inter-connected. In cases where this assumption does not hold (or cannot reasonably be assumed), we can still select documents through search engine querying (Ghani et al., 2005). In general, focused crawling serves as a generic technique for collecting data to be used in the next stages of data processing, such as constructing (ontology learning scenario 4 in Section 2.5) and populating ontologies (ontology learning scenario 3 in Section 2.5).
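As a rough illustration of this crawling loop, the sketch below scores each fetched page against the user-supplied 'seed' topic description and only follows hyperlinks from pages judged relevant. It is a minimal sketch, assuming the requests, beautifulsoup4 and scikit-learn packages; the similarity threshold and the TF-IDF scoring are illustrative choices, not the settings of the systems cited above.

```python
# Illustrative focused-crawling loop: follow hyperlinks only from pages whose
# text is sufficiently similar to a user-supplied 'seed' topic description.
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def focused_crawl(seed_urls, topic_description, max_pages=100, threshold=0.15):
    # The vocabulary is fitted on the topic description only, so pages are
    # scored purely on the topic's own terms (a simplification).
    vectorizer = TfidfVectorizer(stop_words="english")
    topic_vec = vectorizer.fit_transform([topic_description])

    frontier = deque(seed_urls)
    visited, relevant = set(), []

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        soup = BeautifulSoup(html, "html.parser")
        text = soup.get_text(separator=" ")

        # Score the page against the topic with cosine similarity of TF-IDF vectors.
        score = cosine_similarity(vectorizer.transform([text]), topic_vec)[0, 0]
        if score >= threshold:
            relevant.append((url, score))
            # Exploit web topology: expand only the links of relevant pages.
            for anchor in soup.find_all("a", href=True):
                frontier.append(urljoin(url, anchor["href"]))
    return relevant
```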
2.6.5 Data Visualization
Visualization of data in general, and also visualization of document collections, is a method for obtaining early measures of data quality, content, and distribution (Fayyad et al., 2001). For instance, by applying document visualization it is possible to get an overview of the content of a Web site or some other document collection. This can be useful especially for the first phases of semi-automatic ontology construction aiming at domain and data understanding (see Section 2.4). Visualization can also be used for visualizing an existing ontology or some parts thereof, which is potentially relevant for all the ontology learning scenarios defined in Section 2.5.
One general approach to document collection visualization is based on clustering of the documents (Grobelnik and Mladenic, 2002): the documents are first represented as word-vectors and k-means clustering is performed on them (see Subsection 2.6.1). The obtained clusters are then represented as nodes in a graph, where each node is described by the set of most characteristic words in the corresponding cluster. Similar nodes, as measured by their cosine similarity (Equation (2.2)), are connected by a link. When such a graph is drawn, it provides a visual representation of the document set (see Figure 2.1 for an example output of the system). An alternative approach that provides a different kind of document corpus visualization is proposed in (Fortuna et al., 2005b). It is based on Latent Semantic Indexing, which is used to extract hidden semantic concepts from text documents, and on multidimensional scaling, which is used to map the high-dimensional space onto two dimensions. Document visualization can also be a part of more sophisticated tasks, such as generating a semantic graph of a document or supporting browsing through a news collection. For illustration, we provide two examples of document visualization that are based on Knowledge Discovery methods (see Figure 2.2 and Figure 2.3). Figure 2.2 shows an example of visualizing a single document via its semantic graph (Leskovec et al., 2004). Figure 2.3 shows an example of visualizing news stories via visualizing relationships between the named entities that appear in the news stories (Grobelnik and Mladenic, 2004).
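The clustering-based pipeline described above can be sketched roughly as follows, assuming scikit-learn for the word-vector and k-means steps and networkx for the graph; the number of clusters, the use of TF-IDF weighting and the linking threshold are assumptions made for the example rather than the exact settings of the cited systems.

```python
# Illustrative document-collection visualization: cluster TF-IDF word-vectors
# with k-means, label each cluster node with its most characteristic words,
# and link nodes whose centroids are similar under the cosine measure.
import networkx as nx
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def cluster_graph(documents, k=10, top_words=5, link_threshold=0.3):
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(documents)            # documents as word-vectors
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    terms = vectorizer.get_feature_names_out()

    graph = nx.Graph()
    for i, centroid in enumerate(km.cluster_centers_):
        # Most characteristic words of a cluster = highest-weighted centroid terms.
        label = ", ".join(terms[np.argsort(centroid)[::-1][:top_words]])
        graph.add_node(i, label=label)

    sims = cosine_similarity(km.cluster_centers_)
    for i in range(k):
        for j in range(i + 1, k):
            if sims[i, j] >= link_threshold:           # similar clusters get an edge
                graph.add_edge(i, j, weight=float(sims[i, j]))
    return graph
```

Drawing the resulting graph with the node labels gives a picture of the collection similar in spirit to Figure 2.1.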
Figure 2.1 An example output of a system for graph-based visualization of a document collection. The documents are 1700 descriptions of European research projects in information technology (5FP IST).
Figure 2.2 Visual representation of an automatically generated summary of a news story about an earthquake. The summarization is based on deep parsing used for obtaining the semantic graph of the document, followed by machine learning used for deciding which parts of the graph are to be included in the document summary.
Figure 2.3 Visual representation of relationships (edges in the graph) between the named entities (vertices in the graph) appearing in a collection of news stories. Each edge shows the intensity of co-mentioning of the two named entities. The graph is an example focused on the named entity 'Semantic Web' that was extracted from the 11,000 ACM Technology news stories from 2000 to 2004.
2.7 RELATED WORK ON ONTOLOGY CONSTRUCTION

Different approaches have been used for building ontologies, most of them to date using mainly manual methods. An approach to building ontologies was set up in the CYC project (Lenat and Guha, 1990), where the main step involved manual extraction of common sense knowledge from different sources. Several methodologies for building ontologies have been developed, again assuming a manual approach. For instance, the methodology proposed in (Uschold and King, 1995) involves the following stages: identifying the purpose of the ontology (why to build it, how it will be used, the range of users), building the ontology, evaluation and documentation. Building of the ontology is further divided into three steps. The first is ontology capture, where key concepts and relationships are identified, a precise textual definition of them is written, terms to be used to refer to the concepts and relations are identified, and the involved actors agree on the definitions and terms. The second step involves coding of the ontology to represent the defined conceptualization in some formal language (committing to some meta-ontology, choosing a representation language and coding). The third step involves possible integration with existing ontologies. An overview of methodologies for building ontologies is provided in Fernández (1999), where several methodologies, including the one described above, are presented and analyzed against the IEEE Standard for Developing Software Life Cycle Processes, thus viewing ontologies as parts of some software product. As there are some specifics to semi-automatic ontology construction compared to the manual approaches, the methodology that we have defined (see Section 2.4) has six phases. If we relate them to the stages of the methodology defined in Uschold and King (1995), we can see that the first two phases, referring to domain and data understanding, roughly correspond to identifying the purpose of the ontology; the next two phases (task definition and ontology learning) correspond to the stage of building the ontology; and the last two phases, on ontology evaluation and refinement, correspond to the evaluation and documentation stage.
Several workshops at the main Artificial Intelligence and Knowledge Discovery conferences (ECAI, IJCAI, KDD, ECML/PKDD) have been organized addressing the topic of ontology learning. Most of the work presented there addresses one of the following problems/tasks:
Extending the existing ontology: Given an existing ontology with concepts and relations (commonly used is the English lexical ontology WordNet), the goal is to extend that ontology using some text; for example, Web documents are used in (Agirre et al., 2000). This can fit under ontology learning scenario 5 in Section 2.5.
Learning relations for an existing ontology: Given a collection of text documents and an ontology with concepts, learn relations between the concepts. The approaches include learning taxonomic relations, for example isa (Cimiano et al., 2004), and nontaxonomic relations, for example 'hasPart' (Maedche and Staab, 2001), as well as extracting semantic relations from text based on collocations (Heyer et al., 2001). This fits under ontology learning scenario 2 in Section 2.5.
Ontology construction based on clustering: Given a collection of text documents, split each document into sentences, parse the text and apply clustering for semi-automatic construction of an ontology (Bisson et al., 2000; Reinberger and Spyns, 2004). Each cluster is labeled by the most characteristic words from its sentences or using some more sophisticated approach (Popescul and Ungar, 2000). Documents can also be used as a whole, without splitting them into sentences, guiding the user through a semi-automatic process of ontology construction (Fortuna et al., 2005a). The system provides suggestions for ontology concepts, automatically assigns documents to the concepts, proposes naming of the concepts, etc. In Hotho et al. (2003), the clustering is further refined by using WordNet to improve the results by mapping the found sentence clusters onto the concepts of a general ontology. The found concepts can then be used as semantic labels (XML tags) for annotating documents. This fits under ontology learning scenario 4 in Section 2.5.
Ontology construction based on semantic graphs: Given a collection of text documents, parse the documents; perform coreference resolution, anaphora resolution and extraction of subject-predicate-object triples; and construct semantic graphs. These are further used for learning summaries of the documents (Leskovec et al., 2004). An example summary obtained using this approach is given in Figure 2.2. This can fit under ontology learning scenario 4 in Section 2.5.
Ontology construction from a collection of news stories based on named entities: Given a collection of news stories, represent it as a collection of graphs, where the nodes are named entities extracted from the text and the relationships between them are based on the context and collocation of the named entities. These are further used for visualization of news stories in an interactive browsing environment (Grobelnik and Mladenic, 2004). An example output of the proposed approach is given in Figure 2.3. This can fit under ontology learning scenario 4 in Section 2.5.
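A rough sketch of this last approach is given below, using spaCy's pretrained English model as a stand-in for the named-entity extractor actually used in the cited work and networkx for the graph; the co-occurrence window (one story) and the entity types kept are illustrative choices.

```python
# Illustrative input for news visualization: a graph whose vertices are named
# entities and whose edge weights count co-mentions within a story.
from itertools import combinations

import networkx as nx
import spacy

# Stand-in named-entity extractor; the cited work used its own NE component.
nlp = spacy.load("en_core_web_sm")

def entity_cooccurrence_graph(stories, keep_labels=("PERSON", "ORG", "GPE")):
    graph = nx.Graph()
    for story in stories:
        doc = nlp(story)
        entities = {ent.text for ent in doc.ents if ent.label_ in keep_labels}
        for a, b in combinations(sorted(entities), 2):
            if graph.has_edge(a, b):
                graph[a][b]["weight"] += 1             # intensity of co-mentioning
            else:
                graph.add_edge(a, b, weight=1)
    return graph

# A focused view such as Figure 2.3 would then keep only the neighbourhood of
# one entity, e.g. graph.subgraph(["Semantic Web", *graph["Semantic Web"]]).
```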
More information on ontology learning from text can be found in a collection of papers (Buitelaar et al., 2005) addressing three perspectives: methodologies that have been proposed to automatically extract information from texts, evaluation methods defining procedures and metrics for a quantitative evaluation of the ontology learning task, and application scenarios that make ontology learning a challenging area in the context of real applications.
2.8 DISCUSSION AND CONCLUSION
We have presented several techniques from Knowledge Discovery that are useful for semi-automatic ontology construction. In that light, we propose to decompose the semi-automatic ontology construction process into several phases ranging from domain and data understanding through task definition and ontology learning to ontology evaluation and refinement. A large part of this chapter is dedicated to ontology learning. Several scenarios are identified in the ontology learning phase depending on different assumptions regarding the provided input data and the expected output: inducing concepts, inducing relations, ontology population, ontology construction, and ontology updating/extension. Different groups of Knowledge Discovery techniques are briefly described, including unsupervised learning, semi-supervised, supervised and active learning, on-line learning and web mining, focused crawling, and data visualization. In addition to providing a brief description of these techniques, we also relate them to the different ontology learning scenarios that we identified.
Some of the described Knowledge Discovery techniques have already been applied in the context of semi-automatic ontology construction, while others still need to be adapted and tested in that context. A challenge for future research is setting up evaluation frameworks for assessing the contribution of these techniques to specific tasks and phases of the ontology construction process. In that light, we briefly describe some existing approaches to ontology construction and point to the original papers that provide more information on the approaches, usually including some evaluation of their contribution and performance on the specific tasks. We also relate existing work on learning ontologies to the different ontology learning scenarios that we have identified. Our hope is that this chapter, in addition to contributing a proposed methodology for semi-automatic ontology construction and a description of some relevant Knowledge Discovery techniques, also shows potential for future research and triggers some new ideas related to the usage of Knowledge Discovery techniques for ontology construction.
ACKNOWLEDGMENTS
This work was supported by the Slovenian Research Agency and the IST Programme of the European Community under SEKT Semantically Enabled Knowledge Technologies (IST-1-506826-IP) and PASCAL Network of Excellence (IST-2002-506778). This publication only reflects the authors' views.
on Ontology Learning OL-2000, The 14th European Conference on Artificial Intelligence ECAI-2000.
Bloehdorn S, Haase P, Sure Y, Voelker J, Bevk M, Bontcheva K, Roberts I. 2005. Report on the integration of ML, HLT and OM. SEKT Deliverable D.6.6.1, July 2005.
Blum A, Chawla S. 2001. Learning from labelled and unlabelled data using graph mincuts. Proceedings of the 18th International Conference on Machine Learning, pp 19–26.
Buitelaar P, Cimiano P, Magnini B. 2005. Ontology learning from text: methods, applications and evaluation. Frontiers in Artificial Intelligence and Applications, IOS Press.
Brank J, Grobelnik M, Mladenic D. 2005. A survey of ontology evaluation techniques. Proceedings of the 8th International Multi-Conference Information Society IS-2005, Ljubljana: Institut "Jožef Stefan".
Chakrabarti S. 2002. Mining the Web: Analysis of Hypertext and Semi Structured Data. Morgan Kaufmann.
Chapman P, Clinton J, Kerber R, Khabaza T, Reinartz T, Shearer C, Wirth R. 2000. CRISP-DM 1.0: Step-by-step data mining guide.
Cimiano P, Pivk A, Schmidt-Thieme L, Staab S. 2004. Learning taxonomic relations from heterogeneous evidence. In Proceedings of the ECAI 2004 Workshop on Ontology Learning and Population.
Craven M, Slattery S. 2001. Relational learning with statistical predicate invention: better models for hypertext. Machine Learning 43(1/2):97–119.
Cunningham H, Bontcheva K. 2005. Knowledge management and human language: crossing the chasm. Journal of Knowledge Management.
Deerwester S, Dumais S, Furnas G, Landauer T, Harshman R. 2001. Indexing by Latent Semantic Analysis.
Duda RO, Hart PE, Stork DG. 2000. Pattern Classification (2nd edn). John Wiley & Sons, Ltd.
Ehrig M, Haase P, Hefke M, Stojanovic N. 2005. Similarity for ontologies—a comprehensive framework. Proceedings of the 13th European Conference on Information Systems, May 2005.
Fayyad U, Grinstein GG, Wierse A (eds). 2001. Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann.
Fayyad U, Piatetsky-Shapiro G, Smith P, Uthurusamy R (eds). 1996. Advances in Knowledge Discovery and Data Mining. MIT Press: Cambridge, MA.
Fernández LM. 1999. Overview of methodologies for building ontologies. In Proceedings of the IJCAI-99 Workshop on Ontologies and Problem-Solving Methods (KRR5).
Fortuna B, Mladenic D, Grobelnik M. 2005a. Semi-automatic construction of topic ontology. Proceedings of the ECML/PKDD Workshop on Knowledge Discovery for Ontologies.
Fortuna B, Mladenic D, Grobelnik M. 2005b. Visualization of text document corpus. Informatica Journal 29(4):497–502.
Ghani R, Jones R, Mladenic D. 2005. Building minority language corpora by learning to generate web search queries. Knowledge and Information Systems 7:56–83.
Grobelnik M, Mladenic D. 2002. Efficient visualization of large text corpora. Proceedings of the Seventh TELRI Seminar, Dubrovnik, Croatia.
Grobelnik M, Mladenic D. 2004. Visualization of news articles. Informatica Journal 28(4).
Grobelnik M, Mladenic D. 2005. Simple classification into large topic ontology of Web documents. Journal of Computing and Information Technology—CIT 13(4):279–285.
Grobelnik M, Mladenic D. 2005a. Automated knowledge discovery in advanced knowledge management. Journal of Knowledge Management.
Hand DJ, Mannila H, Smyth P. 2001. Principles of Data Mining (Adaptive Computation and Machine Learning). MIT Press.
Hastie T, Tibshirani R, Friedman JH. 2001. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics, Springer Verlag.
Heyer G, Läuter M, Quasthoff U, Wittig T, Wolff C. 2001. Learning relations using collocations. In Proceedings of the IJCAI-2001 Workshop on Ontology Learning.
Hotho A, Staab S, Stumme G. 2003. Explaining text clustering results using semantic structures. In Proceedings of ECML/PKDD 2003, LNAI 2838, Springer Verlag, pp 217–228.
Jackson P, Moulinier I. 2002. Natural Language Processing for Online Applications: Text Retrieval, Extraction, and Categorization. John Benjamins Publishing Co.
Jakulin A, Mladenic D. 2005. Ontology grounding. Proceedings of the 8th International Multi-Conference Information Society IS-2005, Ljubljana: Institut "Jožef Stefan".
Koller D, Sahami M. 1997. Hierarchically classifying documents using very few words. Proceedings of the 14th International Conference on Machine Learning ICML-97, Morgan Kaufmann, San Francisco, CA, pp 170–178.
Leskovec J, Grobelnik M, Milic-Frayling N. 2004. Learning sub-structures of document semantic graphs for document summarization. In Workshop on Link Analysis and Group Detection (LinkKDD2004), The Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
Maedche A, Staab S. 2001. Discovering conceptual relations from text. In Proceedings of ECAI'2000, pp 321–325.
Manning CD, Schütze H. 2001. Foundations of Statistical Natural Language Processing. The MIT Press: Cambridge, MA.
McCallum A, Rosenfeld R, Mitchell T, Ng A. 1998. Improving text classification by shrinkage in a hierarchy of classes. Proceedings of the 15th International Conference on Machine Learning ICML-98, Morgan Kaufmann, San Francisco, CA.
Mitchell TM. 1997. Machine Learning. The McGraw-Hill Companies, Inc.
Mladenic D. 1998. Turning Yahoo into an automatic Web-page classifier. Proceedings of the 13th European Conference on Artificial Intelligence (ECAI'98), John Wiley & Sons, Ltd, pp 473–474.
Mladenic D, Brank J, Grobelnik M, Milic-Frayling N. 2002. Feature selection using linear classifier weights: interaction with classification models. SIGIR-2002.
Mladenic D, Grobelnik M. 2003. Feature selection on hierarchy of web documents. Journal of Decision Support Systems 35:45–87.
Mladenic D, Grobelnik M. 2004. Mapping documents onto web page ontology. In Web Mining: From Web to Semantic Web (Berendt B, Hotho A, Mladenic D, Someren MWV, Spiliopoulou M, Stumme G (eds)), Lecture Notes in Artificial Intelligence, Lecture Notes in Computer Science, Vol 3209. Springer: Berlin, Heidelberg, New York, 2004; 77–96.
Novak B. 2004a. Use of unlabeled data in supervised machine learning. Proceedings of the 7th International Multi-Conference Information Society IS-2004, Ljubljana: Institut "Jožef Stefan".
Novak B. 2004b. A survey of focused web crawling algorithms. Proceedings of the 7th International Multi-Conference Information Society IS-2004, Ljubljana: Institut "Jožef Stefan".
Popescul A, Ungar LH. 2000. Automatic labeling of document clusters. Department of Computer and Information Science, University of Pennsylvania, unpublished paper available from http://www.cis.upenn.edu/popescul/Publications/popescul00labeling.pdf.
Reinberger M-L, Spyns P. 2004. Discovering knowledge in texts for the learning of DOGMA-inspired ontologies. In Proceedings of the ECAI 2004 Workshop on Ontology Learning and Population.
Sebastiani F. 2002. Machine learning for automated text categorization. ACM Computing Surveys.
Steinbach M, Karypis G, Kumar V. 2000. A comparison of document clustering techniques. Proceedings of the KDD Workshop on Text Mining (Grobelnik M, Mladenic D, Milic-Frayling N (eds)), Boston, MA, USA, pp 109–110.
Uschold M, King M. 1995. Towards a methodology for building ontologies. In Workshop on Basic Ontological Issues in Knowledge Sharing, International Joint Conference on Artificial Intelligence, 1995. Also available as AIAI-TR-183 from AIAI, the University of Edinburgh.
van Rijsbergen CJ. 1979. Information Retrieval (2nd edn). Butterworths, London.
Witten IH, Frank E. 1999. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann.
3 SEMANTIC ANNOTATION AND HUMAN LANGUAGE TECHNOLOGY

Gartner reported in 2002 that for at least the next decade more than 95% of human-to-computer information input will involve textual language. They also report that by 2012, taxonomic and hierarchical knowledge mapping and indexing will be prevalent in almost all information-rich applications. There is a tension here: between the increasingly rich semantic models in IT systems on the one hand, and the continuing prevalence of human language materials on the other. The process of tying semantic models and natural language together is referred to as Semantic Annotation. This process may be characterised as the dynamic creation of inter-relationships between ontologies (shared conceptualisations of domains) and documents of all shapes and sizes, in a bidirectional manner covering creation, evolution, population and documentation of ontological models. Work in the Semantic Web (Berners-Lee, 1999; Davies et al., 2002; Fensel et al., 2002) (see also other chapters in this volume) has supplied a standardised, web-based suite of languages (e.g., Dean et al., 2004) and tools for the representation of ontologies and the performance of inferences over them. It is probable that these facilities will become an important part of next-generation IT applications, representing a step up from the taxonomic modelling that is now used in much leading-edge IT software. Information Extraction (IE), a form of natural language analysis, is becoming a central technology to link Semantic Web models with documents as part of the process of Metadata Extraction.
The Semantic Web aims to add a machine tractable, repurposeable layer to complement the existing web of natural language hypertext. In order to realise this vision, the creation of semantic annotation, the linking of web pages to ontologies, and the creation, evolution and interrelation of ontologies must become automatic or semi-automatic processes.
In the context of new work on distributed computation, Semantic Web Services (SWSs) go beyond current services by adding ontologies and formal knowledge to support description, discovery, negotiation, mediation and composition. This formal knowledge is often strongly related to informal materials. For example, a service for multimedia content delivery over broadband networks might incorporate conceptual indices of the content, so that a smart VCR (such as a next generation TiVO) can reason about programmes to suggest to its owner. Alternatively, a service for B2B catalogue publication has to translate between existing semi-structured catalogues and the more formal catalogues required for SWS purposes. To make these types of services cost-effective, we need automatic knowledge harvesting from all forms of content that contain natural language text or spoken data.
Other services do not have this close connection with informal content, or will be created from scratch using Semantic Web authoring tools, for example printing, compute cycle or storage services. In these cases the opposite need is present: to document services for the human reader using natural language generation.
An important aspect of the world wide web revolution is that it has been based largely on human language materials, and in making the shift to the next generation knowledge-based web, human language will remain key. Human Language Technology (HLT) involves the analysis, mining and production of natural language. HLT has matured over the last decade to a point at which robust and scaleable applications are possible in a variety of areas, and new projects like SEKT in the Semantic Web area are now poised to exploit this development.
Figure 3.1 illustrates the way in which Human Language Technology can be used to bring together the natural language upon which the current web is mainly based and the formal knowledge at the basis of the Semantic Web. Ontology-Based IE and Controlled Language IE are discussed in this chapter, whereas Natural Language Generation is covered in Chapter 8 on Knowledge Access.
The chapter is structured as follows. Section 3.2 provides an overview of Information Extraction (IE) and the problems it addresses. Section 3.3 introduces the problem of semantic annotation and shows why it is harder than the issues addressed by IE. Section 3.4 surveys some applications of IE to semantic annotation and discusses the problems faced, thus justifying the need for the so-called Ontology-Based IE approaches. Section 3.5 presents a number of these approaches, including some graphical user interfaces. Controlled Language IE (CLIE) is then presented as a high-precision alternative to information extraction from unrestricted, ambiguous text. The chapter concludes with a discussion and outlines future work.
3.2 INFORMATION EXTRACTION: A BRIEF INTRODUCTION

Information extraction (IE) is a technology based on analysing natural language in order to extract snippets of information. The process takes texts (and sometimes speech) as input and produces fixed-format, unambiguous data as output. This data may be used directly for display to users, or may be stored in a database or spreadsheet for later analysis, or may be used for indexing purposes in information retrieval (IR) applications such as internet search engines like Google.
IE is quite different from IR:
an IR system finds relevant texts and presents them to the user;
an IE application analyses texts and presents only the specific information from them that the user is interested in.
For example, a user of an IR system wanting information on trade group formations in agricultural commodities markets would enter a list of relevant words and receive in return a set of documents (e.g., newspaper articles) which contain likely matches. The user would then read the documents and extract the requisite information themselves. They might then enter the information in a spreadsheet and produce a chart for a report or presentation. In contrast, an IE system would automatically populate the spreadsheet directly with the names of relevant companies and their groupings.
Figure 3.1 HLT and the Semantic Web
There are advantages and disadvantages with IE in comparison to IR. IE systems are more difficult and knowledge-intensive to build, and are to varying degrees tied to particular domains and scenarios. IE is more computationally intensive than IR. However, in applications where there are large text volumes IE is potentially much more efficient than IR because of the possibility of dramatically reducing the amount of time people spend reading texts. Also, where results need to be presented in several languages, the fixed-format, unambiguous nature of IE results makes this relatively straightforward in comparison with providing the full translation facilities needed for interpretation of multilingual texts found by IR.
Useful overview sources for further details on IE include: Cowie and Lehnert (1996), Appelt (1999), Cunningham (2005), Gaizauskas and Wilks (1998) and Pazienza (2003).
IE typically extracts five types of information from a text:
1 Entities present in the text
2 Mentions of those entities
3 Descriptions of the entities present
4 Relations between entities
5 Events involving the entities
For example, consider the text:
‘Ryanair announced yesterday that it will make Shannon its next European base, expanding its route network to 14 in an investment worth around €180m. The airline says it will deliver 1.3 million passengers in the first year of the agreement, rising to two million by the fifth year.’
To begin with, IE will discover that ‘Shannon’ and ‘Ryanair’ are entities (of types location and company, perhaps), then, via a process of reference resolution, will discover that ‘it’ and ‘its’ in the first sentence refer to Ryanair (or are mentions of that company), and that ‘the airline’ and ‘it’ in the second sentence also refer to Ryanair. Having discovered the mentions, descriptive information can be extracted, for example that Shannon is a European base. Finally, relations are found, for example that Shannon will be a base of Ryanair, and events, for example that Ryanair will invest €180 million in Shannon.
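To make the ‘fixed format, unambiguous data’ concrete, the extracted content of the Ryanair passage might be laid out along the following lines. This is purely an illustrative structure (Python, chosen only for readability), not the MUC template notation or the output format of any particular system.

```python
# Hypothetical structured output for the Ryanair passage: entities with their
# coreferent mentions, a description, a relation and an event.
ryanair_extraction = {
    "entities": [
        {"id": "e1", "type": "Company", "name": "Ryanair",
         "mentions": ["Ryanair", "it", "its", "the airline", "it"]},
        {"id": "e2", "type": "Location", "name": "Shannon",
         "mentions": ["Shannon"]},
    ],
    "descriptions": [
        {"entity": "e2", "description": "European base of e1"},
    ],
    "relations": [
        {"type": "hasBase", "args": ["e1", "e2"]},
    ],
    "events": [
        {"type": "Investment", "investor": "e1", "location": "e2",
         "amount": "EUR 180 million",
         "passengers_first_year": "1.3 million",
         "passengers_fifth_year": "2 million"},
    ],
}
```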
These various types of IE provide progressively higher-level information about texts. They are described in more detail below; for a thorough discussion and examples see Cunningham (2005).
3.2.2 Entities
The simplest and most reliable IE technology is entity recognition, which we will abbreviate NE following the original Message Understanding Conference (MUC) definitions (SAIC, 1998). NE systems identify all the names of people, places, organisations, dates, amounts of money, etc. All things being equal, NE recognition can be performed at up to around 95 % accuracy. Given that human annotators do not perform to the 100 % level (measured by inter-annotator comparisons), NE recognition can now be said to function at human performance levels, and applications of the technology are increasing rapidly as a result.
The process is weakly domain-dependent; that is, changing the subject matter of the texts being processed from financial news to other types of news would involve some changes to the system, and changing from news to scientific papers would involve quite large changes.
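As a toy illustration of the NE task, the sketch below tags the Ryanair sentence with a tiny gazetteer and a single pattern for money amounts. The gazetteer entries and the regular expression are invented for the example; real NE systems combine much larger gazetteers, grammar rules and statistical models.

```python
# Toy named-entity tagger: gazetteer lookup plus one pattern for money amounts.
import re

GAZETTEER = {
    "Ryanair": "Organization",   # invented, tiny gazetteer for the example
    "Shannon": "Location",
}
MONEY = re.compile(r"€\s?\d+(?:\.\d+)?\s?(?:m|million|bn|billion)")

def tag_entities(text):
    annotations = []
    for name, label in GAZETTEER.items():
        for match in re.finditer(re.escape(name), text):
            annotations.append((match.start(), match.end(), label, match.group()))
    for match in MONEY.finditer(text):
        annotations.append((match.start(), match.end(), "Money", match.group()))
    return sorted(annotations)

sentence = ("Ryanair announced yesterday that it will make Shannon its next "
            "European base, expanding its route network to 14 in an investment "
            "worth around €180m.")
print(tag_entities(sentence))
# e.g. [(0, 7, 'Organization', 'Ryanair'), (46, 53, 'Location', 'Shannon'), ...]
```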
3.2.3 Mentions
Finding the mentions of entities involves using coreference resolution (CO) to identify identity relations between entities in texts. These entities are both those identified by NE recognition and anaphoric references to those entities. For example, in:
‘Alas, poor Yorick, I knew him Horatio’
coreference resolution would tie ‘Yorick’ with ‘him’ (and ‘I’ with Hamlet, if sufficient information was present in the surrounding text).
This process is less relevant to end users than other IE tasks (i.e. whereas the other tasks produce output that is of obvious utility for the application user, this task is more relevant to the needs of the application developer). For text browsing purposes, we might use CO to highlight all occurrences of the same object or provide hypertext links between them. CO technology might also be used to make links between documents. The main significance of this task, however, is as a building block for TE and ST (see below). CO enables the association of descriptive information scattered across texts with the entities to which it refers.
CO breaks down into two sub-problems: anaphoric resolution (e.g., ‘I’ with Hamlet) and proper-noun resolution. Proper-noun coreference identification finds occurrences of the same object represented with different spelling or compounding, for example ‘IBM’, ‘IBM Europe’, ‘International Business Machines Ltd’. CO resolution is an imprecise process, particularly when applied to the solution of anaphoric reference. CO results vary widely; depending on the domain, perhaps only 50–60 % may be relied upon. CO systems are domain dependent.
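A very simplified sketch of the proper-noun side of CO is shown below: two names are grouped if they normalise to the same form, if one is a prefix of the other, or if one is an acronym of the other. The suffix list and the matching rules are assumptions made up for this illustration and cover only a small part of what real CO components handle.

```python
# Toy proper-noun coreference: two names co-refer if they normalise to the same
# key, if one is a prefix of the other, or if one is the other's acronym.
SUFFIXES = {"ltd", "inc", "corp", "co", "plc"}   # invented, tiny suffix list

def normalise(name):
    tokens = [t.strip(".,") for t in name.lower().split()]
    return [t for t in tokens if t and t not in SUFFIXES]

def acronym(tokens):
    return "".join(t[0] for t in tokens)

def same_proper_noun(a, b):
    ta, tb = normalise(a), normalise(b)
    if ta == tb:
        return True
    if ta[:len(tb)] == tb or tb[:len(ta)] == ta:     # 'IBM' vs 'IBM Europe'
        return True
    # 'IBM' vs 'International Business Machines Ltd'
    return "".join(ta) == acronym(tb) or "".join(tb) == acronym(ta)

print(same_proper_noun("IBM", "International Business Machines Ltd"))  # True
print(same_proper_noun("IBM", "IBM Europe"))                           # True
```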
3.2.4 Descriptions
The description extraction task builds on NE recognition and coreference resolution, associating descriptive information with the entities. To match the original MUC definitions as before, we will abbreviate this task as ‘TE’. For example, in a news article the ‘Bush administration’ can also be referred to as ‘government officials’; the TE task discovers this automatically and adds it as an alias.
Good scores for TE systems are around 80 % (on similar tasks humans can achieve results in the mid 90s, so there is some way to go). As in NE recognition, the production of TEs is weakly domain dependent; that is, changing the subject matter of the texts being processed from financial news to other types of news would involve some changes to the system, and changing from news to scientific papers would involve quite large changes.
3.2.5 Relations
As described in Appelt (1999), ‘The template relation task requires the identification of a small number of possible relations between the template elements identified in the template element task. This might be, for example, an employee relationship between a person and a company, a family relationship between two persons, or a subsidiary relationship between two companies. Extraction of relations among entities is a central feature of almost any information extraction task, although the possibilities in real-world extraction tasks are endless.’ In general, good template relation (TR) system scores reach around 75 %. TR is a weakly domain dependent task.
3.2.6 Events
Finally, there is event extraction, which is abbreviated ST, for scenario template, the MUC style of representing information relating to events. (In some ways STs are the prototypical outputs of IE systems, being the original task for which the term was coined.) They tie together TE entities and TR relations into event descriptions. For example, TE may have identified Mr Smith and Mr Jones as person entities and a company present in a