DRAFT – The Rosetta Model: Can the Different Physical Science Data Models be Reconciled? – DRAFT
Todd A King1 (tking@igpp.ucla.edu)
Deborah L McGuinness2,3 (dlm@ksl.stanford.edu)
Raymond J Walker1 (rwalker@igpp.ucla.edu)
Peter Fox4 (pfox@ucar.edu)
D Aaron Roberts5 (aaron.roberts@nasa.gov)
Christopher Harvey6 (christopher.harvey@cesr.fr)
1 Institute of Geophysics and Planetary Physics/UCLA, 2835 Slichter Hall, Los Angeles, CA 90095-1567, United States
2 Rensselaer Polytechnic Institute
3 Stanford University, 353 Serra Hall, Stanford, CA 94305, United States
4 UCAR, 1850 Table Mesa Drive, Boulder, CO 80305, United States
5 NASA/NSSDC, Code 692 NASA Goddard Space Flight Center, Greenbelt, MD 20771
6 Centre de Données de la Physique des Plasmas (CDPP), 18 avenue Edouard Belin, TOULOUSE 31 401, France
1 Abstract
There are a variety of data models in the physical sciences, some of which are in overlapping domains. Each of the data models has been derived in a different way. Some have been based on formal ontologies, others on informal ontologies, and others on relational schemas. An additional complication is that different international agencies have divided the physical science domains into different sub-domains, leading to some confusion as to which data model to adopt. The most prevalent data models in use today are the Planetary Data System (PDS), Space Physics Archive Search and Extract (SPASE), Virtual Solar Terrestrial Observatory (VSTO), the International Virtual Observatory Alliance (IVOA), and the Global Change Master Directory (GCMD). We take a comparative look at the various data models and ask the questions: Can they be reconciled? Is it possible to have a Rosetta Model to translate between each of the models? What role can ontologies play in defining a Rosetta Model?
2 Descriptions and Metadata
There are many different information models and classification ontologies in use today. Each is designed for a particular application. Some are very general and others are tailored for a specific discipline. Some of the most widely used are:
CAA: Cluster Active Archive. Designed to support the archiving and distribution of high quality calibrated data products from ESA's Cluster mission, using an approach general enough to be applicable to other environments. It has a Mission, Observatory, Instrument hierarchy. The recovered data and metadata are adequate for API use.
480 terms (198 classes and 282 enumeration elements). Purpose: Resource Discovery, Resource Sharing, Archiving, Content Classification; Specification: Narrative and XML Schema
Dublin Core: Originally designed for information resources (documents) and has been expanded to include data, images, movies, and other types of resources.
27 terms (15 core, 12 element types). Purpose: Resource Discovery (published works); Specification: Narrative
IVOA: The International Virtual Observatory Alliance (IVOA) is a set of standards to "facilitate the international coordination" of the "utilization of astronomical archives as an integrated and interoperating virtual observatory." Standards set by the IVOA include VOTable, VOResource, and Unified Content Descriptor (UCD).
63 terms (6 categories, 57 terms) and 486 UCD terms for data classification. Purpose: Resource Discovery (data, collections, services, and curation) and Content Classification; Specification: Narrative and XML Schema
OAI-ORE: The Object Reuse and Exchange (ORE) activity of the Open Archives Initiative (OAI) is developing specifications that allow distributed repositories to exchange information about their constituent digital objects. The first release of the ORE specifications is scheduled for March 8, 2008. OAI-ORE is distinct from OAI-PMH (a protocol for exchanging metadata).
Conceptual only. Purpose: Compound Object Description
PDS3: The Planetary Data System (PDS) defines a data set nomenclature designed to be consistent across discipline boundaries, together with standards for labeling data files. Its intent is to archive planetary science data and supporting information to enable effective use and interpretation.
14,458 terms (1,643 elements, 81 objects, and 12,734 standard values, including 2,848 target names, 144 volume sets, 1,966 volumes, and 1,370 data set IDs). Purpose: Archiving; Specification: Narrative, ODL with PDS vocabulary
SPASE: The Space Physics Archive Search and Extract (SPASE) is a data model designed for the Solar and Space Physics communities to unify the data environment and to facilitate finding, retrieving, formatting, and obtaining basic information about data essential for research.
340 terms (10 resource types, 35 entities (containers), 30 enumerations, 55 attributes, and 265 items that are values used in enumerations (controlled lists)). Purpose: Resource Discovery, Resource Sharing, and Content Classification; Specification: Narrative, XML Schema, and XMI
SWEET: Semantic Web for Earth and Environmental Terminology (SWEET) provides a common semantic framework for various Earth science initiatives. There are 17 ontologies, including biosphere, human_activities, process, substance, data_center, material_thing, property, sunrealm, data, numerics, sensor, time, earthrealm, phenomena, space, and units.
3,940 terms (17 ontologies). Purpose: Reference Model; Specification: OWL
VSTO: Virtual Solar Terrestrial Observatory. Originally designed as a set of ontologies for organizing and integrating information spanning upper atmospheric terrestrial physics to solar physics. Fundamental classes include instrument, observatory, data, and services. Its upper level has been reused in other science areas, including volcanology and plate tectonics.
407 terms (one ontology with 35 top-level classes). Purpose: Resource Discovery, Resource Sharing, and Content Classification; Specification: OWL
3 The Rosetta Model
• Data Structure (Digital)
o Catalog (record collection)
o Table (row, column)
4 Cluster Active Archive
Designed to support the archiving and distribution of high quality calibrated data products from ESA's Cluster mission, using an approach general enough to be applicable to other environments. It has a Mission, Observatory, Instrument hierarchy. The recovered data and metadata are adequate for API use.
From the Cluster Metadata Dictionary (Issue: 2, Rev: 2, Date: May 4, 2006):
Metadata is information which describes a dataset. It should be complete, that is, contain all the information required to read and interpret the bits (syntactic description), and to understand what the resulting numerical values (or bit strings) represent (semantic description), including how the data was obtained; the latter information impacts upon the scientific significance of the data. The purpose of the CAA Metadata Dictionary is to describe fully the required CAA metadata information, and to explain how that information must be formatted so as to be exploitable by the generic software of the Cluster Active Archive.
There are 6 top-level CAA concepts or classes:
Mission: This level contains information relevant to the whole mission.
Observatory: The Cluster mission consists of 4 observatories: Cluster-1, Cluster-2, Cluster-3, and Cluster-4.
Experiment: The Cluster mission has 11 experiments, each identified by its Principal Investigator, plus the auxiliary data.
Instrument: The Cluster instruments are identified by Observatory and Experiment.
Dataset: Each instrument produces one or more datasets; this level of metadata is common to the whole of each dataset.
Parameter: A dataset contains one or more parameters, each of which has its own metadata.
File: Each dataset is composed of files, the number of which will grow regularly with time during CAA.
For CAA, there will be:
one block of metadata at the mission level (for the Cluster mission),
four blocks at the observatory level (Cluster-1, Cluster-2, Cluster-3, Cluster-4),
eleven blocks at the experiment level (one for each of the eleven instruments),
sixty blocks of metadata (listed on page 32) at the instrument level, plus
a further six blocks of metadata for the various auxiliary data products.
To recover all the metadata relative to any one dataset it is necessary to know the relation between these blocks of metadata. For example, when looking at the metadata associated with the CIS-1 instrument (CIS instrument on Spacecraft 1) it is necessary to know that this is associated with metadata concerning the Experiment CIS and the Observatory Spacecraft-1, and that these are associated with the Mission Cluster. Linkage between the different levels (illustrated by the arrows in Fig. 1) is provided at each level by concept keywords included specially for this purpose.
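The linkage between levels can be illustrated with a small sketch. This is not the CAA implementation: the dictionary layout and the key names `level` and `parents` are assumptions made for illustration, standing in for the concept keywords that the CAA dictionary defines. Only the block names (CIS-1, CIS, Spacecraft-1, Cluster) come from the example above.

```python
# Hypothetical in-memory layout: each metadata block names its parent block(s).
# The keys "level" and "parents" are illustrative, not actual CAA keywords.
BLOCKS = {
    "Cluster": {"level": "Mission", "parents": []},
    "Spacecraft-1": {"level": "Observatory", "parents": ["Cluster"]},
    "CIS": {"level": "Experiment", "parents": ["Cluster"]},
    "CIS-1": {"level": "Instrument", "parents": ["Spacecraft-1", "CIS"]},
}


def collect_metadata_chain(block_id, blocks):
    """Return every block reachable from block_id by following parent links."""
    seen = []
    stack = [block_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # the mission is reached twice; record it only once
        seen.append(current)
        stack.extend(blocks[current]["parents"])
    return seen


for name in collect_metadata_chain("CIS-1", BLOCKS):
    print(BLOCKS[name]["level"], name)
```

Starting from the instrument block, the traversal gathers the experiment, observatory, and mission blocks, which together hold the metadata needed to interpret a CIS-1 dataset.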
Overall Characteristics
Scope: 480 terms (198 classes, 282 enumeration elements)
Purpose: Resource Discovery, Resource Sharing, Archiving, Content Classification
Specification: XML Schema
References
[CAA] Cluster Metadata Dictionary
http://caa.estec.esa.int/documents/DataD_V22.pdf
Simple Dublin Core
The Simple Dublin Core Metadata Element Set (DCMES) consists of 15 metadata elements.
Full information on element definitions and term relationships can be found in the Dublin Core Metadata Registry [DCMR].
Qualified Dublin Core
Subsequent to the specification of the original 15 elements, an ongoing process to develop exemplary terms extending or refining the Dublin Core Metadata Element Set (DCMES) was begun. The additional terms were identified, generally in working groups of the Dublin Core Metadata Initiative, and judged by the DCMI Usage Board to be in conformance with principles of good practice for the qualification of Dublin Core metadata elements.
Element refinements make the meaning of an element narrower or more specific. A refined element shares the meaning of the unqualified element, but with a more restricted scope. The guiding principle for the qualification of Dublin Core elements, colloquially known as the Dumb-Down Principle, states that an application that does not understand a specific element refinement term should be able to ignore the qualifier and treat the metadata value as if it were an unqualified (broader) element. While this may result in some loss of specificity, the remaining element value (without the qualifier) should continue to be generally correct and useful for discovery.
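A minimal sketch of the Dumb-Down Principle follows. The four refinement pairs are DCMI terms, but the table is deliberately partial, and the function name `dumb_down`, the record layout, and the sample values are assumptions made for illustration.

```python
# The 15 elements of the Simple Dublin Core Metadata Element Set.
SIMPLE_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}

# Partial table: each refinement maps to the unqualified element it narrows.
REFINES = {
    "created": "date",
    "abstract": "description",
    "alternative": "title",
    "isPartOf": "relation",
}


def dumb_down(record):
    """Reduce (term, value) pairs to unqualified elements.

    Known refinements are replaced by their broader element; terms that
    are neither a simple element nor a known refinement are dropped.
    """
    simple = []
    for term, value in record:
        element = REFINES.get(term, term)
        if element in SIMPLE_ELEMENTS:
            simple.append((element, value))
    return simple


# Sample values are placeholders, not taken from any real record.
record = [
    ("alternative", "An Example Subtitle"),
    ("created", "2000-12-31"),
    ("creator", "A. Author"),
]
print(dumb_down(record))
```

The value "2000-12-31" loses the information that it is a creation date but remains a correct, if less specific, date for the resource.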
DCMI also maintains a small, general vocabulary recommended for use within the element Type. This vocabulary currently consists of 12 terms:
A value expressed using an encoding scheme may thus be a token selected from a controlled vocabulary (e.g., a term from a classification system or set of subject headings) or a string formatted in accordance with a formal notation (e.g., "2000-12-31" as the standard expression of a date). If an encoding scheme is not understood by an application, the value may still be useful to a human reader.
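The fallback behaviour described above can be sketched as follows. The function name `interpret` is hypothetical, and as a simplifying assumption the sketch handles only the full YYYY-MM-DD form of the W3CDTF date scheme.

```python
from datetime import date


def interpret(value, scheme=None):
    """Parse value when its encoding scheme is understood, else keep the string."""
    if scheme == "W3CDTF":
        try:
            # Assumption: only the complete YYYY-MM-DD form is handled here.
            return date.fromisoformat(value)
        except ValueError:
            return value  # still readable by a human
    # Unknown or absent scheme: pass the raw string through unchanged.
    return value


print(interpret("2000-12-31", "W3CDTF"))
print(interpret("2000-12-31", "SomeUnknownScheme"))
```

In the second call the application does not recognize the scheme, so the value is passed through as a plain string that a human reader can still use.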
Overall Characteristics
Scope: 27 terms (15 core, 12 element types) [5]
Purpose: Resource Discovery (published works)
Specification: Narrative
References
[DCMR] Dublin Core Official web site
http://dublincore.org/dcregistry/
[DCENC] Dublin Core Encoding Guidelines
http://dublincore.org/resources/expressions/
[DCXML] Guidelines for implementing Dublin Core in XML
http://dublincore.org/documents/abstract-model/
The IVOA Resource Metadata specification (VOResource) permits describing the
following attributes of a resource [VORES]:
Subject, Description, Source, ReferenceURL, Type, ContentLevel, Relationship, RelationshipID

Collection and service content metadata
Facility, Instrument, Coverage.Spatial, Coverage.RegionOfRegard, Coverage.Spectral, Coverage.Spectral.Bandpass, Coverage.Spectral.MinimumWavelength, Coverage.Spectral.MaximumWavelength, Coverage.Temporal.StartTime, Coverage.Temporal.StopTime, Coverage.Depth, Coverage.ObjectDensity, Coverage.ObjectCount, Coverage.SkyFraction, Resolution.Spatial, Resolution.Spectral, Resolution.Temporal, UCD, Format, Rights

Data quality metadata
DataQuality, ResourceValidationLevel, ResourceValidatedBy, Uncertainty.Photometric, Uncertainty.Spatial, Uncertainty.Spectral, Uncertainty.Temporal

Service metadata
Service.AccessURL, Service.InterfaceURL, Service.BaseURL, Service.HTTPResultsMIMEType, Service.StandardID, Service.MaxSearchRadius, Service.MaxReturnRecords, Service.MaxReturnSize
Unified Content Descriptors
Unified Content Descriptors (UCD) is a formal vocabulary for astronomical data that is controlled by the International Virtual Observatory Alliance (IVOA). The vocabulary is restricted in order to avoid proliferation of terms and synonyms, and controlled in order to avoid ambiguities. A UCD is used to classify a token of information. For example, it may be used to identify the type of information in a field of a table or a tagged value in a metadata description [VOUCD].
All existing UCD1+ words are grouped into 12 main categories. These categories
are expressed by the first atom of the word, whose possible values are:
8 pos (positional data)
9 spect (spectral data)
10 src (source)
11 stat (statistics)
12 time (time)
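Reading the category from the first atom can be sketched as below. The sketch assumes the UCD1+ syntax in which words are separated by semicolons and atoms by periods. The lookup table holds only the five categories legible in the list above, and the function name `ucd_categories` is hypothetical.

```python
# Only the categories legible in the list above; the table is incomplete.
CATEGORY_LABELS = {
    "pos": "positional data",
    "spect": "spectral data",
    "src": "source",
    "stat": "statistics",
    "time": "time",
}


def ucd_categories(ucd):
    """Return (word, category label or None) for each word of a UCD string."""
    result = []
    for word in ucd.split(";"):
        word = word.strip()
        first_atom = word.split(".")[0].lower()
        # None marks a category that is missing from the partial table.
        result.append((word, CATEGORY_LABELS.get(first_atom)))
    return result


print(ucd_categories("stat.error;pos.eq.ra"))
```

A word whose first atom is not in the partial table yields `None` rather than an error, so the remaining categories can be added without changing the function.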
VOTable
The VOTable format is an XML standard for the interchange of data represented as a set of tables [VOTAB]. It extends the HTML Table specification by adding metadata to describe the contents of the table. This includes the data type, units, and classification of the contents of each field in a table. The VOTable format also permits encoded binary data to be included in the table, as well as references to external streams of binary data.
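The per-field metadata can be seen in a short sketch that reads a hand-written VOTable fragment with the Python standard library. The sample document, its field names and values, and the function name `read_table` are assumptions made for illustration. The XML namespace that real VOTable files declare is omitted to keep the example short.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; real files declare a VOTable XML namespace.
SAMPLE = """<VOTABLE version="1.1">
  <RESOURCE>
    <TABLE name="sources">
      <FIELD name="RA" datatype="double" unit="deg" ucd="pos.eq.ra"/>
      <FIELD name="Dec" datatype="double" unit="deg" ucd="pos.eq.dec"/>
      <DATA>
        <TABLEDATA>
          <TR><TD>10.68</TD><TD>41.27</TD></TR>
        </TABLEDATA>
      </DATA>
    </TABLE>
  </RESOURCE>
</VOTABLE>"""


def read_table(xml_text):
    """Return (field metadata, rows) for the first TABLE element."""
    root = ET.fromstring(xml_text)
    table = root.find(".//TABLE")
    # Each FIELD carries the data type, unit, and UCD classification.
    fields = [dict(f.attrib) for f in table.findall("FIELD")]
    # Cell values are kept as strings; no type conversion is attempted.
    rows = [[td.text for td in tr.findall("TD")] for tr in table.iter("TR")]
    return fields, rows


fields, rows = read_table(SAMPLE)
for field in fields:
    print(field["name"], field["datatype"], field["unit"], field["ucd"])
print(rows)
```

Each FIELD element describes one column, so a reader can interpret the row values without any knowledge outside the file.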
Overall Characteristics
Scope: 63 terms (6 categories, 57 terms) and
486 UCD terms for data classification
Purpose: Resource Discovery (data, collections, services, and curation)
[VOASTR] Ontology of Astronomical Object Types, Version 1.0, IVOA Working Draft
2007 Feb 19
http://www.ivoa.net/Documents/WD/Semantics/AstrObjectOntology-20070219.pdf
7 OAI-ORE
The Object Reuse and Exchange (ORE) activity of the Open Archives Initiative (OAI) is developing specifications that allow distributed repositories to exchange information about their constituent digital objects. The first release of the ORE specifications is scheduled for March 8, 2008. ORE is distinct from PMH (a protocol for exchanging metadata).
Excerpts from the Object Reuse and Exchange white paper [OAIORE]
Compound information objects are aggregations of distinct information units that when combined form a logical whole. Some examples of these are a digitized book that is an aggregation of chapters, where each chapter is an aggregation of scanned pages; a music album that is the aggregation of several audio tracks; an image object that is the aggregation of a high quality master, a medium quality derivative and a low quality thumbnail; a scholarly publication that is the aggregation of text and supporting materials such as datasets, software tools, and video recordings of an experiment; and a multi-page web document with an HTML table of contents that points to multiple interlinked individual HTML pages. If we consider all information objects reusable in multiple contexts (a notable feature of networked information), then the aggregation of a specific information unit into a compound object is not due to the inherent nature of the information unit, but the result of the intention of the human author or machine agent that composed the compound object.
Research in the Semantic Web community has introduced the notion of named graphs [5], which are essentially a set of RDF assertions, forming a graph, to which a URI is assigned. The graph as a whole then can be treated as a web resource, and assertions such as metadata statements, authority, etc. can be associated with that resource. These ideas are very promising as an approach to expressing the notion of a compound object on the web. However, they remain in a research phase, and need further specification in order to become adoptable as part of an implementable interoperability specification. Our proposals described later in this document build on this notion of a named graph.
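The named graph idea can be sketched without any RDF library: a set of triples is given a URI, and further statements then use that URI as their subject. The class name `NamedGraph`, the `ex:` predicates, and all example.org URIs are placeholders made up for illustration, not part of any ORE specification.

```python
from dataclasses import dataclass, field


@dataclass
class NamedGraph:
    """A set of (subject, predicate, object) assertions identified by a URI."""
    uri: str
    triples: set = field(default_factory=set)

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))


# Placeholder URIs and predicates: a music album aggregating two tracks.
ALBUM = "http://example.org/album-1"
graph = NamedGraph("http://example.org/graphs/album-1")
graph.add(ALBUM, "ex:aggregates", ALBUM + "/track-1")
graph.add(ALBUM, "ex:aggregates", ALBUM + "/track-2")

# Because the graph has its own URI, statements can be made about it.
about_graph = {(graph.uri, "ex:assertedBy", "http://example.org/agents/curator")}


def components(named_graph, compound_uri):
    """List the units that the graph says the compound object aggregates."""
    return sorted(
        o for s, p, o in named_graph.triples
        if s == compound_uri and p == "ex:aggregates"
    )


print(components(graph, ALBUM))
```

The boundary of the compound object is given by the assertions inside the graph, while provenance is attached to the graph's own URI. This is the separation the excerpt describes.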
A core goal of OAI-ORE (Object Reuse and Exchange) is to develop standardized, interoperable, and machine-readable mechanisms to express compound object information on the web. The OAI-ORE standards will make it possible for web clients and applications to reconstruct the logical boundaries of compound objects, the relationships among their internal components, and their relationships to the other resources in the web information space. This will provide the foundation for the development of value-adding services for analysis, reuse, and re-composition of compound objects, especially in the areas of e-Science, e-Scholarship, and scholarly communication, which are the target applications of ORE.
To enable widespread adoption of the standards developed by OAI-ORE we have determined that they must be congruent with and leverage the Web Architecture. This architecture essentially consists of: