29.1.2 Characteristics of Mobile Environments
As we discussed in the previous section, the characteristics of mobile computing include high communication latency, intermittent wireless connectivity, limited battery life, and, of course, changing client location. Latency is caused by processes unique to the wireless medium, such as coding data for wireless transfer, and tracking and filtering wireless signals at the receiver. Battery life is directly related to battery size, and indirectly related to the mobile device's capabilities. Intermittent connectivity can be intentional or unintentional. Unintentional disconnections happen in areas that wireless signals cannot reach, e.g., elevator shafts or subway tunnels. Intentional disconnections occur by user intent, e.g., during an airplane takeoff, or when the mobile device is powered down. Finally, clients are expected to move, which alters the network topology and may cause their data requirements to change. All of these characteristics impact data management, and robust mobile applications must consider them in their design.³
To compensate for high latencies and unreliable connectivity, clients cache replicas of important, frequently accessed data, and work offline if necessary. Besides increasing data availability and improving response time, caching can also reduce client power consumption by eliminating the need to make energy-consuming wireless data transmissions for each data access.
On the other hand, the server may not be able to reach a client. A client may be unreachable because it is dozing (in an energy-conserving state in which many subsystems are shut down) or because it is out of range of a base station. In either case, neither client nor server can reach the other, and the architecture must be modified to compensate. Proxies for unreachable components are added to the architecture. For a client (and symmetrically for a server), the proxy can cache updates intended for the server. When a connection becomes available, the proxy automatically forwards these cached updates to their ultimate destination.
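The proxy behavior described above can be sketched as a small store-and-forward queue. This is only an illustration under stated assumptions: the class name `UpdateProxy`, the `send` callback, and the `connected` flag are hypothetical and do not come from any particular system.

```python
from collections import deque
from typing import Callable, Deque, List


class UpdateProxy:
    """Hypothetical store-and-forward proxy for an unreachable peer."""

    def __init__(self, send: Callable[[str], None]) -> None:
        self._send = send                     # delivers one update to the peer
        self._pending: Deque[str] = deque()   # updates cached while disconnected
        self.connected = False

    def submit(self, update: str) -> None:
        """Forward immediately if connected; otherwise cache the update."""
        if self.connected:
            self._send(update)
        else:
            self._pending.append(update)

    def on_connect(self) -> None:
        """A connection became available: flush cached updates in FIFO order."""
        self.connected = True
        while self._pending:
            self._send(self._pending.popleft())

    def on_disconnect(self) -> None:
        self.connected = False


delivered: List[str] = []
proxy = UpdateProxy(delivered.append)
proxy.submit("set price=10")   # cached: the peer is unreachable
proxy.submit("set stock=3")    # cached
proxy.on_connect()             # both cached updates are forwarded
proxy.submit("set price=12")   # forwarded immediately
```

Flushing in arrival order is a design choice made here for simplicity; a real proxy would also need to handle a connection that drops in the middle of a flush.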
As suggested above, mobile computing poses challenges for servers as well as clients. The latency involved in wireless communication makes scalability a problem. Because latency due to wireless communications increases the time to service each client request, the server can handle fewer clients. One way servers relieve this problem is by broadcasting data whenever possible. Broadcast takes advantage of a natural characteristic of radio communications, and is scalable because a single broadcast of a data item can satisfy all outstanding requests for it. For example, instead of sending weather information to all clients in a cell individually, a server can simply broadcast it periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
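To make the scalability argument concrete, the following sketch counts transmissions when one broadcast answers every outstanding request for an item. The `BroadcastServer` class and its methods are hypothetical names introduced for illustration; the radio medium itself is not modeled.

```python
from collections import defaultdict
from typing import Dict, List, Set


class BroadcastServer:
    """Hypothetical server that answers pending requests by broadcast."""

    def __init__(self) -> None:
        self._waiting: Dict[str, Set[str]] = defaultdict(set)  # item -> client ids
        self.transmissions = 0

    def request(self, client: str, item: str) -> None:
        """Record that `client` is waiting for `item`."""
        self._waiting[item].add(client)

    def broadcast(self, item: str) -> List[str]:
        """One transmission satisfies every client waiting for `item`."""
        self.transmissions += 1
        return sorted(self._waiting.pop(item, set()))


server = BroadcastServer()
for client in ("c1", "c2", "c3"):
    server.request(client, "weather")
satisfied = server.broadcast("weather")
unicast_cost = len(satisfied)   # transmissions if each client got its own copy
```

With three waiting clients the broadcast costs one transmission where unicast would cost three, and the gap widens as the number of clients in the cell grows.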
Client mobility also poses many data management challenges. First, servers must keep track of client locations in order to efficiently route messages to them. Second, client data should be stored in the network location that minimizes the traffic necessary to access it. Keeping data in a fixed location increases access latency if the client moves "far away" from it. Finally, as stated above, the act of moving between cells must be transparent to the client. The server must be able to gracefully divert the shipment of data from one base station to another, without the client noticing.
3 This architecture is based on the IETF proposal in IETF (1999) with comments by Corson and Macker (1999).
Client mobility also allows new applications that are location-based. For example, consider an electronic valet application that can tell a user the location of the nearest restaurant. Clearly, "nearest" is relative to the client's current position, and movement can invalidate any previously cached responses. Upon movement, the client must efficiently invalidate parts of its cache and request updated data from the database.
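One simple way to picture movement-driven invalidation is to tag each cached answer with the position at which it was computed and a distance within which it is assumed to remain valid. The sketch below follows that assumption; `LocationCache`, `valid_radius`, and the sample data are hypothetical, and real systems may use quite different validity regions.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]


@dataclass
class CachedAnswer:
    value: str
    origin: Point         # client position when the answer was computed
    valid_radius: float   # assumed: answer stays valid within this distance


class LocationCache:
    """Hypothetical client cache for location-dependent query answers."""

    def __init__(self) -> None:
        self._entries: Dict[str, CachedAnswer] = {}

    def put(self, query: str, value: str, origin: Point, valid_radius: float) -> None:
        self._entries[query] = CachedAnswer(value, origin, valid_radius)

    def get(self, query: str) -> Optional[str]:
        entry = self._entries.get(query)
        return entry.value if entry else None

    def on_move(self, position: Point) -> List[str]:
        """Evict answers computed too far from the new position; return their queries."""
        stale = [q for q, e in self._entries.items()
                 if math.dist(position, e.origin) > e.valid_radius]
        for q in stale:
            del self._entries[q]
        return stale


cache = LocationCache()
cache.put("nearest restaurant", "Luigi's", origin=(0.0, 0.0), valid_radius=1.0)
cache.put("nearest airport", "City Airport", origin=(0.0, 0.0), valid_radius=50.0)
evicted = cache.on_move((3.0, 4.0))   # the client is now 5.0 units from the origin
```

Only the restaurant answer is evicted, so the client re-requests just that query instead of discarding its whole cache.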
29.1.3 Data Management Issues
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
1. The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with a DBMS-like functionality, with additional functionality for locating mobile units and additional query and transaction management features to meet the requirements of mobile environments.
2. The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.
Hence, the distributed data management issues we discussed in Chapter 24 can also be applied to mobile databases with the following additional considerations and variations:
• Data distribution and replication: Data is unevenly distributed among the base stations and mobile units. The consistency constraints compound the problem of cache management. Caches attempt to provide the most frequently accessed and updated data to mobile units that process their own transactions and may be disconnected over long periods.
• Transaction models: Issues of fault tolerance and correctness of transactions are aggravated in the mobile environment. A mobile transaction is executed sequentially through several base stations and possibly on multiple data sets depending upon the movement of the mobile unit. Central coordination of transaction execution is lacking, particularly in scenario (2) above. Moreover, a mobile transaction is expected to be long-lived because of disconnection in mobile units. Hence, traditional ACID properties of transactions (see Chapter 19) may need to be modified and new transaction models must be defined.
• Query processing: Awareness of where data is located is important and affects the cost/benefit analysis of query processing. Query optimization is more complicated because of mobility and rapid resource changes of mobile units. The query response needs to be returned to mobile units that may be in transit or may cross cell boundaries yet must receive complete and correct query results.
• Recovery and fault tolerance: The mobile database environment must deal with site, media, transaction, and communication failures. Site failure of a mobile unit is frequent due to limited battery power. A voluntary shutdown of a mobile unit should not be treated as a failure. Transaction failures are routine during handoff when a mobile unit crosses cells. The transaction manager should be able to deal with such frequent failures.
• Mobile database design: The global name resolution problem for handling queries is compounded because of mobility and frequent shutdown. Mobile database design must consider many issues of metadata management, for example, the constant updating of location information.
• Location-based service: As clients move, location-dependent cache information may become stale. Eviction techniques are important in this case. Furthermore, frequently updating location-dependent queries, then applying these (spatial) queries in order to refresh the cache, poses a problem.
• Division of labor: Certain characteristics of the mobile environment force a change in the division of labor in query processing. In some cases, the client must function independently of the server. However, what are the consequences of allowing fully independent access to replicated data? The relationship between client responsibilities and their consequences has yet to be developed.
• Security: Mobile data is less secure than data left at a fixed location. Proper techniques for managing and authorizing access to critical data become more important in this environment. Data is also more volatile, and techniques must be able to compensate for its loss.
29.1.4 Application: Intermittently Synchronized Databases
One mobile computing scenario is becoming increasingly commonplace as people conduct their work away from their offices and homes and perform a wide range of activities and functions: all kinds of sales, particularly in pharmaceuticals, consumer goods, and industrial parts; law enforcement; insurance and financial consulting and planning; real estate or property management activities; courier and transportation services; and so on. In these applications, a server or a group of servers manages the central database, and the clients carry laptops or palmtops with resident DBMS software to do "local" transaction activity most of the time. The clients connect via a network or a dial-up connection (or possibly even through the Internet) with the server, typically for a short session (say, 30 to 60 minutes). They send their updates to the server, and the server must in turn enter them in its central database, which must maintain up-to-date data and prepare appropriate copies for all clients on the system. Thus, whenever clients connect (through a process known in the industry as synchronization of a client with a server), they receive a batch of updates to be installed on their local database. The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them. This environment has problems similar to those in distributed and client-server databases, and some from mobile databases, but presents several additional research problems for investigation. We refer to this environment as Intermittently Synchronized Database Environment (ISDBE), and the corresponding databases as Intermittently Synchronized Databases (ISDBs).
Together, the following characteristics of ISDBs make them distinct from the mobile databases we have discussed thus far:
1. A client connects to the server when it wants to exchange updates. This communication may be unicast (one-on-one communication between the server and the client) or multicast (one sender or server may periodically communicate to a set of receivers or update a group of clients).
2. A server cannot connect to a client at will.
3. Issues of wireless versus wired client connections and power conservation are generally immaterial.
4. A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.
5. A client has multiple ways of connecting to a server and, in case of many servers, may choose a particular server to connect to based on proximity, communication nodes available, resources available, etc.
Because of such differences, there is a need to address a number of problems related to ISDBs that are different from those typically involving mobile database systems. These include the design of server databases, consistency and synchronization management among client and server databases, transaction and update processing, efficient use of the server bandwidth, and achieving scalability in ISDB environments.
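The client-initiated synchronization session described in this section can be sketched as follows. This is a deliberately simplified illustration: the `SyncServer` class, its update log, and the per-client log position are assumptions introduced here, and conflict resolution between concurrent updates to the same key is ignored (the last upload simply wins at the server).

```python
from typing import Dict, List, Tuple

Update = Tuple[str, str, str]  # (originating client, key, value)


class SyncServer:
    """Hypothetical ISDB server: a central table plus an ordered update log."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}
        self._log: List[Update] = []
        self._last_seen: Dict[str, int] = {}  # client id -> log position

    def synchronize(self, client: str, uploads: Dict[str, str]) -> Dict[str, str]:
        """One client-initiated session.

        Returns the batch of updates made by other clients since this
        client's previous session, then enters the client's own uploads
        in the central database.
        """
        start = self._last_seen.get(client, 0)
        batch = {key: value
                 for origin, key, value in self._log[start:]
                 if origin != client}
        for key, value in uploads.items():
            self.data[key] = value
            self._log.append((client, key, value))
        self._last_seen[client] = len(self._log)
        return batch


server = SyncServer()
first = server.synchronize("alice", {"order:1": "shipped"})
batch_bob = server.synchronize("bob", {"order:2": "new"})
batch_alice = server.synchronize("alice", {})
```

Note that the server never contacts a client: every exchange happens inside a call the client makes, which mirrors the characteristic that a server cannot connect to a client at will.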
29.1.5 Selected Bibliography for Mobile Databases
There has been a sudden surge of interest in mobile computing, and research on mobile databases has grown significantly over the last five to six years. The June 1995 issue of Byte magazine discusses many aspects of mobile computing. Among books written on this topic, Dhawan (1997) is an excellent source on mobile computing. Wireless networks and their future are discussed in Holtzman and Goodman (1993). Imielinski and Badrinath (1994) provide a good survey of mobile database issues and also discuss in Imielinski and Badrinath (1992) data and metadata allocation in a mobile architecture. Dunham and Helal (1995) discuss problems of query processing, data distribution, and transaction management for mobile databases. Foreman and Zahorjan (1994) describe the capabilities and the problems of mobile computing and make a convincing argument in its favor as a viable solution for many information system applications of the future. Pitoura and Samaras (1998) describe all aspects of mobile database problems and solutions. Chrysanthis (1993) describes a transaction model that is designed to operate in an environment with mobile clients. In particular, this model allows a client to share the transaction processing load with proxies in order to facilitate mobility. Bertino et al. (1998) discuss approaches to fault tolerance and recovery in mobile databases. Acharya et al. (1995) consider broadcast schedules that minimize average query latency, and explore the impact of such schedules on optimal client caching strategies. Milojicic et al. (2002) present a tutorial on peer-to-peer computing. Corson and Macker (1999) is a response to the IETF (1999) report that discusses mobile ad hoc networking protocol performance issues. Broadcasting (or pushing) data as a means of scalably disseminating information to clients is covered in Yee et al. (2002). Chintalapati et al. (1997) provide an adaptive location management algorithm. Jensen et al. (2001) discuss data management issues as they pertain to location-based services. Wolfson (2001) describes a novel way of efficiently modeling object mobility by describing position using trajectories instead of points. For an initial discussion of the ISDB scalability issues and an approach by aggregation of data and grouping of clients, see Mahajan et al. (1998). Specific aggregation algorithms for grouping data at the server in ISDB applications are described in Yee et al. (2001). Gray et al. (1993) discuss ISDB update conflicts and resolution techniques under various ISDB architectures. Breitbart et al. (1999) go into further detail about deferred synchronization algorithms for replicated data.
29.2 Multimedia Databases
In the years ahead multimedia information systems are expected to dominate our daily lives. Our houses will be wired for bandwidth to handle interactive multimedia applications. Our high-definition TV/computer workstations will have access to a large number of databases, including digital libraries, image and video databases that will distribute vast amounts of multisource multimedia content.
29.2.1 The Nature of Multimedia Data and Applications
In Section 24.3 we discussed the advanced modeling issues related to multimedia data. We also examined the processing of multiple types of data in Chapter 22 in the context of object-relational DBMSs (ORDBMSs). DBMSs have been constantly adding to the types of data they support. Today the following types of multimedia data are available in current systems:
• Text: May be formatted or unformatted. For ease of parsing structured documents, standards like SGML and variations such as HTML are being used.
• Graphics: Examples include drawings and illustrations that are encoded using some descriptive standards (e.g., CGM, PICT, PostScript).
• Images: Includes drawings, photographs, and so forth, encoded in standard formats such as bitmap, JPEG, and MPEG. Compression is built into JPEG and MPEG. These images are not subdivided into components. Hence querying them by content (e.g., find all images containing circles) is nontrivial.
• Animations: Temporal sequences of image or graphic data.
• Video: A set of temporally sequenced photographic data for presentation at specified rates, for example, 30 frames per second.
• Structured audio: A sequence of audio components comprising note, tone, duration, and so forth.
• Audio: Sample data generated from aural recordings in a string of bits in digitized form. Analog recordings are typically converted into digital form before storage.
• Composite or mixed multimedia data: A combination of multimedia data types such as audio and video which may be physically mixed to yield a new storage format or logically mixed while retaining original types and formats. Composite data also contains additional control information describing how the information should be rendered.
Nature of Multimedia Applications. Multimedia data may be stored, delivered, and utilized in many different ways. Applications may be categorized based on their data management characteristics as follows:
• Repository applications: A large amount of multimedia data as well as metadata is stored for retrieval purposes. A central repository containing multimedia data may be maintained by a DBMS and may be organized into a hierarchy of storage levels: local disks, tertiary disks and tapes, optical disks, and so on. Examples include repositories of satellite images, engineering drawings and designs, space photographs, and radiology scanned pictures.
• Presentation applications: A large number of applications involve delivery of multimedia data subject to temporal constraints. Audio and video data are delivered this way; in these applications optimal viewing or listening conditions require the DBMS to deliver data at certain rates offering "quality of service" above a certain threshold. Data is consumed as it is delivered, unlike in repository applications, where it may be processed later (e.g., multimedia electronic mail). Simple multimedia viewing of video data, for example, requires a system to simulate VCR-like functionality. Complex and interactive multimedia presentations involve orchestration directions to control the retrieval order of components in a series or in parallel. Interactive environments must support capabilities such as real-time editing, analysis, or annotating of video and audio data.
• Collaborative work using multimedia information: This is a new category of applications in which engineers may execute a complex design task by merging drawings, fitting subjects to design constraints, and generating new documentation, change notifications, and so forth. Intelligent healthcare networks as well as telemedicine will involve doctors collaborating among themselves, analyzing multimedia patient data and information in real time as it is generated.
All of these application areas present major challenges for the design of multimedia database systems.
29.2.2 Data Management Issues
Multimedia applications dealing with thousands of images, documents, audio and video segments, and free text data depend critically on appropriate modeling of the structure and content of data and then designing appropriate database schemas for storing and retrieving multimedia information. Multimedia information systems are very complex and embrace a large set of issues, including the following:
• Modeling: This area has the potential for applying database versus information retrieval techniques to the problem. There are problems of dealing with complex objects (see Chapter 20) made up of a wide range of types of data: numeric, text, graphic (computer-generated image), animated graphic image, audio stream, and video sequence. Documents constitute a specialized area and deserve special consideration.
• Design: The conceptual, logical, and physical design of multimedia databases has not been addressed fully, and it remains an area of active research. The design process can be based on the general methodology described in Chapter 12, but the performance and tuning issues at each level are far more complex.
• Storage: Storage of multimedia data on standard disklike devices presents problems of representation, compression, mapping to device hierarchies, archiving, and buffering during the input/output operation. Adhering to standards such as JPEG or MPEG is one way most vendors of multimedia products are likely to deal with this issue. In DBMSs, a "BLOB" (Binary Large Object) facility allows untyped bitmaps to be stored and retrieved. Standardized software will be required to deal with synchronization and compression/decompression, and will be coupled with indexing problems, which are still in the research domain.
• Queries and retrieval: The "database" way of retrieving information is based on query languages and internal index structures. The "information retrieval" way relies strictly on keywords or predefined index terms. For images, video data, and audio data, this opens up many issues, among them efficient query formulation, query execution, and optimization. The standard optimization techniques we discussed in Chapter 16 need to be modified to work with multimedia data types.
• Performance: For multimedia applications involving only documents and text, performance constraints are subjectively determined by the user. For applications involving video playback or audio-video synchronization, physical limitations dominate. For instance, video must be delivered at a steady rate of 30 frames per second. Techniques for query optimization may compute expected response time before evaluating the query. The use of parallel processing of data may alleviate some problems, but such efforts are currently subject to further experimentation.
Such issues have given rise to a variety of open research problems. We look at a few representative problems now.
Information Retrieval Perspective in Querying Multimedia Databases. Modeling data content has not been an issue in database models and systems because the data has a rigid structure and the meaning of a data instance can be inferred from the schema. In contrast, information retrieval (IR) is mainly concerned with modeling the content of text documents (through the use of keywords, phrasal indexes, semantic networks, word frequencies, soundex encoding, and so on) for which structure is generally neglected. By modeling content, the system can determine whether a document is relevant to a query by examining the content descriptors of the document. Consider, for instance, an insurance company's accident claim report as a multimedia object: it includes images of the accident, structured insurance forms, audio recordings of the parties involved in the accident, the text report of the insurance company's representative, and other information. Which data model should be used to represent multimedia information such as this? How should queries be formulated against this data? Efficient execution thus becomes a complex issue, and the semantic heterogeneity and representational complexity of multimedia information give rise to many new problems.
Requirements of Multimedia/Hypermedia Data Modeling and Retrieval. To capture the full expressive power of multimedia data modeling, the system should have a general construct that lets the user specify links between any two arbitrary nodes. Hypermedia links, or hyperlinks, have a number of different characteristics:
• Links can be specified with or without associated information, and they may have large descriptions associated with them.
• Links can start from a specific point within a node or from the whole node.
• Links can be directional, or nondirectional when they can be traversed in either direction.
The link capability of the data model should take into account all of these variations. When content-based retrieval of multimedia data is needed, the query mechanism should have access to the links and the link-associated information. The system should provide facilities for defining views over all links, private and public. Valuable contextual information can be obtained from the structural information. Automatically generated hypermedia links do not reveal anything new about the two nodes and, in contrast to manually generated hypermedia links, would have different significance. Facilities for creating and utilizing such links, as well as developing and using navigational query languages to utilize the links, are important features of any system permitting effective use of multimedia information. This area is important to interlinked databases on the WWW.
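The link variations listed above can be captured in a single record type. The sketch below is one possible encoding, not a standard model: the `Link` fields (`source_anchor`, `directional`, `public`, `generated`) and the `reachable_from` helper are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Link:
    source: str
    target: str
    source_anchor: Optional[str] = None  # None: the link starts from the whole node
    description: str = ""                # associated information; may be empty
    directional: bool = True             # False: traversable in either direction
    public: bool = True                  # used to define private/public views
    generated: bool = False              # True if created automatically


def reachable_from(node: str, links: List[Link]) -> List[str]:
    """Nodes reachable from `node` in one traversal step."""
    result = []
    for link in links:
        if link.source == node:
            result.append(link.target)
        elif not link.directional and link.target == node:
            result.append(link.source)
    return result


links = [
    Link("claim", "photo", description="image of the accident"),
    Link("report", "claim", directional=False),
    Link("photo", "claim", source_anchor="region-1"),
]
neighbors = reachable_from("claim", links)
public_view = [link for link in links if link.public]
```

Keeping `generated` as an explicit attribute lets a query treat automatically and manually created links differently, in line with the observation that the two kinds carry different significance.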
The World Wide Web presents an opportunity to access a vast amount of information via an array of unstructured and structured databases that are interlinked. The phenomenal success and growth of the web has made the problem of finding, accessing, and maintaining this information extremely challenging. For the last few years several projects have been attempting to define frameworks and languages that will allow us to define the semantic content of the web in a machine-processable form. The effort is collectively known by the term semantic web. The RDF (Resource Description Framework), XHTML (Extensible Hypertext Markup Language), DAML (DARPA Agent Markup Language), and OIL (Ontology Inference Layer) are among its major components.⁴ Further details are outside the scope of our discussion.
4 See Fensel (2000) for an overview of these terms.
Indexing of Images. There are two approaches to indexing images: (1) identifying objects automatically through image-processing techniques, and (2) assigning index terms and phrases through manual indexing. An important problem in using image-processing techniques to index pictures relates to scalability. The current state of the art allows the indexing of only simple patterns in images. Complexity increases with the number of recognizable features. Another important problem relates to the complexity of the query. Rules and inference mechanisms can be used to derive higher-level facts from simple features of images. Similarly, abstraction can be used to capture concepts that cannot simply be defined in terms of a set of <attribute, value> pairs. This allows high-level queries like "find hotel buildings that have open foyers and allow maximum sunshine in the front desk area" in an architectural application.
The information-retrieval approach to image indexing is based on one of three indexing schemes:
1. Classificatory systems: Classify images hierarchically into predetermined categories. In this approach, the indexer and the user should have a good knowledge of the available categories. Finer details of a complex image and relationships among objects in an image cannot be captured.
2. Keyword-based systems: Use an indexing vocabulary similar to that used in the indexing of textual documents. Simple facts represented in the image (like "ice-capped region") and facts derived as a result of high-level interpretation by humans (like permanent ice, recent snowfall, and polar ice) can be captured.
3. Entity-attribute-relationship systems: All objects in the picture and the relationships between objects and the attributes of the objects are identified.
In the case of text documents, an indexer can choose the keywords from the pool of words available in the document to be indexed. This is not possible in the case of visual and video data.
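A keyword-based scheme of the kind listed above can be pictured as an inverted index from manually assigned terms to image identifiers. The following is a minimal sketch under that assumption; `KeywordImageIndex`, the image identifiers, and the choice of conjunctive (AND) matching are illustrative, not prescribed by the text.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


class KeywordImageIndex:
    """Hypothetical inverted index: keyword -> identifiers of images tagged with it."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[str]] = defaultdict(set)

    def add(self, image_id: str, keywords: Iterable[str]) -> None:
        """Record the index terms an indexer assigned to one image."""
        for keyword in keywords:
            self._postings[keyword.lower()].add(image_id)

    def search(self, *keywords: str) -> Set[str]:
        """Return images tagged with every given keyword (conjunctive query)."""
        sets = [self._postings.get(keyword.lower(), set()) for keyword in keywords]
        return set.intersection(*sets) if sets else set()


index = KeywordImageIndex()
index.add("img1", ["ice-capped region", "permanent ice"])
index.add("img2", ["ice-capped region", "recent snowfall"])
both = index.search("ice-capped region")
snowy = index.search("Ice-capped region", "recent snowfall")
missing = index.search("polar ice")
```

Because the terms come from a human indexer rather than from the image itself, retrieval quality depends entirely on the indexing vocabulary, which is the limitation the paragraph above points out.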
Problems in Text Retrieval. Text retrieval has always been the key feature in business applications and library systems, and although much work has gone into these problems, there remains an ongoing need for improvement, especially regarding the following issues:
• Phrase indexing: Substantial improvements can be realized if phrase descriptors (as opposed to single-word index terms) are assigned to documents and used in queries, provided that these descriptors are good indicators of document content and information need.
• Use of thesaurus: One reason for the poor recall of current systems is that the vocabulary of the user differs from the vocabulary used to index the documents. One solution is to use a thesaurus to expand the user's query with related terms. The problem then becomes one of finding a thesaurus for the domain of interest. Another resource in this context is ontologies. An ontology necessarily entails or embodies some sort of world view with respect to a given domain. The world view is often conceived as a set of concepts (e.g., entities, attributes, processes), their definitions, and their interrelationships, which describe a target world. An ontology can be constructed in two ways, domain dependent and generic. The purpose of generic ontologies is to make a general framework for all (or most) categories encountered by human existence. A variety of domain ontologies such as gene ontology (see Section 29.4) or ontology for electronic components have been constructed.⁵
• Resolving ambiguity: One of the reasons for low precision (the ratio of the number of relevant items retrieved to the total number of retrieved items) in text information retrieval systems is that words have multiple meanings. One way to resolve ambiguity is to use an online dictionary or ontology; another is to compare the contexts in which the two words occur.
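Thesaurus-based query expansion, as described in the list above, can be sketched in a few lines. The thesaurus contents and the function name `expand_query` are made up for illustration; a real thesaurus would be domain specific.

```python
from typing import Dict, List, Set


def expand_query(terms: List[str], thesaurus: Dict[str, Set[str]]) -> Set[str]:
    """Add the related terms of each query term (hypothetical helper)."""
    expanded: Set[str] = set()
    for term in terms:
        key = term.lower()
        expanded.add(key)
        expanded |= thesaurus.get(key, set())  # terms without an entry stay as-is
    return expanded


# Illustrative thesaurus: each entry maps a term to related terms.
thesaurus = {"car": {"automobile", "vehicle"}}
expanded = expand_query(["Car", "insurance"], thesaurus)
```

Expansion tends to raise recall because documents indexed under "automobile" now match a query for "car", but it can lower precision when a related term has other meanings.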
In the first three decades of DBMS development (roughly from 1965 to 1995) the primary focus had been on the management of mostly numeric business and industrial data. In the next few decades, nonnumeric textual information will probably dominate database content. The text retrieval problem is becoming very relevant in the context of HTML and XML documents. The web currently contains several billion of these pages. Search engines find relevant documents given lists of words, which is a case of free-form natural language query. Obtaining the correct result that meets the requirements of both precision (percentage of retrieved documents that are relevant) and recall (percentage of total relevant documents that are retrieved), which are standard metrics in information retrieval, remains a challenge. As a consequence, a variety of functionalities involving comparison, conceptualization, understanding, indexing, and summarization of documents will be added to DBMSs. Multimedia information systems promise to bring about a joining of disciplines that have historically been separate areas: information retrieval and database management.
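The two metrics defined above are straightforward to compute once the retrieved and relevant sets are known. The sketch below returns them as fractions between 0 and 1 rather than percentages; the document identifiers are invented for the example.

```python
from typing import Set, Tuple


def precision_recall(retrieved: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) as fractions; an empty set yields 0.0."""
    hits = len(retrieved & relevant)            # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents retrieved, three relevant overall, two in common.
precision, recall = precision_recall({"d1", "d2", "d3", "d4"}, {"d2", "d4", "d5"})
```

Here precision is 2/4 and recall is 2/3: retrieving more documents could raise recall by finding d5, but would likely lower precision, which is why meeting both requirements at once is hard.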
29.2.4 Multimedia Database Applications
Large-scale applications of multimedia databases can be expected to encompass a large number of disciplines and enhance existing capabilities. Some important applications are the following:
• Documents and records management: A large number of industries and businesses keep very detailed records and a variety of documents. The data may include engineering design and manufacturing data, medical records of patients, publishing material, and insurance claim records.
• Knowledge dissemination: The multimedia mode, a very effective means of knowledge dissemination, will encompass a phenomenal growth in electronic books, catalogs, manuals, encyclopedias, and repositories of information on many topics.
• Education and training: Teaching materials for different audiences, from kindergarten students to equipment operators to professionals, can be designed from multimedia sources. Digital libraries are expected to have a major influence on the way future students and researchers as well as other users will access vast repositories of educational material.
• Marketing, advertising, retailing, entertainment, and travel: There are virtually no limits to using multimedia information in these applications, from effective sales presentations to virtual tours of cities and art galleries. The film industry has already shown the power of special effects in creating animations and synthetically designed animals and aliens. The use of predesigned stored objects in multimedia databases will expand the range of these applications.
• Real-time control and monitoring: Coupled with active database technology (see Chapter 24), multimedia presentation of information can be a very effective means for monitoring and controlling complex tasks such as manufacturing operations, nuclear power plants, patients in intensive care units, and transportation systems.
5 A good discussion of ontologies is given in Uschold and Gruninger (1996).
Commercial Systems for Multimedia Information Management. There are no DBMSs designed for the sole purpose of multimedia data management, and therefore there are none that have the range of functionality required to fully support all of the multimedia information management applications that we discussed above. However, several DBMSs today support multimedia data types; these include Informix Dynamic Server, DB2 Universal Database (UDB) of IBM, Oracle 9 and 10, CA-JASMINE, Sybase, and ODB II. All of these DBMSs have support for objects, which is essential for modeling a variety of complex multimedia objects. One major problem with these systems is that the "blades, cartridges, and extenders" for handling multimedia data are designed in a very ad hoc manner. The functionality is provided without much apparent attention to scalability and performance. There are products available that operate either stand-alone or in conjunction with other vendors' systems to allow retrieval of image data by content. They include Virage, Excalibur, and IBM's QBIC. Operations on multimedia need to be standardized. The MPEG-7 and other standards are addressing some of these issues.
29.2.5 Selected Bibliography on Multimedia Databases
Multimedia database management is becoming a very heavily researched area with several industrial projects under way. Grosky (1994, 1997) provides two excellent tutorials on the topic. Pazandak and Srivastava (1995) provide an evaluation of database systems related to the requirements of multimedia databases. Grosky et al. (1997) contains contributed articles including a survey on content-based indexing and retrieval by Jagadish (1997). Faloutsos et al. (1994) also discuss a system for image querying by content. Li et al. (1998) introduce image modeling in which an image is viewed as a hierarchical structured complex object with both semantics and visual properties. Nwosu et al. (1996) and Subramanian and Jajodia (1997) have written books on the topic.
Lassila (1998) discusses the need for metadata for accessing multimedia information on the web; the semantic web effort is summarized in Fensel (2000). Khan (2000) did a dissertation on ontology-based information retrieval. Uschold and Gruninger (1996) is a good resource on ontologies. Corcho et al. (2003) compare ontology languages and discuss methodologies to build ontologies. Multimedia content analysis, indexing, and filtering are discussed in Dimitrova (1999). A survey of content-based multimedia retrieval is provided by Yoshitaka and Ichikawa (1999). The following WWW references may be consulted for additional information:
CA-JASMINE (Multimedia ODBMS): http://www.cai.com/products/iasmine.htm
Excalibur technologies: http://www.excalib.com
Virage, Inc. (Content based image retrieval): http://www.virage.com
IBM's QBIC (Query by Image Content) product:
29.3 Geographic Information Systems
Geographic information systems (GIS) are used to collect, model, store, and analyze information describing physical properties of the geographical world. The scope of GIS broadly encompasses two types of data: (1) spatial data, originating from maps, digital images, administrative and political boundaries, roads, and transportation networks, as well as physical data such as rivers, soil characteristics, climatic regions, and land elevations; and (2) nonspatial data, such as socio-economic data (like census counts), economic data, and sales or marketing information. GIS is a rapidly developing domain that offers highly innovative approaches to meet some challenging technical demands.
29.3.1 GIS Applications
It is possible to divide GISs into three categories: (1) cartographic applications, (2) digital terrain modeling applications, and (3) geographic objects applications. Figure 29.3 summarizes these categories.
In cartographic and terrain modeling applications, variations in spatial attributes are captured, for example, soil characteristics, crop density, and air quality. In geographic objects applications, objects of interest are identified from a physical domain, for example, power plants, electoral districts, property parcels, product distribution districts, and city landmarks. These objects are related to pertinent application data, which may be, for this specific example, power consumption, voting patterns, property sales volumes, product sales volume, and traffic density.
The first two categories of GIS applications require a field-based representation, whereas the third category requires an object-based one. The cartographic approach involves special functions that can include the overlapping of layers of maps to combine attribute data that will allow, for example, the measuring of distances in three-dimensional space and the reclassification of data on the map. Digital terrain modeling requires a digital representation of parts of the earth's surface using land elevations at sample points that are connected to yield a surface model such as a three-dimensional net (connected lines in 3D) showing the surface terrain. It requires functions of interpolation between observed points as well as visualization. In object-based geographic applications, additional spatial functions are needed to deal with data related to roads, physical pipelines, communication cables, power lines, and such. For example, for a given region, comparable maps can be used for comparison at various points of time to show changes in certain data such as locations of roads, cables, buildings, and streams.
FIGURE 29.3 A possible classification of GIS applications (Adapted from Adam and Gangopadhyay (1997))
Digital Terrain Modeling Applications: Earth science resource studies; Civil engineering and military evaluation; Soil surveys; Air and water pollution studies; Flood control; Water resource management
Geographic Objects Applications: Car navigation systems; Geographic market analysis; Utility distribution and consumption; Consumer product and services economic analysis
29.3.2 Data Management Requirements of GIS
The functional requirements of the GIS applications above translate into the following
database requirements.
Data Modeling and Representation. GIS data can be broadly represented in two formats: (1) vector and (2) raster. Vector data represents geometric objects such as points, lines, and polygons. Thus a lake may be represented as a polygon, a river by a series of line segments. Raster data is characterized as an array of points, where each point represents the value of an attribute for a real-world location. Informally, raster images are n-dimensional arrays where each entry is a unit of the image and represents an attribute. Two-dimensional units are called pixels, while three-dimensional units are called voxels. Three-dimensional elevation data is stored in a raster-based digital elevation model (DEM) format. Another format, called triangular irregular network (TIN), is a topological vector-based approach that models surfaces by connecting sample points as vertices of triangles and has a point density that may vary with the roughness of the terrain. Rectangular grids (or elevation matrices) are two-dimensional array structures. In digital terrain modeling (DTM), the model also may be used by substituting the elevation with some attribute of interest such as population density or air temperature. GIS data often includes a temporal structure in addition to a spatial structure. For example, traffic flow or average vehicular speeds in traffic may be measured every 60 seconds at a set of points in a roadway network.
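As a rough illustration of the two formats, the following Python sketch stores a lake as a vector polygon and a small elevation grid as a raster. The class names, the field layout, and the sample coordinates are assumptions made for this example; they do not come from any particular GIS product.

```python
from dataclasses import dataclass

Point = tuple[float, float]  # (x, y) in world coordinates


@dataclass
class Polygon:
    """Vector representation: an ordered ring of (x, y) vertices."""
    vertices: list[Point]

    def area(self) -> float:
        # Shoelace formula over consecutive vertex pairs.
        total = 0.0
        n = len(self.vertices)
        for i in range(n):
            x1, y1 = self.vertices[i]
            x2, y2 = self.vertices[(i + 1) % n]
            total += x1 * y2 - x2 * y1
        return abs(total) / 2.0


@dataclass
class Raster:
    """Raster representation: a grid of attribute values.

    Row 0 is assumed to be the southernmost row, column 0 the westernmost.
    """
    origin: Point          # world coordinates of the lower-left corner
    cell_size: float       # edge length of one square cell
    cells: list[list[float]]

    def value_at(self, x: float, y: float) -> float:
        # Map a world location to the cell that contains it.
        col = int((x - self.origin[0]) // self.cell_size)
        row = int((y - self.origin[1]) // self.cell_size)
        if not (0 <= row < len(self.cells) and 0 <= col < len(self.cells[0])):
            raise ValueError("location lies outside the raster")
        return self.cells[row][col]


# Hypothetical sample data: a rectangular lake and a 2 x 2 elevation grid.
lake = Polygon([(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)])
dem = Raster(origin=(0.0, 0.0), cell_size=10.0,
             cells=[[100.0, 102.0], [105.0, 110.0]])
```

The vector object carries its geometry explicitly, so measures such as area follow from the vertices, whereas the raster answers the question "what is the attribute value at this location" by simple array indexing.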
Data Analysis. GIS data undergoes various types of analysis. For example, in applications such as soil erosion studies, environmental impact studies, or hydrological runoff simulations, DTM data may undergo various types of geomorphometric analysis, that is, measurements such as slope values, gradients (the rate of change in altitude), aspect (the compass direction of the gradient), profile convexity (the rate of change of gradient), plan convexity (the convexity of contours), and other parameters. When GIS data is used for decision support applications, it may undergo aggregation and expansion operations using data warehousing, as we discussed in Section 28.3. In addition, geometric operations (to compute distances, areas, volumes), topological operations (to compute overlaps, intersections, shortest paths), and temporal operations (to compute interval-based or event-based queries) are involved. Analysis involves a number of temporal and spatial operations, which were discussed in Chapter 24.
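To make the slope and aspect measurements concrete, here is a minimal Python sketch that estimates both at an interior cell of an elevation grid using central differences. The function name, the grid orientation (rows increase northward, columns eastward), and the choice to report aspect as the compass bearing of steepest descent are assumptions of this example; other conventions exist.

```python
import math


def slope_and_aspect(dem, row, col, cell_size):
    """Return (slope_degrees, aspect_degrees) at an interior cell.

    dem is a list of rows of elevations; rows are assumed to increase
    northward and columns eastward. Aspect is the compass bearing
    (degrees clockwise from north) of the downhill direction, or None
    when the cell is flat.
    """
    # Central-difference estimates of the elevation gradient.
    dz_dx = (dem[row][col + 1] - dem[row][col - 1]) / (2.0 * cell_size)
    dz_dy = (dem[row + 1][col] - dem[row - 1][col]) / (2.0 * cell_size)

    slope = math.degrees(math.atan(math.hypot(dz_dx, dz_dy)))
    if dz_dx == 0 and dz_dy == 0:
        return slope, None

    # Downhill vector is (-dz_dx, -dz_dy); bearing = atan2(east, north).
    aspect = math.degrees(math.atan2(-dz_dx, -dz_dy)) % 360.0
    return slope, aspect


# Hypothetical terrain that rises steadily toward the east.
east_rising = [[0.0, 10.0, 20.0],
               [0.0, 10.0, 20.0],
               [0.0, 10.0, 20.0]]
slope, aspect = slope_and_aspect(east_rising, 1, 1, cell_size=10.0)
```

For this sample grid the surface climbs 10 units per 10-unit cell, so the slope is 45 degrees and the downhill direction faces west.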
Data Integration. GISs must integrate both vector and raster data from a variety of sources. Sometimes edges and regions are inferred from a raster image to form a vector model, or conversely, raster images such as aerial photographs are used to update vector models. Several coordinate systems such as Universal Transverse Mercator (UTM), latitude/longitude, and local cadastral systems are used to identify locations. Data originating from different coordinate systems requires appropriate transformations. Major public sources of geographic data, including the TIGER files maintained by the U.S. Department of Commerce, are used for road maps by many Web-based map drawing tools (e.g., http://maps.yahoo.com). Often there are high-accuracy, attribute-poor maps that have to be merged with low-accuracy, attribute-rich maps. This is done with a process called "rubber-banding," where the user defines a set of control points in both maps and the transformation of the low-accuracy map is accomplished by lining up the control points. A major integration issue is to create and maintain attribute information (such as air quality or traffic flow), which can be related to and integrated with appropriate geographical information over time as both evolve.
Data Capture. The first step in developing a spatial database for cartographic modeling is to capture the two-dimensional or three-dimensional geographical information in digital form, a process that is sometimes impeded by source map characteristics such as resolution, type of projection, map scales, cartographic licensing, diversity of measurement techniques, and coordinate system differences. Spatial data can also be captured from remote sensors in satellites such as Landsat, NOAA, and Advanced Very High Resolution Radiometer (AVHRR) as well as SPOT HRV (High Resolution Visible Range Instrument), which is free of interpretive bias and very accurate. For digital terrain modeling, data capture methods range from manual to fully automated. Ground surveys are the traditional approach and the most accurate, but they are very time consuming. Other techniques include photogrammetric sampling and digitizing cartographic documents.
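The control-point idea behind the rubber-banding process described under Data Integration can be sketched in a few lines of Python. This example makes a strong simplifying assumption: it fits one global affine transformation from exactly three control-point pairs, whereas practical rubber-banding usually applies local, piecewise adjustments. The function names are hypothetical.

```python
def _det3(a):
    """Determinant of a 3 x 3 matrix given as a list of rows."""
    return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
            - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
            + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))


def _solve3(m, v):
    """Solve the 3 x 3 linear system m * s = v by Cramer's rule."""
    d = _det3(m)
    if abs(d) < 1e-12:
        raise ValueError("control points are collinear")
    solution = []
    for k in range(3):
        mk = [row[:] for row in m]
        for i in range(3):
            mk[i][k] = v[i]          # replace column k with the right-hand side
        solution.append(_det3(mk) / d)
    return solution


def fit_affine(src, dst):
    """Fit x' = a*x + b*y + c, y' = d*x + e*y + f from three point pairs.

    src holds control points on the low-accuracy map, dst the matching
    points on the high-accuracy map. Returns a function mapping (x, y)
    on the low-accuracy map to the aligned coordinates.
    """
    m = [[x, y, 1.0] for x, y in src]
    a, b, c = _solve3(m, [p[0] for p in dst])
    d, e, f = _solve3(m, [p[1] for p in dst])

    def transform(x, y):
        return (a * x + b * y + c, d * x + e * y + f)

    return transform


# Hypothetical control points chosen by the user in both maps.
low_accuracy = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
high_accuracy = [(10.0, 20.0), (12.0, 20.0), (10.0, 23.0)]
align = fit_affine(low_accuracy, high_accuracy)
```

Once the transformation is fitted, every feature of the attribute-rich map can be passed through align so that its control points line up with those of the accurate map.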
29.3.3 Specific GIS Data Operations
GIS applications are conducted through the use of special operators such as the following:
1. Interpolation: This process derives elevation data for points at which no samples have been taken. It includes computation at single points, computation for a rectangular grid or along a contour, and so forth. Most interpolation methods are based on triangulation that uses the TIN method for interpolating elevations inside the triangle based on those of its vertices.
2. Interpretation: Digital terrain modeling involves the interpretation of operations on terrain data such as editing, smoothing, reducing details, and enhancing. Additional operations involve patching or zipping the borders of triangles (in TIN data), and merging, which implies combining overlapping models and resolving conflicts among attribute data. Conversions among grid models, contour models, and TIN data are involved in the interpretation of the terrain.
3. Proximity analysis: Several classes of proximity analysis include computations of "zones of interest" around objects, such as the determination of a buffer around a car on a highway. Shortest path algorithms using 2D or 3D information are an important class of proximity analysis.
4. Raster image processing: This process can be divided into two categories: (1) map algebra, which is used to integrate geographic features on different map layers to produce new maps algebraically; and (2) digital image analysis, which deals with analysis of a digital image for features such as edge detection and object detection. Detecting roads in a satellite image of a city is an example of the latter.
5. Analysis of networks: Networks occur in GIS in many contexts that must be analyzed and may be subjected to segmentations, overlays, and so on. Network overlay refers to a type of spatial join where a given network (for example, a highway network) is joined with a point database (for example, incident locations) to yield, in this case, a profile of high-incident roadways.
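The interpolation operator in item 1 can be illustrated with a short Python sketch that estimates the elevation of a point inside one TIN triangle from the elevations of its three vertices, using barycentric weights (linear interpolation over the triangle). The function name and the sample triangle are assumptions for this example; locating the triangle that contains the point is left out.

```python
def interpolate_in_triangle(p, a, b, c):
    """Linearly interpolate the elevation at p = (x, y) inside triangle a, b, c.

    Each vertex is (x, y, z). Returns None when p lies outside the triangle.
    """
    x, y = p
    (x1, y1, z1), (x2, y2, z2), (x3, y3, z3) = a, b, c

    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    if det == 0:
        raise ValueError("degenerate triangle")

    # Barycentric weights of p with respect to the three vertices.
    w1 = ((y2 - y3) * (x - x3) + (x3 - x2) * (y - y3)) / det
    w2 = ((y3 - y1) * (x - x3) + (x1 - x3) * (y - y3)) / det
    w3 = 1.0 - w1 - w2

    if min(w1, w2, w3) < -1e-12:
        return None  # p is outside this triangle

    return w1 * z1 + w2 * z2 + w3 * z3


# Hypothetical triangle with elevations at its three sample points.
a = (0.0, 0.0, 100.0)
b = (10.0, 0.0, 200.0)
c = (0.0, 10.0, 300.0)
z = interpolate_in_triangle((2.0, 3.0), a, b, c)
```

The three sample points lie on the plane z = 100 + 10x + 20y, so the interpolated elevation at (2, 3) is 180, and at a vertex the function returns that vertex's own elevation.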
Other Database Functionality. The functionality of a GIS database is also subject to other considerations.
• Extensibility: GISs are required to be extensible to accommodate a variety of constantly evolving applications and corresponding data types. If a standard DBMS is used, it must allow a core set of data types with a provision for defining additional types and methods for those types.
• Data quality control: As in many other applications, quality of source data is of paramount importance for providing accurate results to queries. This problem is particularly significant in the GIS context because of the variety of data, sources, and measurement techniques involved and the absolute accuracy expected by application users.
• Visualization: A crucial function in GIS is related to visualization, the graphical display of terrain information and the appropriate representation of application attributes to go with it. Major visualization techniques include (1) contouring through the use of isolines, spatial units of lines or arcs of equal attribute values; (2) hillshading, an illumination method used for qualitative relief depiction using varied light intensities for individual facets of the terrain model; and (3) perspective displays, three-dimensional images of terrain model facets using perspective projection methods from computer graphics. These techniques impose cartographic data and other three-dimensional objects on terrain data, providing animated scene renderings such as those in flight simulations and animated movies.
Such requirements clearly illustrate that standard RDBMSs or ODBMSs do not meet the special needs of GIS. It is therefore necessary to design systems that support the vector and raster representations and the spatial functionality as well as the required DBMS features. A popular GIS software package called ARC/INFO, which is not a DBMS but integrates RDBMS functionality in the INFO part of the system, is briefly discussed in the subsection that follows. More systems are likely to be designed in the future to work with relational or object databases that will contain some of the spatial and most of the nonspatial information.
ARC/INFO, a popular GIS software package launched in 1981 by Environmental System Research Institute (ESRI), uses the arc node model to store spatial data. A geographic layer, called a coverage in ARC/INFO, consists of three primitives: (1) nodes (points), (2) arcs (similar to lines), and (3) polygons. The arc is the most important of the three and stores a large amount of topological information. An arc has a start node and an end node (and it therefore has direction too). In addition, the polygons to the left and the right of the arc are also stored along with each arc. As there is no restriction on the shape of the arc, shape points that have no topological information are also stored along with each arc. The database managed by the INFO RDBMS thus consists of three required tables: (1) node attribute table (NAT), (2) arc attribute table (AAT), and (3) polygon attribute table (PAT). Additional information can be stored in separate tables and joined with any of these three tables.
The NAT contains an internal ID for the node, a user-specified ID, the coordinates of the node, and any other information associated with that node (e.g., names of the intersecting roads at the node). The AAT contains an internal ID for the arc, a user-specified ID, the internal ID of the start and end nodes, the internal ID of the polygons to the left and the right, a series of coordinates of shape points (if any), the length of the arc, and any other data associated with the arc (e.g., the name of the road the arc represents). The PAT contains an internal ID for the polygon, a user-specified ID, the area of the polygon, the perimeter of the polygon, and any other associated data (e.g., name of the county the polygon represents).
Typical spatial queries are related to adjacency, containment, and connectivity. The arc node model has enough information to satisfy all three types of queries, but the RDBMS is not ideally suited for this type of querying. A simple example will highlight the number of times a relational database has to be queried to extract adjacency information. Assume that we are trying to determine whether two polygons, A and B, are adjacent to each other. We would have to exhaustively look at the entire AAT to determine whether there is an edge that has A on one side and B on the other. The search cannot be limited to the edges of either polygon as we do not explicitly store all the arcs that make a polygon in the PAT. Storing all the arcs in the PAT would be redundant because all the information is already there in the AAT.
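The adjacency test just described can be sketched with Python's built-in sqlite3 module standing in for the INFO RDBMS. The table name, column names, and sample rows below are hypothetical and cover only the AAT fields needed for the query; they are not the actual ARC/INFO layout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A reduced, hypothetical arc attribute table (AAT).
conn.execute("""
    CREATE TABLE aat (
        arc_id     INTEGER PRIMARY KEY,
        from_node  INTEGER,
        to_node    INTEGER,
        left_poly  INTEGER,
        right_poly INTEGER
    )
""")
conn.executemany(
    "INSERT INTO aat VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, 2, 10, 20),  # arc shared by polygons 10 and 20
        (2, 2, 3, 20, 30),  # arc shared by polygons 20 and 30
        (3, 3, 1, 10, 0),   # arc between polygon 10 and the outside (0)
    ],
)


def adjacent(conn, a, b):
    """Return True if some arc has polygon a on one side and b on the other."""
    row = conn.execute(
        """
        SELECT 1 FROM aat
        WHERE (left_poly = ? AND right_poly = ?)
           OR (left_poly = ? AND right_poly = ?)
        LIMIT 1
        """,
        (a, b, b, a),
    ).fetchone()
    return row is not None


print(adjacent(conn, 10, 20))
print(adjacent(conn, 10, 30))
```

Because nothing in this sketch indexes the arcs by polygon, the query has to examine the arc table as a whole, which is the cost the paragraph above points out.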
ESRI has released Arc/Storm (Arc Store Manager), which allows multiple users to use
the same GIS, handles distributed databases, and integrates with other commercial
RDBMSs like ORACLE, INFORMIX, and SYBASE. While it offers many performance and
functional advantages over ARC/INFO, it is essentially an RDBMS embedded within a GIS.
29.3.5 Problems and Future Issues in GIS
GIS is an expanding application area of databases, reflecting an explosion in the number of
end users using digitized maps, terrain data, space images, weather data, and traffic
information support data. As a consequence, an increasing number of problems related to GIS
applications has been generated and will need to be solved:
1. New architectures: GIS applications will need a new client-server architecture that
will benefit from existing advances in RDBMS and ODBMS technology. One
possible solution is to separate spatial from nonspatial data and to manage the latter
entirely by a DBMS. Such a process calls for appropriate modeling and integration
as both types of data evolve. Commercial vendors find that it is more viable to
keep a small number of independent databases with an automatic posting of
updates across them. Appropriate tools for data transfer, change management, and
workflow management will be required.
2. Versioning and object life-cycle approach: Because of constantly evolving
geographical features, GISs must maintain elaborate cartographic and terrain data, a
management problem that might be eased by incremental updating coupled with
update authorization schemes for different levels of users. Under the object
life-cycle approach, which covers the activities of creating, destroying, and modifying
objects as well as promoting versions into permanent objects, a complete set of
methods may be predefined to control these activities for GIS objects.
3. Data standards: Because of the diversity of representation schemes and models,
formalization of data transfer standards is crucial for the success of GIS. The
international standardization body (ISO TC211) and the European standards body
(CEN TC278) are now in the process of debating relevant issues, among them
conversion between vector and raster data for fast query performance.
4. Matching applications and data structures: Looking again at Figure 29.3, we see that
a classification of GIS applications is based on the nature and organization of data.
In the future, systems covering a wide range of functions, from market analysis
and utilities to car navigation, will need boundary-oriented data and
functionality. On the other hand, applications in environmental science, hydrology, and
agriculture will require more area-oriented and terrain model data. It is not clear
that all this functionality can be supported by a single general-purpose GIS. The
specialized needs of GISs will require that general-purpose DBMSs be
enhanced with additional data types and functionality before full-fledged GIS applications can be supported.
5. Lack of semantics in data structures: This is evident especially in maps. Information
such as highway and road crossings may be difficult to determine based on the stored data. One-way streets are also hard to represent in the present GISs. Transportation CAD systems have incorporated such semantics into GIS.
29.3.6 Selected Bibliography for GIS
There are a number of books written on GIS. Adam and Gangopadhyay (1997) and Laurini and Thompson (1992) focus on GIS database and information management problems. Kemp (1993) gives an overview of GIS issues and data sources. Huxhold (1991) gives an introduction to Urban GIS. Maguire et al. (1991) have a very good collection of GIS-related papers. Antenucci (1998) presents a discussion of the GIS technologies. Shekhar and Chawla (2002) discuss issues and approaches to spatial data management, which is at the core of all GIS. Demers (2002) is another recent book on the fundamentals of GIS. Bossomaier and Green (2002) is a primer on GIS operations, languages, metadata paradigms, and standards. Peng and Tsou (2003) discuss Internet GIS, which includes a suite of emerging new technologies aimed at making GIS more mobile, powerful, and flexible, as well as better able to share and communicate geographic information. The TIGER files for road data in the United States are managed by the U.S. Department of Commerce (1993). Laser-Scan's Web site (http://www.lsl.co.uk/papers) is a good source of information.
Environmental System Research Institute (ESRI) has an excellent library of GIS books for all levels at http://www.esri.com. The GIS terminology is defined at http://www.esri.com/library/glossary/glossary.html. The University of Edinburgh maintains a GIS WWW resource list at http://www.geo.ed.ac.uk/home/giswww.html.
29.4 Genome Data Management
The biological sciences encompass an enormous variety of information. Environmental science gives us a view of how species live and interact in a world filled with natural phenomena. Biology and ecology study particular species. Anatomy focuses on the overall structure of an organism, documenting the physical aspects of individual bodies. Traditional medicine and physiology break the organism into systems and tissues and strive to collect information on the workings of these systems and the organism as a whole. Histology and cell biology delve into the tissue and cellular levels and provide knowledge about the inner structure and function of the cell. This wealth of information that has been generated, classified, and stored for centuries has only recently become a major application of database technology.
Genetics has emerged as an ideal field for the application of information technology. In a broad sense, it can be thought of as the construction of models based on information about genes (which can be defined as basic units of heredity) and populations and the seeking out of relationships in that information. The study of genetics can be divided into three branches: (1) Mendelian genetics, (2) molecular genetics, and (3) population genetics. Mendelian genetics is the study of the transmission of traits between generations. Molecular genetics is the study of the chemical structure and function of genes at the molecular level. Population genetics is the study of how genetic information varies across populations of organisms.
Molecular genetics provides a more detailed look at genetic information by allowing
researchers to examine the composition, structure, and function of genes. The origins of
molecular genetics can be traced to two important discoveries. The first occurred in 1869
when Friedrich Miescher discovered nuclein and its primary component, deoxyribonucleic
acid (DNA). In subsequent research DNA and a related compound, ribonucleic acid (RNA),
were found to be composed of nucleotides (a sugar, a phosphate, and a base, which
combined to form nucleic acid) linked into long polymers via the sugar and phosphate. The
second discovery was the demonstration in 1944 by Oswald Avery that DNA was indeed the
molecular substance carrying genetic information. Genes were thus shown to be composed
of chains of nucleic acids arranged linearly on chromosomes and to serve three primary
functions: (1) replicating genetic information between generations, (2) providing
blueprints for the creation of polypeptides, and (3) accumulating changes, thereby
allowing evolution to occur. Watson and Crick found the double-helix structure of
DNA in 1953, which gave molecular genetics research a new direction.6 The discovery of
DNA and its structure is hailed as probably the most important biological work of the last
100 years, and the field it opened may be the scientific frontier for the next 100. In 1962,
Watson, Crick, and Wilkins won the Nobel Prize for physiology/medicine for this
breakthrough.7
29.4.2 Characteristics of Biological Data
Biological data exhibits many special characteristics that make management of biological
information a particularly challenging problem. We will thus begin by summarizing the
characteristics related to biological information, and focusing on a multidisciplinary field
called bioinformatics that has emerged, with graduate degree programs now in place in
several universities. Bioinformatics addresses information management of genetic information
with special emphasis on DNA sequence analysis. It needs to be broadened into a wider
scope to harness all types of biological information: its modeling, storage, retrieval, and
management. Moreover, applications of bioinformatics span design of targets for drugs,
study of mutations and related diseases, anthropological investigations on migration
patterns of tribes, and therapeutic treatments.
6 See Nature, 171:737, 1953.
7 http://www.pbs.org/wgbh/aso/databank/entries/doS3dn.html
Characteristic 1: Biological data is highly complex when compared with most other domains or applications. Definitions of such data must thus be able to represent a complex substructure of data as well as relationships and to ensure that no information is lost during biological data modeling. The structure of biological data often provides an additional context for interpretation of the information. Biological information systems must be able to represent any level of complexity in any data schema, relationship, or schema substructure, not just hierarchical, binary, or table data. As an example, MITOMAP is a database documenting the human mitochondrial genome.8 This single genome is a small, circular piece of DNA encompassing information about 16,569 nucleotide bases; 52 gene loci encoding messenger RNA, ribosomal RNA, and transfer RNA; 1000 known population variants; over 60 known disease associations; and a limited set of knowledge on the complex molecular interactions of the biochemical energy-producing pathway of oxidative phosphorylation. As might be expected, its management has encountered a large number of problems; we have been unable to use the traditional RDBMS or ODBMS approaches to capture all aspects of the data.
Characteristic 2: The amount and range of variability in data is high. Hence, biological systems must be flexible in handling data types and values. With such a wide range of possible data values, placing constraints on data types must be limited since this may exclude unexpected values (e.g., outlier values) that are particularly common in the biological domain. Exclusion of such values results in a loss of information. In addition, frequent exceptions to biological data structures may require a choice of data types to be available for a given piece of data.
Characteristic 3: Schemas in biological databases change at a rapid pace. Hence, for improved information flow between generations or releases of databases, schema evolution and data object migration must be supported. The ability to extend the schema, a frequent occurrence in the biological setting, is unsupported in most relational and object database systems. Presently systems such as GenBank rerelease the entire database with new schemas once or twice a year rather than incrementally changing the system as changes become necessary. An evolutionary database that supported such incremental change would provide a timely and orderly mechanism for following changes to individual data entities in biological databases over time. This sort of tracking is important for biological researchers to be able to access and reproduce previous results.
Characteristic 4: Representations of the same data by different biologists will likely be different (even when using the same system). Hence, mechanisms for "aligning" different biological schemas or different versions of schemas should be supported. Given the complexity of biological data, there are a multitude of ways of modeling any given entity, with the results often reflecting the particular focus of the scientist. While two individuals may produce different data models if asked to interpret the same entity, these models will likely have numerous points in common. In such situations, it would be useful to biological investigators to be able to run queries across these common points. By linking data elements in a network of schemas, this could be accomplished.
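One very simple way to picture such linking is a table that maps a shared concept to the field that represents it in each schema. The Python sketch below is only an illustration under that assumption; the two lab schemas, the field names, and the records are invented placeholders, not real biological data or an existing system.

```python
# Two hypothetical schemas that model the same gene differently.
lab_a = [
    {"gene_symbol": "geneX", "start": 100, "end": 250},
    {"gene_symbol": "geneY", "start": 300, "end": 420},
]
lab_b = [
    {"locus": "geneX", "length_bp": 151, "note": "placeholder annotation"},
]

sources = {"lab_a": lab_a, "lab_b": lab_b}

# Links between schemas: shared concept -> field name in each schema.
links = {
    "gene_name": {"lab_a": "gene_symbol", "lab_b": "locus"},
}


def query_common(concept, value):
    """Return (source, record) pairs whose linked field equals value."""
    hits = []
    for source, field in links[concept].items():
        for record in sources[source]:
            if record.get(field) == value:
                hits.append((source, record))
    return hits


matches = query_common("gene_name", "geneX")
```

A query posed against the shared concept gene_name reaches both schemas, even though neither uses that name for its own field.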
Characteristic 5: Most users of biological data do not require write access to the database; read-only access is adequate. Write access is limited to privileged users called curators. For example, the database created as part of the MITOMAP project has on average more than
8 Details of MITOMAP and its information complexity can be seen in Kogelnik et al. (1997, 1998) and at http://www.mitomap.org