DATABASE SYSTEMS (Part 24)


29.1 Mobile Databases

29.1.2 Characteristics of Mobile Environments

As we discussed in the previous section, the characteristics of mobile computing include high communication latency, intermittent wireless connectivity, limited battery life, and, of course, changing client location. Latency is caused by the processes unique to the wireless medium, such as coding data for wireless transfer, and tracking and filtering wireless signals at the receiver. Battery life is directly related to battery size, and indirectly related to the mobile device's capabilities. Intermittent connectivity can be intentional or unintentional. Unintentional disconnections happen in areas wireless signals cannot reach, e.g., elevator shafts or subway tunnels. Intentional disconnections occur by user intent, e.g., during an airplane takeoff, or when the mobile device is powered down. Finally, clients are expected to move, which alters the network topology and may cause their data requirements to change. All of these characteristics impact data management, and robust mobile applications must consider them in their design.³

To compensate for high latencies and unreliable connectivity, clients cache replicas of important, frequently accessed data, and work offline if necessary. Besides improving data availability and response time, caching can also reduce client power consumption by eliminating the need to make energy-consuming wireless data transmissions for each data access.

On the other hand, the server may not be able to reach a client. A client may be unreachable because it is dozing (in an energy-conserving state in which many subsystems are shut down) or because it is out of range of a base station. In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case. Proxies for unreachable components are added to the architecture. For a client (and symmetrically for a server), the proxy can cache updates intended for the server. When a connection becomes available, the proxy automatically forwards these cached updates to their ultimate destination.
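The proxy behavior described above can be sketched as a store-and-forward queue. This is a minimal illustration under stated assumptions, not an implementation from the text: the class name `UpdateProxy`, its methods, and the `send` callback are all hypothetical.

```python
from collections import deque


class UpdateProxy:
    """Hypothetical proxy that buffers updates while the destination is unreachable."""

    def __init__(self, send):
        self._send = send          # assumed callable that delivers one update
        self._pending = deque()    # updates cached while disconnected, oldest first
        self.connected = False

    def submit(self, update):
        """Deliver immediately when connected; otherwise cache the update."""
        if self.connected:
            self._send(update)
        else:
            self._pending.append(update)

    def on_connect(self):
        """Forward all cached updates in their original order."""
        self.connected = True
        while self._pending:
            self._send(self._pending.popleft())

    def on_disconnect(self):
        self.connected = False


delivered = []
proxy = UpdateProxy(delivered.append)
proxy.submit("update-1")   # cached: no connection yet
proxy.submit("update-2")
proxy.on_connect()         # both cached updates are forwarded in order
proxy.submit("update-3")   # delivered directly
```

A FIFO queue is used so that updates reach their destination in the order they were issued; a real proxy would also need persistence and duplicate handling, which this sketch leaves out.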

As suggested above, mobile computing poses challenges for servers as well as clients. The latency involved in wireless communication makes scalability a problem. Because latency due to wireless communications increases the time to service each client request, the server can handle fewer clients. One way servers relieve this problem is by broadcasting data whenever possible. Broadcast takes advantage of a natural characteristic of radio communications, and is scalable because a single broadcast of a data item can satisfy all outstanding requests for it. For example, instead of sending weather information to all clients in a cell individually, a server can simply broadcast it periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
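The scalability argument can be made concrete by counting transmissions. The sketch below is only an illustration: the function names and the sample request list are assumptions, and it counts messages rather than modeling a real radio channel.

```python
from collections import defaultdict


def unicast_messages(requests):
    """One transmission per (client, item) request."""
    return len(requests)


def broadcast_messages(requests):
    """One transmission per distinct item, however many clients want it."""
    return len({item for _client, item in requests})


def clients_served_per_broadcast(requests):
    """Map each item to the set of clients satisfied by a single broadcast of it."""
    served = defaultdict(set)
    for client, item in requests:
        served[item].add(client)
    return dict(served)


# Hypothetical outstanding requests in one cell: (client id, data item).
requests = [("c1", "weather"), ("c2", "weather"), ("c3", "weather"), ("c2", "traffic")]
```

With these sample requests, unicast needs one message per request, while broadcast needs one message per distinct item, so the saving grows with the number of clients that want the same item.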

Client mobility also poses many data management challenges. First, servers must keep track of client locations in order to efficiently route messages to them. Second, client data should be stored in the network location that minimizes the traffic necessary to access it. Keeping data in a fixed location increases access latency if the client moves "far away" from it. Finally, as stated above, the act of moving between cells must be transparent to the client. The server must be able to gracefully divert the shipment of data from one base station to another, without the client noticing.

3 This architecture is based on the IETF proposal in IETF (1999) with comments by Corson and Macker (1999).

Client mobility also allows new applications that are location-based. For example, consider an electronic valet application that can tell a user the location of the nearest restaurant. Clearly, "nearest" is relative to the client's current position, and movement can invalidate any previously cached responses. Upon movement, the client must efficiently invalidate parts of its cache and request updated data from the database.
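One way to picture movement-triggered invalidation is to tag each cached answer with the position where it was obtained and a distance within which it is assumed to remain valid. The sketch below is hypothetical throughout: the text does not prescribe a validity radius, and the names `CachedAnswer`, `LocationCache`, and `on_move` are illustrative.

```python
import math
from dataclasses import dataclass


@dataclass
class CachedAnswer:
    value: str
    origin: tuple         # (x, y) position where the answer was obtained
    valid_radius: float   # assumed distance within which the answer stays valid


class LocationCache:
    def __init__(self):
        self._entries = {}

    def put(self, query, value, origin, valid_radius):
        self._entries[query] = CachedAnswer(value, origin, valid_radius)

    def get(self, query):
        entry = self._entries.get(query)
        return entry.value if entry else None

    def on_move(self, position):
        """Evict entries no longer valid at the new position; return their queries."""
        stale = [q for q, e in self._entries.items()
                 if math.dist(position, e.origin) > e.valid_radius]
        for q in stale:
            del self._entries[q]
        return stale


cache = LocationCache()
cache.put("nearest restaurant", "Cafe A", origin=(0.0, 0.0), valid_radius=1.0)
cache.put("city name", "Springfield", origin=(0.0, 0.0), valid_radius=50.0)
stale = cache.on_move((3.0, 4.0))   # the client has moved 5 units from the origin
```

Only the entries returned by `on_move` need to be requested again from the database; answers with a larger validity radius survive the move.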

29.1.3 Data Management Issues

From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:

1. The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with a DBMS-like functionality, with additional functionality for locating mobile units and additional query and transaction management features to meet the requirements of mobile environments.

2. The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.

Hence, the distributed data management issues we discussed in Chapter 24 can also be applied to mobile databases with the following additional considerations and variations:

• Data distribution and replication: Data is unevenly distributed among the base stations and mobile units. The consistency constraints compound the problem of cache management. Caches attempt to provide the most frequently accessed and updated data to mobile units that process their own transactions and may be disconnected over long periods.

• Transaction models: Issues of fault tolerance and correctness of transactions are aggravated in the mobile environment. A mobile transaction is executed sequentially through several base stations and possibly on multiple data sets depending upon the movement of the mobile unit. Central coordination of transaction execution is lacking, particularly in scenario (2) above. Moreover, a mobile transaction is expected to be long-lived because of disconnection in mobile units. Hence, traditional ACID properties of transactions (see Chapter 19) may need to be modified and new transaction models must be defined.

• Query processing: Awareness of where data is located is important and affects the cost/benefit analysis of query processing. Query optimization is more complicated because of mobility and rapid resource changes of mobile units. The query response needs to be returned to mobile units that may be in transit or may cross cell boundaries yet must receive complete and correct query results.

• Recovery and fault tolerance: The mobile database environment must deal with site, media, transaction, and communication failures. Site failure of a mobile unit is frequent due to limited battery power. A voluntary shutdown of a mobile unit should not be treated as a failure. Transaction failures are routine during handoff when a mobile unit crosses cells. The transaction manager should be able to deal with such frequent failures.

• Mobile database design: The global name resolution problem for handling queries is compounded because of mobility and frequent shutdown. Mobile database design must consider many issues of metadata management, for example, the constant updating of location information.

• Location-based service: As clients move, location-dependent cache information may become stale. Eviction techniques are important in this case. Furthermore, frequently updating location-dependent queries, then applying these (spatial) queries in order to refresh the cache, poses a problem.

• Division of labor: Certain characteristics of the mobile environment force a change in the division of labor in query processing. In some cases, the client must function independently of the server. However, what are the consequences of allowing full independent access to replicated data? The relationship between client responsibilities and their consequences has yet to be developed.

• Security: Mobile data is less secure than that which is left at the fixed location. Proper techniques for managing and authorizing access to critical data become more important in this environment. Data is also more volatile, and techniques must be able to compensate for its loss.

29.1.4 Application: Intermittently Synchronized Databases

One mobile computing scenario is becoming increasingly commonplace as people conduct their work away from their offices and homes and perform a wide range of activities and functions: all kinds of sales, particularly in pharmaceuticals, consumer goods, and industrial parts; law enforcement; insurance and financial consulting and planning; real estate or property management activities; courier and transportation services; and so on. In these applications, a server or a group of servers manages the central database, and the clients carry laptops or palmtops with resident DBMS software to do "local" transaction activity most of the time. The clients connect via a network or a dial-up connection (or possibly even through the Internet) with the server, typically for a short session, say, 30 to 60 minutes. They send their updates to the server, and the server must in turn enter them in its central database, which must maintain up-to-date data and prepare appropriate copies for all clients on the system. Thus, whenever clients connect (through a process known in the industry as synchronization of a client with a server), they receive a batch of updates to be installed on their local database. The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them. This environment has problems similar to those in distributed and client-server databases, and some from mobile databases, but presents several additional research problems for investigation. We refer to this environment as Intermittently Synchronized Database Environment (ISDBE), and the corresponding databases as Intermittently Synchronized Databases (ISDBs).

Together, the following characteristics of ISDBs make them distinct from the mobile databases we have discussed thus far:

1. A client connects to the server when it wants to exchange updates. This communication may be unicast (one-on-one communication between the server and the client) or multicast (one sender or server may periodically communicate to a set of receivers or update a group of clients).

2. A server cannot connect to a client at will.

3. Issues of wireless versus wired client connections and power conservation are generally immaterial.

4. A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.

5. A client has multiple ways of connecting to a server and, in case of many servers, may choose a particular server to connect to based on proximity, communication nodes available, resources available, etc.

Because of such differences, there is a need to address a number of problems related to ISDBs that are different from those typically involving mobile database systems. These include database design for server databases, consistency and synchronization management among client and server databases, transaction and update processing, efficient use of the server bandwidth, and achieving scalability in the ISDB environments.
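The client-initiated synchronization session described in this section can be sketched as follows. This is a deliberately simplified, hypothetical model: the `Server` and `Client` classes are illustrative, databases are plain dictionaries, and conflict detection and resolution (one of the research problems named above) are omitted, so the last update applied simply wins.

```python
class Server:
    def __init__(self, client_ids):
        self.db = {}
        self.outbox = {cid: [] for cid in client_ids}  # batches awaiting each client

    def synchronize(self, client_id, updates):
        """Apply a client's updates, queue them for the others, return this client's batch."""
        for key, value in updates:
            self.db[key] = value
            for other, batch in self.outbox.items():
                if other != client_id:
                    batch.append((key, value))
        batch, self.outbox[client_id] = self.outbox[client_id], []
        return batch


class Client:
    def __init__(self, client_id):
        self.id = client_id
        self.db = {}
        self.log = []   # local updates made since the last synchronization

    def write(self, key, value):
        """Local transaction activity while disconnected."""
        self.db[key] = value
        self.log.append((key, value))

    def synchronize(self, server):
        """Client-initiated session: send local updates, install the returned batch."""
        batch = server.synchronize(self.id, self.log)
        self.log = []
        for key, value in batch:
            self.db[key] = value


server = Server(["a", "b"])
a, b = Client("a"), Client("b")
a.write("price:widget", 10)   # done offline
a.synchronize(server)
b.synchronize(server)         # b receives a's update as part of its batch
```

Note that only clients call `synchronize`; the server never contacts a client, which mirrors characteristic 2 in the list above.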

29.1.5 Selected Bibliography for Mobile Databases

There has been a sudden surge of interest in mobile computing, and research on mobile databases has had a significant growth for the last five to six years. The June 1995 issue of Byte magazine discusses many aspects of mobile computing. Among books written on this topic, Dhawan (1997) is an excellent source on mobile computing. Wireless networks and their future are discussed in Holtzman and Goodman (1993). Imielinski and Badrinath (1994) provide a good survey of mobile database issues, and Imielinski and Badrinath (1992) discuss data and metadata allocation in a mobile architecture. Dunham and Helal (1995) discuss problems of query processing, data distribution, and transaction management for mobile databases. Foreman and Zahorjan (1994) describe the capabilities and the problems of mobile computing and make a convincing argument in its favor as a viable solution for many information system applications of the future. Pitoura and Samaras (1998) describe all aspects of mobile database problems and solutions. Chrysanthis (1993) describes a transaction model that is designed to operate in an environment with mobile clients. In particular, this model allows a client to share the transaction processing load with proxies in order to facilitate mobility. Bertino et al. (1998) discuss approaches to fault tolerance and recovery in mobile databases. Acharya et al. (1995) consider broadcast schedules that minimize average query latency, and explore the impact of such schedules on optimal client caching strategies. Milojicic et al. (2002) present a tutorial on peer-to-peer computing. Corson and Macker (1999) is a response to the IETF (1999) report that discusses the mobile ad-hoc networking protocol performance issues. Broadcasting (or pushing) data as a means of scalably disseminating information to clients is covered in Yee et al. (2002). Chintalapati et al. (1997) provide an adaptive location management algorithm. Jensen et al. (2001) discuss data management issues as they pertain to location-based services. Wolfson (2001) describes a novel way of efficiently modeling object mobility by describing position using trajectories instead of points. For an initial discussion of the ISDB scalability issues and an approach by aggregation of data and grouping of clients, see Mahajan et al. (1998). Specific aggregation algorithms for grouping data at the server in ISDB applications are described in Yee et al. (2001). Gray et al. (1993) discuss ISDB update conflicts and resolution techniques under various ISDB architectures. Breitbart et al. (1999) go into further detail about deferred synchronization algorithms for replicated data.

29.2 Multimedia Databases

In the years ahead multimedia information systems are expected to dominate our daily lives. Our houses will be wired for bandwidth to handle interactive multimedia applications. Our high-definition TV/computer workstations will have access to a large number of databases, including digital libraries, image and video databases that will distribute vast amounts of multisource multimedia content.

29.2.1 The Nature of Multimedia Data and Applications

In Section 24.3 we discussed the advanced modeling issues related to multimedia data. We also examined the processing of multiple types of data in Chapter 22 in the context of object relational DBMSs (ORDBMSs). DBMSs have been constantly adding to the types of data they support. Today the following types of multimedia data are available in current systems:

• Text: May be formatted or unformatted. For ease of parsing structured documents, standards like SGML and variations such as HTML are being used.

• Graphics: Examples include drawings and illustrations that are encoded using some descriptive standards (e.g., CGM, PICT, PostScript).

• Images: Includes drawings, photographs, and so forth, encoded in standard formats such as bitmap, JPEG, and MPEG. Compression is built into JPEG and MPEG. These images are not subdivided into components. Hence querying them by content (e.g., find all images containing circles) is nontrivial.

• Animations: Temporal sequences of image or graphic data.

• Video: A set of temporally sequenced photographic data for presentation at specified rates, for example, 30 frames per second.

• Structured audio: A sequence of audio components comprising note, tone, duration, and so forth.

• Audio: Sample data generated from aural recordings in a string of bits in digitized form. Analog recordings are typically converted into digital form before storage.

• Composite or mixed multimedia data: A combination of multimedia data types such as audio and video which may be physically mixed to yield a new storage format or logically mixed while retaining original types and formats. Composite data also contains additional control information describing how the information should be rendered.

Nature of Multimedia Applications. Multimedia data may be stored, delivered, and utilized in many different ways. Applications may be categorized based on their data management characteristics as follows:

• Repository applications: A large amount of multimedia data as well as metadata is stored for retrieval purposes. A central repository containing multimedia data may be maintained by a DBMS and may be organized into a hierarchy of storage levels: local disks, tertiary disks and tapes, optical disks, and so on. Examples include repositories of satellite images, engineering drawings and designs, space photographs, and radiology scanned pictures.

• Presentation applications: A large number of applications involve delivery of multimedia data subject to temporal constraints. Audio and video data are delivered this way; in these applications optimal viewing or listening conditions require the DBMS to deliver data at certain rates offering "quality of service" above a certain threshold. Data is consumed as it is delivered, unlike in repository applications, where it may be processed later (e.g., multimedia electronic mail). Simple multimedia viewing of video data, for example, requires a system to simulate VCR-like functionality. Complex and interactive multimedia presentations involve orchestration directions to control the retrieval order of components in a series or in parallel. Interactive environments must support capabilities such as real-time editing, analysis, or annotating of video and audio data.

• Collaborative work using multimedia information: This is a new category of applications in which engineers may execute a complex design task by merging drawings, fitting subjects to design constraints, and generating new documentation, change notifications, and so forth. Intelligent healthcare networks as well as telemedicine will involve doctors collaborating among themselves, analyzing multimedia patient data and information in real time as it is generated.

All of these application areas present major challenges for the design of multimedia database systems.

29.2.2 Data Management Issues

Multimedia applications dealing with thousands of images, documents, audio and video segments, and free text data depend critically on appropriate modeling of the structure and content of data and then designing appropriate database schemas for storing and retrieving multimedia information. Multimedia information systems are very complex and embrace a large set of issues, including the following:

• Modeling: This area has the potential for applying database versus information retrieval techniques to the problem. There are problems of dealing with complex objects (see Chapter 20) made up of a wide range of types of data: numeric, text, graphic (computer-generated image), animated graphic image, audio stream, and video sequence. Documents constitute a specialized area and deserve special consideration.

• Design: The conceptual, logical, and physical design of multimedia databases has not been addressed fully, and it remains an area of active research. The design process can be based on the general methodology described in Chapter 12, but the performance and tuning issues at each level are far more complex.

• Storage: Storage of multimedia data on standard disklike devices presents problems of representation, compression, mapping to device hierarchies, archiving, and buffering during the input/output operation. Adhering to standards such as JPEG or MPEG is one way most vendors of multimedia products are likely to deal with this issue. In DBMSs, a "BLOB" (Binary Large Object) facility allows untyped bitmaps to be stored and retrieved. Standardized software will be required to deal with synchronization and compression/decompression, and will be coupled with indexing problems, which are still in the research domain.

• Queries and retrieval: The "database" way of retrieving information is based on query languages and internal index structures. The "information retrieval" way relies strictly on keywords or predefined index terms. For images, video data, and audio data, this opens up many issues, among them efficient query formulation, query execution, and optimization. The standard optimization techniques we discussed in Chapter 16 need to be modified to work with multimedia data types.

• Performance: For multimedia applications involving only documents and text, performance constraints are subjectively determined by the user. For applications involving video playback or audio-video synchronization, physical limitations dominate. For instance, video must be delivered at a steady rate of 60 frames per second. Techniques for query optimization may compute expected response time before evaluating the query. The use of parallel processing of data may alleviate some problems, but such efforts are currently subject to further experimentation.

Such issues have given rise to a variety of open research problems. We look at a few representative problems now.

Information Retrieval Perspective in Querying Multimedia Databases. Modeling data content has not been an issue in database models and systems because the data has a rigid structure and the meaning of a data instance can be inferred from the schema. In contrast, information retrieval (IR) is mainly concerned with modeling the content of text documents (through the use of keywords, phrasal indexes, semantic networks, word frequencies, soundex encoding, and so on) for which structure is generally neglected. By modeling content, the system can determine whether a document is relevant to a query by examining the content descriptors of the document. Consider, for instance, an insurance company's accident claim report as a multimedia object: it includes images of the accident, structured insurance forms, audio recordings of the parties involved in the accident, the text report of the insurance company's representative, and other information. Which data model should be used to represent multimedia information such as this? How should queries be formulated against this data? Efficient execution thus becomes a complex issue, and the semantic heterogeneity and representational complexity of multimedia information give rise to many new problems.

Requirements of Multimedia/Hypermedia Data Modeling and Retrieval. To capture the full expressive power of multimedia data modeling, the system should have a general construct that lets the user specify links between any two arbitrary nodes. Hypermedia links, or hyperlinks, have a number of different characteristics:

• Links can be specified with or without associated information, and they may have large descriptions associated with them.

• Links can start from a specific point within a node or from the whole node.

• Links can be directional, or nondirectional when they can be traversed in either direction.

The link capability of the data model should take into account all of these variations. When content-based retrieval of multimedia data is needed, the query mechanism should have access to the links and the link-associated information. The system should provide facilities for defining views over all links, private and public. Valuable contextual information can be obtained from the structural information. Automatically generated hypermedia links do not reveal anything new about the two nodes and, in contrast to manually generated hypermedia links, would have different significance. Facilities for creating and utilizing such links, as well as developing and using navigational query languages to utilize the links, are important features of any system permitting effective use of multimedia information. This area is important to interlinked databases on the WWW.
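The link variations listed above (associated information, a starting point within a node, directionality, private versus public visibility) can be captured in a small record type. The sketch below is purely illustrative: the `Link` fields, the `neighbors` helper, and the sample node names are assumptions, not a data model defined in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Link:
    source: str                     # node id
    target: str                     # node id
    directional: bool = True        # False: traversable in either direction
    anchor: Optional[str] = None    # point within the source node; None = whole node
    description: str = ""           # optional associated information
    public: bool = True             # private links are visible only to their owner
    owner: Optional[str] = None


def neighbors(links, node, user=None):
    """Nodes reachable from `node` in one step over links visible to `user`."""
    result = []
    for link in links:
        if not (link.public or link.owner == user):
            continue
        if link.source == node:
            result.append(link.target)
        elif not link.directional and link.target == node:
            result.append(link.source)
    return result


links = [
    Link("report", "photo1", anchor="paragraph 3", description="damage to front bumper"),
    Link("photo1", "audio1", directional=False),
    Link("report", "notes", public=False, owner="adjuster"),
]
```

The `user` argument of `neighbors` acts as a very simple view over the links: a private link is followed only when the caller is its owner.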

The World Wide Web presents an opportunity to access a vast amount of information via an array of unstructured and structured databases that are interlinked. The phenomenal success and growth of the web has made the problem of finding, accessing, and maintaining this information extremely challenging. For the last few years several projects have been attempting to define frameworks and languages that will allow us to define the semantic content of the web in a machine-processable form. The effort is collectively known by the term semantic web. The RDF (Resource Description Framework), XHTML (Extensible Hypertext Markup Language), DAML (DARPA Agent Markup Language), and OIL (Ontology Inference Layer) are among its major components.⁴ Further details are outside the scope of our discussion.

4 See Fensel (2000) for an overview of these terms.

Indexing of Images. There are two approaches to indexing images: (1) identifying objects automatically through image-processing techniques, and (2) assigning index terms and phrases through manual indexing. An important problem in using image-processing techniques to index pictures relates to scalability. The current state of the art allows the indexing of only simple patterns in images. Complexity increases with the number of recognizable features. Another important problem relates to the complexity of the query. Rules and inference mechanisms can be used to derive higher-level facts from simple features of images. Similarly, abstraction can be used to capture concepts that cannot simply be defined in terms of a set of <attribute, value> pairs. This allows high-level queries like "find hotel buildings that have open foyers and allow maximum sunshine in the front desk area" in an architectural application.

The information-retrieval approach to image indexing is based on one of three indexing schemes:

1. Classificatory systems: Classify images hierarchically into predetermined categories. In this approach, the indexer and the user should have a good knowledge of the available categories. Finer details of a complex image and relationships among objects in an image cannot be captured.

2. Keyword-based systems: Use an indexing vocabulary similar to that used in the indexing of textual documents. Simple facts represented in the image (like "ice-capped region") and facts derived as a result of high-level interpretation by humans (like permanent ice, recent snowfall, and polar ice) can be captured.

3. Entity-attribute-relationship systems: All objects in the picture and the relationships between objects and the attributes of the objects are identified.

In the case of text documents, an indexer can choose the keywords from the pool of words available in the document to be indexed. This is not possible in the case of visual and video data.
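A keyword-based scheme of the kind described in item 2 is commonly realized as an inverted index from index terms to image identifiers. The following is a minimal sketch under that assumption; the class `KeywordImageIndex`, the image ids, and the conjunctive search semantics are illustrative choices, and the terms are assumed to be assigned manually by an indexer.

```python
from collections import defaultdict


class KeywordImageIndex:
    """Inverted index from manually assigned index terms to image ids."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, image_id, terms):
        for term in terms:
            self._postings[term.lower()].add(image_id)

    def search(self, *terms):
        """Return ids of images indexed under all of the given terms."""
        sets = [self._postings.get(t.lower(), set()) for t in terms]
        return set.intersection(*sets) if sets else set()


index = KeywordImageIndex()
index.add("img1", ["ice-capped region", "permanent ice"])
index.add("img2", ["ice-capped region", "recent snowfall"])
```

Because the terms come from a human indexer rather than from the image itself, retrieval quality depends entirely on how consistently the vocabulary is applied.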

Problems in Text Retrieval. Text retrieval has always been the key feature in business applications and library systems, and although much work has gone into these problems, there remains an ongoing need for improvement, especially regarding the following issues:

• Phrase indexing: Substantial improvements can be realized if phrase descriptors (as opposed to single-word index terms) are assigned to documents and used in queries, provided that these descriptors are good indicators of document content and information need.

• Use of thesaurus: One reason for the poor recall of current systems is that the vocabulary of the user differs from the vocabulary used to index the documents. One solution is to use a thesaurus to expand the user's query with related terms. The problem then becomes one of finding a thesaurus for the domain of interest. Another resource in this context is ontologies. An ontology necessarily entails or embodies some sort of world view with respect to a given domain. The world view is often conceived as a set of concepts (e.g., entities, attributes, processes), their definitions, and their interrelationships, which describe a target world. An ontology can be constructed in two ways, domain dependent and generic. The purpose of generic ontologies is to make a general framework for all (or most) categories encountered by human existence. A variety of domain ontologies such as gene ontology (see Section 29.4) or ontology for electronic components have been constructed.⁵

• Resolving ambiguity: One of the reasons for low precision (the ratio of the number of relevant items retrieved to the total number of retrieved items) in text information retrieval systems is that words have multiple meanings. One way to resolve ambiguity is to use an online dictionary or ontology; another is to compare the contexts in which the two words occur.
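The thesaurus-based query expansion mentioned in the list above can be illustrated in a few lines. This is a sketch only: the `expand_query` function and the sample thesaurus are hypothetical, and a real thesaurus would also distinguish broader, narrower, and synonymous terms.

```python
def expand_query(terms, thesaurus):
    """Return the query terms plus any related terms listed in the thesaurus."""
    expanded = []
    for term in terms:
        for candidate in [term, *thesaurus.get(term, [])]:
            if candidate not in expanded:   # keep first-seen order, no duplicates
                expanded.append(candidate)
    return expanded


# Hypothetical thesaurus for an automotive domain.
thesaurus = {"car": ["automobile", "vehicle"], "repair": ["fix", "maintenance"]}
```

Expansion of this kind tends to raise recall, because documents indexed under a related term now match; it can lower precision when a related term has other meanings, which is the ambiguity problem described in the last item.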

In the first three decades of DBMS development, roughly from 1965 to 1995, the primary focus had been on the management of mostly numeric business and industrial data. In the next few decades, nonnumeric textual information will probably dominate database content. The text retrieval problem is becoming very relevant in the context of HTML and XML documents. The web currently contains several billion of these pages. Search engines find relevant documents given lists of words, which is a case of free-form natural language query. Obtaining the correct result that meets the requirements of both precision (percentage of retrieved documents that are relevant) and recall (percentage of total relevant documents that are retrieved), which are standard metrics in information retrieval, remains a challenge. As a consequence, a variety of functionalities involving comparison, conceptualization, understanding, indexing, and summarization of documents will be added to DBMSs. Multimedia information systems promise to bring about a joining of disciplines that have historically been separate areas: information retrieval and database management.
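The two metrics defined above translate directly into code. The sketch below follows those definitions; the function name, the sample document ids, and the convention of returning 0.0 when a denominator is empty are assumptions made for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for sets of retrieved and relevant document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)            # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


p, r = precision_recall(retrieved={"d1", "d2", "d3", "d4"}, relevant={"d1", "d2", "d5"})
```

In this example two of the four retrieved documents are relevant (precision 2/4), and two of the three relevant documents were retrieved (recall 2/3).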

29.2.4 Multimedia Database Applications

Large-scale applications of multimedia databases can be expected to encompass a large number of disciplines and enhance existing capabilities. Some important applications will be involved:

• Documents and records management: A large number of industries and businesses keep very detailed records and a variety of documents. The data may include engineering design and manufacturing data, medical records of patients, publishing material, and insurance claim records.

• Knowledge dissemination: The multimedia mode, a very effective means of knowledge dissemination, will encompass a phenomenal growth in electronic books, catalogs, manuals, encyclopedias, and repositories of information on many topics.

• Education and training: Teaching materials for different audiences, from kindergarten students to equipment operators to professionals, can be designed from multimedia sources. Digital libraries are expected to have a major influence on the way future students and researchers as well as other users will access vast repositories of educational material.

5 A good discussion of ontologies is given in Uschold and Gruninger (1996).

• Marketing, advertising, retailing, entertainment, and travel: There are virtually no limits to using multimedia information in these applications, from effective sales presentations to virtual tours of cities and art galleries. The film industry has already shown the power of special effects in creating animations and synthetically designed animals, aliens, and special effects. The use of predesigned stored objects in multimedia databases will expand the range of these applications.

• Real-time control and monitoring: Coupled with active database technology (see Chapter 24), multimedia presentation of information can be a very effective means for monitoring and controlling complex tasks such as manufacturing operations, nuclear power plants, patients in intensive care units, and transportation systems.

Commercial Systems for Multimedia Information Management. There are no DBMSs designed for the sole purpose of multimedia data management, and therefore there are none that have the range of functionality required to fully support all of the multimedia information management applications that we discussed above. However, several DBMSs today support multimedia data types; these include Informix Dynamic Server, DB2 Universal Database (UDB) of IBM, Oracle 9 and 10, CA-JASMINE, Sybase, and ODB II. All of these DBMSs have support for objects, which is essential for modeling a variety of complex multimedia objects. One major problem with these systems is that the "blades, cartridges, and extenders" for handling multimedia data are designed in a very ad hoc manner. The functionality is provided without much apparent attention to scalability and performance. There are products available that operate either stand-alone or in conjunction with other vendors' systems to allow retrieval of image data by content. They include Virage, Excalibur, and IBM's QBIC. Operations on multimedia need to be standardized. The MPEG-7 and other standards are addressing some of these issues.

29.2.5 Selected Bibliography on Multimedia Databases

Multimedia database management is becoming a very heavily researched area, with several industrial projects under way. Grosky (1994, 1997) provides two excellent tutorials on the topic. Pazandak and Srivastava (1995) provide an evaluation of database systems related to the requirements of multimedia databases. Grosky et al. (1997) contains contributed articles, including a survey on content-based indexing and retrieval by Jagadish (1997). Faloutsos et al. (1994) also discuss a system for image querying by content. Li et al. (1998) introduce image modeling in which an image is viewed as a hierarchical structured complex object with both semantics and visual properties. Nwosu et al. (1996) and Subramanian and Jajodia (1997) have written books on the topic.

Lassila (1998) discusses the need for metadata for accessing multimedia information on the web; the semantic web effort is summarized in Fensel (2000). Khan (2000) did a dissertation on ontology-based information retrieval. Uschold and Gruninger (1996) is a good resource on ontologies. Corcho et al. (2003) compare ontology languages and discuss methodologies to build ontologies. Multimedia content analysis, indexing, and filtering are discussed in Dimitrova (1999). A survey of content-based multimedia retrieval is provided by Yoshitaka and Ichikawa (1999). The following WWW references may be consulted for additional information:

CA-JASMINE (Multimedia ODBMS): http://www.cai.com/products/iasmine.htm

Excalibur technologies: http://www.excalib.com

Virage, Inc. (Content based image retrieval): http://www.virage.com

IBM's QBIC (Query by Image Content) product:

29.3 Geographic Information Systems

Geographic information systems (GIS) are used to collect, model, store, and analyze information describing physical properties of the geographical world. The scope of GIS broadly encompasses two types of data: (1) spatial data, originating from maps, digital images, administrative and political boundaries, roads, and transportation networks, as well as physical data such as rivers, soil characteristics, climatic regions, and land elevations; and (2) nonspatial data, such as socio-economic data (like census counts), economic data, and sales or marketing information. GIS is a rapidly developing domain that offers highly innovative approaches to meet some challenging technical demands.

29.3.1 GIS Applications

It is possible to divide GISs into three categories: (1) cartographic applications, (2) digital terrain modeling applications, and (3) geographic objects applications. Figure 29.3 summarizes these categories.

In cartographic and terrain modeling applications, variations in spatial attributes are captured, for example, soil characteristics, crop density, and air quality. In geographic objects applications, objects of interest are identified from a physical domain, for example, power plants, electoral districts, property parcels, product distribution districts, and city landmarks. These objects are related to pertinent application data, which may be, for this specific example, power consumption, voting patterns, property sales volumes, product sales volume, and traffic density.

The first two categories of GIS applications require a field-based representation, whereas the third category requires an object-based one. The cartographic approach involves special functions that can include the overlapping of layers of maps to combine attribute data that will allow, for example, the measuring of distances in three-dimensional space and the reclassification of data on the map. Digital terrain modeling requires a digital representation of parts of the earth's surface using land elevations at sample points that are connected to yield a surface model, such as a three-dimensional net (connected lines in 3D) showing the surface terrain. It requires functions of interpolation between observed points as well as visualization. In object-based geographic applications, additional spatial functions are needed to deal with data related to roads, physical pipelines, communication cables, power lines, and such. For example, for a given region, comparable maps can be used for comparison at various points of time to show changes in certain data such as locations of roads, cables, buildings, and streams.

Earth science resource studies; civil engineering and military evaluation; soil surveys; air and water pollution studies; flood control; water resource management

Geographic Objects Applications: car navigation systems; geographic market analysis; utility distribution and consumption; consumer product and services economic analysis

FIGURE 29.3 A possible classification of GIS applications. (Adapted from Adam and Gangopadhyay (1997).)

29.3.2 Data Management Requirements of GIS

The functional requirements of the GIS applications above translate into the following database requirements.

Data Modeling and Representation. GIS data can be broadly represented in two formats: (1) vector and (2) raster. Vector data represents geometric objects such as points, lines, and polygons. Thus a lake may be represented as a polygon, and a river by a series of line segments. Raster data is characterized as an array of points, where each point represents the value of an attribute for a real-world location. Informally, raster images are n-dimensional arrays where each entry is a unit of the image and represents an attribute. Two-dimensional units are called pixels, while three-dimensional units are called voxels. Three-dimensional elevation data is stored in a raster-based digital elevation model (DEM) format. Another format, called triangular irregular network (TIN), is a topological vector-based approach that models surfaces by connecting sample points as vertices of triangles and has a point density that may vary with the roughness of the terrain. Rectangular grids (or elevation matrices) are two-dimensional array structures. In digital terrain modeling (DTM), the model also may be used by substituting the elevation with some attribute of interest, such as population density or air temperature. GIS data often includes a temporal structure in addition to a spatial structure. For example, traffic flow or average vehicular speeds in traffic may be measured every 60 seconds at a set of points in a roadway network.
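The two representations can be contrasted in a short Python sketch. It is illustrative only: the class names (`VectorFeature`, `RasterGrid`), the coordinates, and the cell values are assumptions invented for the example and do not come from any GIS product.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in some assumed map coordinate system


@dataclass
class VectorFeature:
    """A geometric object: a point, a line (series of segments), or a polygon."""
    kind: str               # "point", "line", or "polygon"
    vertices: List[Point]   # ordered coordinates describing the geometry


@dataclass
class RasterGrid:
    """A two-dimensional array in which each cell (pixel) holds one attribute value."""
    origin: Point             # map coordinates of the lower-left corner of cell (0, 0)
    cell_size: float          # side length of one square cell, in map units
    cells: List[List[float]]  # cells[row][col]; row 0 is the southernmost row

    def value_at(self, x: float, y: float) -> float:
        """Return the value of the cell containing (x, y); no bounds checking."""
        col = int((x - self.origin[0]) // self.cell_size)
        row = int((y - self.origin[1]) // self.cell_size)
        return self.cells[row][col]


# Hypothetical vector data: a lake as a polygon, a river as a series of line segments.
lake = VectorFeature("polygon", [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)])
river = VectorFeature("line", [(4.0, 1.5), (6.0, 2.0), (9.0, 2.5)])

# Hypothetical raster data: a tiny elevation grid with 10-unit cells.
dem = RasterGrid(origin=(0.0, 0.0), cell_size=10.0,
                 cells=[[100.0, 105.0, 110.0],
                        [102.0, 108.0, 115.0]])

elevation = dem.value_at(25.0, 12.0)  # looks up the cell in row 1, column 2
```

Substituting another attribute (population density, say) for the elevation values leaves the raster structure unchanged, which is the substitution mentioned above for digital terrain models.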

Data Analysis. GIS data undergoes various types of analysis. For example, in applications such as soil erosion studies, environmental impact studies, or hydrological runoff simulations, DTM data may undergo various types of geomorphometric analysis: measurements such as slope values, gradients (the rate of change in altitude), aspect (the compass direction of the gradient), profile convexity (the rate of change of gradient), plan convexity (the convexity of contours), and other parameters. When GIS data is used for decision support applications, it may undergo aggregation and expansion operations using data warehousing, as we discussed in Section 28.3. In addition, geometric operations (to compute distances, areas, volumes), topological operations (to compute overlaps, intersections, shortest paths), and temporal operations (to compute interval-based or event-based queries) are involved. Analysis involves a number of temporal and spatial operations, which were discussed in Chapter 24.
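As a hedged illustration of two of these measurements, the following Python sketch estimates slope and aspect at one cell of an elevation grid by central differences. The grid layout (row index increasing northward, column index increasing eastward), the function name `slope_and_aspect`, and the sample values are assumptions made for the example. Aspect is taken here as the compass bearing of the uphill gradient, following the definition above; some tools report the downhill direction instead.

```python
import math
from typing import List, Tuple


def slope_and_aspect(dem: List[List[float]], row: int, col: int,
                     cell_size: float) -> Tuple[float, float]:
    """Estimate slope and aspect (both in degrees) at an interior grid cell.

    Assumes dem[row][col] holds elevations, rows run south to north,
    columns run west to east, and cells are square with side cell_size.
    """
    # Rate of change of altitude toward the east and toward the north.
    dz_east = (dem[row][col + 1] - dem[row][col - 1]) / (2.0 * cell_size)
    dz_north = (dem[row + 1][col] - dem[row - 1][col]) / (2.0 * cell_size)

    gradient = math.hypot(dz_east, dz_north)        # magnitude of the gradient
    slope_deg = math.degrees(math.atan(gradient))   # slope expressed as an angle

    # Compass bearing (0 = north, 90 = east) of the uphill direction.
    # On perfectly flat terrain the aspect is undefined; this returns 0.
    aspect_deg = math.degrees(math.atan2(dz_east, dz_north)) % 360.0
    return slope_deg, aspect_deg


# Hypothetical 3x3 grid: a plane rising one unit per cell toward the east.
sample = [[0.0, 1.0, 2.0],
          [0.0, 1.0, 2.0],
          [0.0, 1.0, 2.0]]
slope, aspect = slope_and_aspect(sample, 1, 1, cell_size=1.0)
```

For this sample plane the gradient is one unit of rise per unit of distance toward the east, so the slope is 45 degrees and the aspect is 90 degrees.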

Data Integration. GISs must integrate both vector and raster data from a variety of sources. Sometimes edges and regions are inferred from a raster image to form a vector model, or conversely, raster images such as aerial photographs are used to update vector models. Several coordinate systems such as Universal Transverse Mercator (UTM), latitude/longitude, and local cadastral systems are used to identify locations. Data originating from different coordinate systems requires appropriate transformations. Major public sources of geographic data, including the TIGER files maintained by the U.S. Department of Commerce, are used for road maps by many Web-based map drawing tools (e.g., http://maps.yahoo.com). Often there are high-accuracy, attribute-poor maps that have to be merged with low-accuracy, attribute-rich maps. This is done with a process called "rubber-banding," where the user defines a set of control points in both maps and the transformation of the low-accuracy map is accomplished by lining up the control points. A major integration issue is to create and maintain attribute information (such as air quality or traffic flow), which can be related to and integrated with appropriate geographical information over time as both evolve.

Data Capture. The first step in developing a spatial database for cartographic modeling is to capture the two-dimensional or three-dimensional geographical information in digital form, a process that is sometimes impeded by source map characteristics such as resolution, type of projection, map scales, cartographic licensing, diversity of measurement techniques, and coordinate system differences. Spatial data can also be captured from remote sensors in satellites such as Landsat, NOAA, and Advanced Very High Resolution Radiometer (AVHRR), as well as SPOT HRV (High Resolution Visible Range Instrument), which is free of interpretive bias and very accurate. For digital terrain modeling, data capture methods range from manual to fully automated. Ground surveys are the traditional approach and the most accurate, but they are very time consuming. Other techniques include photogrammetric sampling and digitizing cartographic documents.
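The control-point idea behind the rubber-banding process described under Data Integration can be sketched as follows. This is a deliberate simplification and an assumption on our part: it fits a single affine transformation exactly through three control-point pairs, whereas practical rubber-banding typically uses more control points and local, piecewise adjustments. The function names and coordinates are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]


def solve3(a: List[List[float]], b: Sequence[float]) -> List[float]:
    """Solve the 3x3 linear system a * x = b by Cramer's rule."""
    def det(m: List[List[float]]) -> float:
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
                - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
                + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

    d = det(a)
    if abs(d) < 1e-12:
        raise ValueError("control points are collinear or repeated")
    solution = []
    for k in range(3):
        # Replace column k of the matrix by the right-hand side.
        m = [[b[i] if j == k else a[i][j] for j in range(3)] for i in range(3)]
        solution.append(det(m) / d)
    return solution


def fit_affine(src: Sequence[Point], dst: Sequence[Point]) -> Callable[[float, float], Point]:
    """Return a mapping from source-map coordinates to target-map coordinates.

    The mapping x' = a*x + b*y + c, y' = d*x + e*y + f is determined exactly
    by three non-collinear control-point pairs.
    """
    a = [[x, y, 1.0] for x, y in src]
    ax, bx, cx = solve3(a, [p[0] for p in dst])
    ay, by, cy = solve3(a, [p[1] for p in dst])
    return lambda x, y: (ax * x + bx * y + cx, ay * x + by * y + cy)


# Hypothetical control points: the same three landmarks located in both maps.
low_accuracy = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
high_accuracy = [(100.0, 200.0), (120.0, 200.0), (100.0, 220.0)]

to_target = fit_affine(low_accuracy, high_accuracy)
moved = to_target(5.0, 5.0)  # any other point of the low-accuracy map
```

Every feature of the low-accuracy map is then passed through `to_target`, so its attributes can be attached to positions in the high-accuracy map.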

29.3.3 Specific GIS Data Operations

GIS applications are conducted through the use of special operators such as the following:

1. Interpolation: This process derives elevation data for points at which no samples have been taken. It includes computation at single points, computation for a rectangular grid or along a contour, and so forth. Most interpolation methods are based on triangulation that uses the TIN method for interpolating elevations inside the triangle based on those of its vertices.

2. Interpretation: Digital terrain modeling involves the interpretation of operations on terrain data such as editing, smoothing, reducing details, and enhancing. Additional operations involve patching or zipping the borders of triangles (in TIN data), and merging, which implies combining overlapping models and resolving conflicts among attribute data. Conversions among grid models, contour models, and TIN data are involved in the interpretation of the terrain.

3. Proximity analysis: Several classes of proximity analysis include computations of "zones of interest" around objects, such as the determination of a buffer around a car on a highway. Shortest path algorithms using 2D or 3D information are an important class of proximity analysis.

4. Raster image processing: This process can be divided into two categories: (1) map algebra, which is used to integrate geographic features on different map layers to produce new maps algebraically; and (2) digital image analysis, which deals with analysis of a digital image for features such as edge detection and object detection. Detecting roads in a satellite image of a city is an example of the latter.

5. Analysis of networks: Networks occur in GIS in many contexts that must be analyzed and may be subjected to segmentations, overlays, and so on. Network overlay refers to a type of spatial join where a given network (for example, a highway network) is joined with a point database (for example, incident locations) to yield, in this case, a profile of high-incident roadways.
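The interpolation operator in item 1 can be illustrated with a minimal sketch of linear interpolation inside one TIN triangle, using barycentric weights. This is one common way to interpolate from the vertices and is offered as an assumption, not as the method of any particular GIS; the function name and sample coordinates are hypothetical.

```python
from typing import Tuple

Vertex = Tuple[float, float, float]  # (x, y, elevation)


def interpolate_in_triangle(p: Tuple[float, float],
                            v1: Vertex, v2: Vertex, v3: Vertex) -> float:
    """Linearly interpolate the elevation at p from the triangle's three vertices."""
    x, y = p
    (x1, y1, z1), (x2, y2, z2), (x3, y3, z3) = v1, v2, v3

    # Twice the signed area of the triangle; zero means the vertices are collinear.
    denom = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    if denom == 0:
        raise ValueError("degenerate triangle")

    # Barycentric weights of p with respect to v1, v2, and v3.
    w1 = ((y2 - y3) * (x - x3) + (x3 - x2) * (y - y3)) / denom
    w2 = ((y3 - y1) * (x - x3) + (x1 - x3) * (y - y3)) / denom
    w3 = 1.0 - w1 - w2

    # A negative weight (beyond rounding error) means p is outside the triangle.
    if min(w1, w2, w3) < -1e-12:
        raise ValueError("point lies outside the triangle")
    return w1 * z1 + w2 * z2 + w3 * z3


# Hypothetical triangle with sampled elevations at its vertices.
z = interpolate_in_triangle((2.0, 3.0),
                            (0.0, 0.0, 100.0),
                            (10.0, 0.0, 200.0),
                            (0.0, 10.0, 300.0))
```

The three sample vertices lie on the plane z = 100 + 10x + 20y, so the interpolated elevation at (2, 3) is 180, up to floating-point rounding.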

Other Database Functionality. The functionality of a GIS database is also subject to other considerations.

• Extensibility: GISs are required to be extensible to accommodate a variety of constantly evolving applications and corresponding data types. If a standard DBMS is used, it must allow a core set of data types with a provision for defining additional types and methods for those types.

• Data quality control: As in many other applications, quality of source data is of paramount importance for providing accurate results to queries. This problem is particularly significant in the GIS context because of the variety of data, sources, and measurement techniques involved and the absolute accuracy expected by applications users.

• Visualization: A crucial function in GIS is related to visualization, that is, the graphical display of terrain information and the appropriate representation of application attributes to go with it. Major visualization techniques include (1) contouring through the use of isolines, spatial units of lines or arcs of equal attribute values; (2) hillshading, an illumination method used for qualitative relief depiction using varied light intensities for individual facets of the terrain model; and (3) perspective displays, three-dimensional images of terrain model facets using perspective projection methods from computer graphics. These techniques impose cartographic data and other three-dimensional objects on terrain data, providing animated scene renderings such as those in flight simulations and animated movies.

Such requirements clearly illustrate that standard RDBMSs or ODBMSs do not meet the special needs of GIS. It is therefore necessary to design systems that support the vector and raster representations and the spatial functionality as well as the required DBMS features. A popular GIS software called ARC/INFO, which is not a DBMS but integrates RDBMS functionality in the INFO part of the system, is briefly discussed in the subsection that follows. More systems are likely to be designed in the future to work with relational or object databases that will contain some of the spatial and most of the nonspatial information.

ARC/INFO, a popular GIS software launched in 1981 by Environmental System Research Institute (ESRI), uses the arc node model to store spatial data. A geographic layer, called coverage in ARC/INFO, consists of three primitives: (1) nodes (points), (2) arcs (similar to lines), and (3) polygons. The arc is the most important of the three and stores a large amount of topological information. An arc has a start node and an end node (and it therefore has direction too). In addition, the polygons to the left and the right of the arc are also stored along with each arc. As there is no restriction on the shape of the arc, shape points that have no topological information are also stored along with each arc. The database managed by the INFO RDBMS thus consists of three required tables: (1) node attribute table (NAT), (2) arc attribute table (AAT), and (3) polygon attribute table (PAT). Additional information can be stored in separate tables and joined with any of these three tables.

The NAT contains an internal ID for the node, a user-specified ID, the coordinates of the node, and any other information associated with that node (e.g., names of the intersecting roads at the node). The AAT contains an internal ID for the arc, a user-specified ID, the internal ID of the start and end nodes, the internal ID of the polygons to the left and the right, a series of coordinates of shape points (if any), the length of the arc, and any other data associated with the arc (e.g., the name of the road the arc represents). The PAT contains an internal ID for the polygon, a user-specified ID, the area of the polygon, the perimeter of the polygon, and any other associated data (e.g., name of the county the polygon represents).
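To make the three tables concrete, the sketch below models one row of each as a Python data class and follows an arc's node references into the NAT. The class and field names are our own illustrative choices, not ARC/INFO's actual column names, and the sample rows are invented.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class NodeRow:                 # one row of the node attribute table (NAT)
    internal_id: int
    user_id: str
    x: float
    y: float


@dataclass
class ArcRow:                  # one row of the arc attribute table (AAT)
    internal_id: int
    user_id: str
    start_node: int            # internal ID of the start node
    end_node: int              # internal ID of the end node
    left_polygon: int          # internal ID of the polygon to the left
    right_polygon: int         # internal ID of the polygon to the right
    length: float
    shape_points: List[Tuple[float, float]] = field(default_factory=list)


@dataclass
class PolygonRow:              # one row of the polygon attribute table (PAT)
    internal_id: int
    user_id: str
    area: float
    perimeter: float


def arc_endpoints(arc: ArcRow, nat: Dict[int, NodeRow]):
    """Join an AAT row with the NAT to obtain the arc's start and end coordinates."""
    start, end = nat[arc.start_node], nat[arc.end_node]
    return (start.x, start.y), (end.x, end.y)


# Hypothetical coverage fragment: two nodes joined by one arc between two polygons.
nat = {1: NodeRow(1, "N-1", 0.0, 0.0), 2: NodeRow(2, "N-2", 3.0, 4.0)}
pat = {100: PolygonRow(100, "P-100", 12.0, 16.0), 101: PolygonRow(101, "P-101", 20.0, 18.0)}
arc = ArcRow(10, "A-10", start_node=1, end_node=2,
             left_polygon=100, right_polygon=101, length=5.0)

endpoints = arc_endpoints(arc, nat)
```

Note that the arc row carries the polygon references; the polygon rows hold no list of their bounding arcs.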

Typical spatial queries are related to adjacency, containment, and connectivity. The arc node model has enough information to satisfy all three types of queries, but the RDBMS is not ideally suited for this type of querying. A simple example will highlight the number of times a relational database has to be queried to extract adjacency information. Assume that we are trying to determine whether two polygons, A and B, are adjacent to each other. We would have to exhaustively look at the entire AAT to determine whether there is an edge that has A on one side and B on the other. The search cannot be limited to the edges of either polygon, as we do not explicitly store all the arcs that make up a polygon in the PAT. Storing all the arcs in the PAT would be redundant because all the information is already there in the AAT.
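The adjacency test just described can be written as a relational query. The sketch below uses Python's standard `sqlite3` module with an in-memory table standing in for the AAT; the table name, column names, and sample rows are assumptions for illustration, and polygon identifiers are stored as text only for readability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE aat (
        arc_id        INTEGER PRIMARY KEY,
        left_polygon  TEXT,
        right_polygon TEXT
    )
""")
# Hypothetical arcs: each row records the polygons on its left and right sides.
conn.executemany(
    "INSERT INTO aat VALUES (?, ?, ?)",
    [(1, "A", "B"), (2, "B", "C"), (3, "C", "outside"), (4, "A", "outside")],
)


def are_adjacent(db: sqlite3.Connection, p: str, q: str) -> bool:
    """Return True if some arc has polygon p on one side and q on the other."""
    row = db.execute(
        """
        SELECT 1 FROM aat
        WHERE (left_polygon = ? AND right_polygon = ?)
           OR (left_polygon = ? AND right_polygon = ?)
        LIMIT 1
        """,
        (p, q, q, p),
    ).fetchone()
    return row is not None


a_b = are_adjacent(conn, "A", "B")   # an arc separates A and B
a_c = are_adjacent(conn, "A", "C")   # no arc has A and C on its two sides
```

With no index on the polygon columns, the query examines the arc table row by row, which mirrors the exhaustive scan described above.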

ESRI has released Arc/Storm (Arc Store Manager), which allows multiple users to use the same GIS, handles distributed databases, and integrates with other commercial RDBMSs like ORACLE, INFORMIX, and SYBASE. While it offers many performance and functional advantages over ARC/INFO, it is essentially an RDBMS embedded within a GIS.

29.3.5 Problems and Future Issues in GIS

GIS is an expanding application area of databases, reflecting an explosion in the number of end users using digitized maps, terrain data, space images, weather data, and traffic information support data. As a consequence, an increasing number of problems related to GIS applications have been generated and will need to be solved:

1. New architectures: GIS applications will need a new client-server architecture that will benefit from existing advances in RDBMS and ODBMS technology. One possible solution is to separate spatial from nonspatial data and to manage the latter entirely by a DBMS. Such a process calls for appropriate modeling and integration as both types of data evolve. Commercial vendors find that it is more viable to keep a small number of independent databases with an automatic posting of updates across them. Appropriate tools for data transfer, change management, and workflow management will be required.

2. Versioning and object life-cycle approach: Because of constantly evolving geographical features, GISs must maintain elaborate cartographic and terrain data, a management problem that might be eased by incremental updating coupled with update authorization schemes for different levels of users. Under the object life-cycle approach, which covers the activities of creating, destroying, and modifying objects as well as promoting versions into permanent objects, a complete set of methods may be predefined to control these activities for GIS objects.

3. Data standards: Because of the diversity of representation schemes and models, formalization of data transfer standards is crucial for the success of GIS. The international standardization body (ISO TC211) and the European standards body (CEN TC278) are now in the process of debating relevant issues, among them conversion between vector and raster data for fast query performance.

4. Matching applications and data structures: Looking again at Figure 29.3, we see that a classification of GIS applications is based on the nature and organization of data. In the future, systems covering a wide range of functions, from market analysis and utilities to car navigation, will need boundary-oriented data and functionality. On the other hand, applications in environmental science, hydrology, and agriculture will require more area-oriented and terrain model data. It is not clear that all this functionality can be supported by a single general-purpose GIS. The specialized needs of GISs will require that general-purpose DBMSs be enhanced with additional data types and functionality before full-fledged GIS applications can be supported.

5. Lack of semantics in data structures: This is evident especially in maps. Information such as highway and road crossings may be difficult to determine based on the stored data. One-way streets are also hard to represent in the present GISs. Transportation CAD systems have incorporated such semantics into GIS.

29.3.6 Selected Bibliography for GIS

There are a number of books written on GIS. Adam and Gangopadhyay (1997) and Laurini and Thompson (1992) focus on GIS database and information management problems. Kemp (1993) gives an overview of GIS issues and data sources. Huxhold (1991) gives an introduction to Urban GIS. Maguire et al. (1991) have a very good collection of GIS-related papers. Antenucci (1998) presents a discussion of the GIS technologies. Shekhar and Chawla (2002) discuss issues and approaches to spatial data management, which is at the core of all GIS. Demers (2002) is another recent book on the fundamentals of GIS. Bossomaier and Green (2002) is a primer on GIS operations, languages, metadata paradigms, and standards. Peng and Tsou (2003) discuss Internet GIS, which includes a suite of emerging new technologies aimed at making GIS more mobile, powerful, and flexible, as well as better able to share and communicate geographic information. The TIGER files for road data in the United States are managed by the U.S. Department of Commerce (1993). Laser-Scan's Web site (http://www.lsl.co.uk/papers) is a good source of information.

Environmental System Research Institute (ESRI) has an excellent library of GIS books for all levels at http://www.esri.com. The GIS terminology is defined at http://www.esri.com/library/glossary/glossary.html. The University of Edinburgh maintains a GIS WWW resource list at http://www.geo.ed.ac.uk/home/giswww.html.

29.4 Genome Data Management

The biological sciences encompass an enormous variety of information. Environmental science gives us a view of how species live and interact in a world filled with natural phenomena. Biology and ecology study particular species. Anatomy focuses on the overall structure of an organism, documenting the physical aspects of individual bodies. Traditional medicine and physiology break the organism into systems and tissues and strive to collect information on the workings of these systems and the organism as a whole. Histology and cell biology delve into the tissue and cellular levels and provide knowledge about the inner structure and function of the cell. This wealth of information that has been generated, classified, and stored for centuries has only recently become a major application of database technology.

Genetics has emerged as an ideal field for the application of information technology. In a broad sense, it can be thought of as the construction of models based on information about genes (which can be defined as basic units of heredity) and populations, and the seeking out of relationships in that information. The study of genetics can be divided into three branches: (1) Mendelian genetics, (2) molecular genetics, and (3) population genetics. Mendelian genetics is the study of the transmission of traits between generations. Molecular genetics is the study of the chemical structure and function of genes at the molecular level. Population genetics is the study of how genetic information varies across populations of organisms.

Molecular genetics provides a more detailed look at genetic information by allowing researchers to examine the composition, structure, and function of genes. The origins of molecular genetics can be traced to two important discoveries. The first occurred in 1869, when Friedrich Miescher discovered nuclein and its primary component, deoxyribonucleic acid (DNA). In subsequent research, DNA and a related compound, ribonucleic acid (RNA), were found to be composed of nucleotides (a sugar, a phosphate, and a base, which combine to form nucleic acid) linked into long polymers via the sugar and phosphate. The second discovery was the demonstration in 1944 by Oswald Avery that DNA was indeed the molecular substance carrying genetic information. Genes were thus shown to be composed of chains of nucleic acids arranged linearly on chromosomes and to serve three primary functions: (1) replicating genetic information between generations, (2) providing blueprints for the creation of polypeptides, and (3) accumulating changes, thereby allowing evolution to occur. Watson and Crick found the double-helix structure of DNA in 1953, which gave molecular genetics research a new direction.⁶ Discovery of DNA and its structure is hailed as probably the most important biological work of the last 100 years, and the field it opened may be the scientific frontier for the next 100. In 1962, Watson, Crick, and Wilkins won the Nobel Prize for physiology/medicine for this breakthrough.⁷

29.4.2 Characteristics of Biological Data

Biological data exhibits many special characteristics that make management of biological information a particularly challenging problem. We will thus begin by summarizing the characteristics related to biological information, focusing on a multidisciplinary field called bioinformatics that has emerged, with graduate degree programs now in place in several universities. Bioinformatics addresses information management of genetic information with special emphasis on DNA sequence analysis. It needs to be broadened into a wider scope to harness all types of biological information: its modeling, storage, retrieval, and management. Moreover, applications of bioinformatics span design of targets for drugs, study of mutations and related diseases, anthropological investigations on migration patterns of tribes, and therapeutic treatments.

Characteristic 1: Biological data is highly complex when compared with most other domains or applications. Definitions of such data must thus be able to represent a complex substructure of data as well as relationships and to ensure that no information is lost during biological data modeling. The structure of biological data often provides an additional context for interpretation of the information. Biological information systems must be able to represent any level of complexity in any data schema, relationship, or schema substructure, not just hierarchical, binary, or table data. As an example, MITOMAP is a database documenting the human mitochondrial genome.⁸ This single genome is a small, circular piece of DNA encompassing information about 16,569 nucleotide bases; 52 gene loci encoding messenger RNA, ribosomal RNA, and transfer RNA; 1000 known population variants; over 60 known disease associations; and a limited set of knowledge on the complex molecular interactions of the biochemical energy-producing pathway of oxidative phosphorylation. As might be expected, its management has encountered a large number of problems; we have been unable to use the traditional RDBMS or ODBMS approaches to capture all aspects of the data.

6 See Nature, 171:737, 1953.

7 http://www.pbs.org/wgbh/aso/databank/entries/doS3dn.html

Characteristic 2: The amount and range of variability in data is high. Hence, biological systems must be flexible in handling data types and values. With such a wide range of possible data values, placing constraints on data types must be limited, since this may exclude unexpected values (e.g., outlier values) that are particularly common in the biological domain. Exclusion of such values results in a loss of information. In addition, frequent exceptions to biological data structures may require a choice of data types to be available for a given piece of data.

Characteristic 3: Schemas in biological databases change at a rapid pace. Hence, for improved information flow between generations or releases of databases, schema evolution and data object migration must be supported. The ability to extend the schema, a frequent occurrence in the biological setting, is unsupported in most relational and object database systems. Presently, systems such as GenBank rerelease the entire database with new schemas once or twice a year rather than incrementally changing the system as changes become necessary. An evolutionary database, by contrast, would provide a timely and orderly mechanism for following changes to individual data entities in biological databases over time. This sort of tracking is important for biological researchers to be able to access and reproduce previous results.

Characteristic 4: Representations of the same data by different biologists will likely be different (even when using the same system). Hence, mechanisms for "aligning" different biological schemas or different versions of schemas should be supported. Given the complexity of biological data, there are a multitude of ways of modeling any given entity, with the results often reflecting the particular focus of the scientist. While two individuals may produce different data models if asked to interpret the same entity, these models will likely have numerous points in common. In such situations, it would be useful to biological investigators to be able to run queries across these common points. This could be accomplished by linking data elements in a network of schemas.
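One very simple way to picture such linking is a table that maps a shared concept to the field that represents it in each schema. The sketch below is only an illustration under that assumption; the schema names (`lab_a`, `lab_b`), field names, and records are invented, and a real alignment mechanism would need to handle far richer structures.

```python
from typing import Dict, List, Tuple

# Two hypothetical models of the same kind of entity, built by different biologists.
sources: Dict[str, List[dict]] = {
    "lab_a": [{"gene_symbol": "geneX", "start": 100, "end": 460}],
    "lab_b": [{"locus": "geneX", "product": "hypothetical protein"},
              {"locus": "geneY", "product": "hypothetical enzyme"}],
}

# The links between schemas: a shared concept mapped to each schema's field name.
alignment: Dict[str, Dict[str, str]] = {
    "gene_name": {"lab_a": "gene_symbol", "lab_b": "locus"},
}


def query_common(concept: str, value: str) -> List[Tuple[str, dict]]:
    """Find records in every aligned schema whose field for `concept` equals `value`."""
    hits: List[Tuple[str, dict]] = []
    for source_name, records in sources.items():
        field_name = alignment[concept].get(source_name)
        if field_name is None:
            continue  # this schema has no element linked to the concept
        hits.extend((source_name, r) for r in records if r.get(field_name) == value)
    return hits


matches = query_common("gene_name", "geneX")
```

A single query on the common point `gene_name` thus returns the coordinates recorded in one model together with the product recorded in the other, without forcing either biologist to adopt the other's schema.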

Characteristic 5: Most users of biological data do not require write access to the database; read-only access is adequate. Write access is limited to privileged users called curators. For example, the database created as part of the MITOMAP project has on average more than

8 Details of MITOMAP and its information complexity can be seen in Kogelnik et al. (1997, 1998) and at http://www.mitomap.org
