DATABASE SYSTEMS (part 22)


Distributed database design has been addressed in terms of horizontal and vertical fragmentation, allocation, and replication. Ceri et al. (1982) defined the concept of minterm horizontal fragments. Ceri et al. (1983) developed an integer programming based optimization model for horizontal fragmentation and allocation. Navathe et al. (1984) developed algorithms for vertical fragmentation based on attribute affinity and showed a variety of contexts for vertical fragment allocation. Wilson and Navathe (1986) present an analytical model for optimal allocation of fragments. Elmasri et al. (1987) discuss fragmentation for the ECR model; Karlapalem et al. (1994) discuss issues for distributed design of object databases. Navathe et al. (1996) discuss mixed fragmentation by combining horizontal and vertical fragmentation; Karlapalem et al. (1996) present a model for redesign of distributed databases.

Distributed query processing, optimization, and decomposition are discussed in Hevner and Yao (1979), Kerschberg et al. (1982), Apers et al. (1983), Ceri and Pelagatti (1984), and Bodorick et al. (1992). Bernstein and Goodman (1981) discuss the theory behind semijoin processing. Wong (1983) discusses the use of relationships in relation fragmentation. Concurrency control and recovery schemes are discussed in Bernstein and Goodman (1981a). Kumar and Hsu (1998) have some articles related to recovery in distributed databases. Elections in distributed systems are discussed in Garcia-Molina (1982). Lamport (1978) discusses problems with generating unique timestamps in a distributed system.

A concurrency control technique for replicated data that is based on voting is presented by Thomas (1979). Gifford (1979) proposes the use of weighted voting, and Paris (1986) describes a method called voting with witnesses. Jajodia and Mutchler (1990) discuss dynamic voting. A technique called available copy is proposed by Bernstein and Goodman (1984), and one that uses the idea of a group is presented in El Abbadi and Toueg (1988). Other recent work that discusses replicated data includes Gladney (1989), Agrawal and El Abbadi (1990), El Abbadi and Toueg (1990), Kumar and Segev (1993), Mukkamala (1989), and Wolfson and Milo (1991). Bassiouni (1988) discusses optimistic protocols for DDB concurrency control. Garcia-Molina (1983) and Kumar and Stonebraker (1987) discuss techniques that use the semantics of the transactions. Distributed concurrency control techniques based on locking and distinguished copies are presented by Menasce et al. (1980) and Minoura and Wiederhold (1982). Obermarck (1982) presents algorithms for distributed deadlock detection.

A survey of recovery techniques in distributed systems is given by Kohler (1981). Reed (1983) discusses atomic actions on distributed data. A book edited by Bhargava (1987) presents various approaches and techniques for concurrency and reliability in distributed systems.

Federated database systems were first defined in McLeod and Heimbigner (1985). Techniques for schema integration in federated databases are presented by Elmasri et al. (1986), Batini et al. (1986), Hayne and Ram (1990), and Motro (1987). Elmagarmid and Helal (1988) and Gamal-Eldin et al. (1988) discuss the update problem in heterogeneous DDBSs. Heterogeneous distributed database issues are discussed in Hsiao and Kamel (1989). Sheth and Larson (1990) present an exhaustive survey of federated database management.


Selected Bibliography I 837

Recently, multidatabase systems and interoperability have become important topics. Techniques for dealing with semantic incompatibilities among multiple databases are examined in DeMichiel (1989), Siegel and Madnick (1991), Krishnamurthy et al. (1991), and Wang and Madnick (1989). Castano et al. (1998) present an excellent survey of techniques for analysis of schemas. Pitoura et al. (1995) discuss object orientation in multidatabase systems.

Transaction processing in multidatabases is discussed in Mehrotra et al. (1992), Georgakopoulos et al. (1991), Elmagarmid et al. (1990), and Breitbart et al. (1990), among others. Elmagarmid et al. (1992) discuss transaction processing for advanced applications, including engineering applications discussed in Heiler et al. (1992).

Workflow systems, which are becoming popular for managing information in complex organizations, use multilevel and nested transactions in conjunction with distributed databases. Weikum (1991) discusses multilevel transaction management. Alonso et al. (1997) discuss limitations of current workflow systems.

A number of experimental distributed DBMSs have been implemented. These include distributed INGRES (Epstein et al., 1978), DDTS (Devor and Weeldreyer, 1980), SDD-1 (Rothnie et al., 1980), System R* (Lindsay et al., 1984), SIRIUS-DELTA (Ferrier and Stangret, 1982), and MULTIBASE (Smith et al., 1981). The OMNIBASE system (Rusinkiewicz et al., 1988) and the Federated Information Base developed using the Candide data model (Navathe et al., 1994) are examples of federated DDBMSs. Pitoura et al. (1995) present a comparative survey of the federated database system prototypes. Most commercial DBMS vendors have products using the client-server approach and offer distributed versions of their systems. Some system issues concerning client-server DBMS architectures are discussed in Carey et al. (1991), DeWitt et al. (1990), and Wang and Rowe (1991). Khoshafian et al. (1992) discuss design issues for relational DBMSs in the client-server environment. Client-server management issues are discussed in many books, such as Zantinge and Adriaans (1996).


EMERGING TECHNOLOGIES


XML and Internet Databases

We now turn our attention to how databases are used and accessed from the Internet. Many electronic commerce (e-commerce) and other Internet applications provide Web interfaces to access information stored in one or more databases. These databases are often referred to as data sources. It is common to use two-tier and three-tier client-server architectures for Internet applications (see Section 2.5). In some cases, other variations of the client-server model are used. E-commerce and other Internet database applications are designed to interact with the user through Web interfaces that display Web pages. The common method of specifying the contents and formatting of Web pages is through the use of hypertext documents. There are various languages for writing these documents, the most common being HTML (Hypertext Markup Language). Although HTML is widely used for formatting and structuring Web documents, it is not suitable for specifying structured data that is extracted from databases. A newer language, XML (Extensible Markup Language), has emerged as the standard for structuring and exchanging data over the Web. XML can be used to provide information about the structure and meaning of the data in the Web pages rather than just specifying how the Web pages are formatted for display on the screen. The formatting aspects are specified separately, for example by using a formatting language such as XSL (Extensible Stylesheet Language).

This chapter describes the basics of accessing and exchanging information over the Internet. We start in Section 26.1 by discussing how traditional Web pages differ from structured databases, and discuss the differences between structured, semistructured, and unstructured data. Then in Section 26.2 we turn our attention to the XML standard and its tree-structured (hierarchical) data model. Section 26.3 discusses XML documents and the languages for specifying the structure of these documents, namely, XML DTD (Document Type Definition) and XML schema. Section 26.4 presents the various approaches for storing XML documents, whether in their native (text) format, in a compressed form, or in relational and other types of databases. Section 26.5 gives an overview of the languages proposed for querying XML data. Section 26.6 summarizes the chapter.

26.1 STRUCTURED, SEMISTRUCTURED, AND UNSTRUCTURED DATA

The information stored in databases is known as structured data because it is represented in a strict format. For example, each record in a relational database table (such as the EMPLOYEE table in Figure 5.6) follows the same format as the other records in that table. For structured data, it is common to carefully design the database using techniques such as those described in Chapters 3, 4, 7, 10, and 11 in order to create the database schema. The DBMS then checks to ensure that all data follows the structures and constraints specified in the schema.

However, not all data is collected and inserted into carefully designed structured databases. In some applications, data is collected in an ad hoc manner before it is known how it will be stored and managed. This data may have a certain structure, but not all the information collected will have identical structure. Some attributes may be shared among the various entities, but other attributes may exist only in a few entities. Moreover, additional attributes can be introduced in some of the newer data items at any time, and there is no predefined schema. This type of data is known as semistructured data. A number of data models have been introduced for representing semistructured data, often based on using tree or graph data structures rather than the flat relational model structures.

A key difference between structured and semistructured data concerns how the schema constructs (such as the names of attributes, relationships, and entity types) are handled. In semistructured data, the schema information is mixed in with the data values, since each data object can have different attributes that are not known in advance. Hence, this type of data is sometimes referred to as self-describing data. Consider the following example. We want to collect a list of bibliographic references related to a certain research project. Some of these may be books or technical reports, others may be research articles in journals or conference proceedings, and still others may refer to complete journal issues or conference proceedings. Clearly, each of these may have different attributes and different types of information. Even for the same type of reference (say, conference articles) we may have different information. For example, one article citation may be quite complete, with full information about author names, title, proceedings, page numbers, and so on, whereas another citation may not have all the information available. New types of bibliographic sources may appear in the future (for example, references to Web pages or to conference tutorials), and these may have new attributes that describe them.
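The bibliographic example can be sketched in a few lines of Python. This is only an illustration: the records, titles, and attribute names below are hypothetical and invented for the sketch. The point is that each object carries its own attribute names and that no schema is declared in advance.

```python
# Hypothetical bibliographic references; every attribute name and value here
# is invented for illustration, and none is fixed by a predefined schema.
references = [
    {"type": "book", "title": "Title A", "authors": ["Author One"],
     "publisher": "Publisher X"},
    {"type": "conference article", "title": "Title B",
     "authors": ["Author Two", "Author Three"],
     "proceedings": "Proceedings Y", "pages": "10-20"},
    # An incomplete citation: same kind of reference, fewer attributes.
    {"type": "conference article", "title": "Title C"},
    # A newer kind of source that introduces an attribute no other record has.
    {"type": "web page", "title": "Title D", "url": "http://example.org/d"},
]


def attribute_usage(records):
    """Count how many records carry each attribute name."""
    usage = {}
    for record in records:
        for name in record:
            usage[name] = usage.get(name, 0) + 1
    return usage


usage = attribute_usage(references)
shared = sorted(n for n, count in usage.items() if count == len(references))
partial = sorted(n for n, count in usage.items() if count < len(references))
print("attributes in every record:", shared)
print("attributes in only some records:", partial)
```

A program that consumes such data has to discover the attribute names by inspecting the objects themselves, which is what "self-describing" means here.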


FIGURE 26.1 Representing semistructured data as a graph

Semistructured data may be displayed as a directed graph, as shown in Figure 26.1. The information shown in Figure 26.1 corresponds to some of the structured data shown in Figure 5.6. As we can see, this model somewhat resembles the object model (see Figure 20.1) in its ability to represent complex objects and nested structures. In Figure 26.1, the labels or tags on the directed edges represent the schema names: the names of attributes, object types (or entity types or classes), and relationships. The internal nodes represent individual objects or composite attributes. The leaf nodes represent actual data values of simple (atomic) attributes.
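One possible way to hold such a graph in a program is as nested dictionaries and lists. The sketch below is written in the spirit of Figure 26.1 but is not a copy of it: the SSN values and the choice of attributes are assumptions made for illustration. Dictionary keys play the role of edge labels, dictionaries and lists are internal nodes, and strings are leaf nodes.

```python
# Hypothetical data; keys act as edge labels (schema names), dicts and lists
# are internal nodes, and strings are leaf nodes holding atomic values.
company_projects = {
    "project": {
        "Name": "ProductX",
        "Number": "1",
        "Worker": [
            {"SSN": "000-00-0001", "FirstName": "John",
             "LastName": "Smith", "hours": "32.5"},
            # This worker has no LastName edge; no schema forbids that.
            {"SSN": "000-00-0002", "FirstName": "Joyce", "hours": "20.0"},
        ],
    }
}


def leaf_paths(node, path=()):
    """Yield (edge-label path, value) for every leaf reachable from node."""
    if isinstance(node, dict):
        for label, child in node.items():
            yield from leaf_paths(child, path + (label,))
    elif isinstance(node, list):
        for child in node:
            yield from leaf_paths(child, path)
    else:
        yield path, node


for path, value in leaf_paths(company_projects):
    print("/".join(path), "=", value)
```

Every printed line pairs a path of schema names with a data value, which shows the schema information travelling together with the data.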

There are two main differences between the semistructured model and the object model that we discussed in Chapter 20:

1. The schema information (names of attributes, relationships, and classes, or object types) in the semistructured model is intermixed with the objects and their data values in the same data structure.

2. In the semistructured model, there is no requirement for a predefined schema to which the data objects must conform.

In addition to structured and semistructured data, a third category exists, known as unstructured data because there is very limited indication of the type of data. A typical example is a text document that contains information embedded within it. Web pages in HTML that contain some data are considered to be unstructured data. Consider part of an HTML file, shown in Figure 26.2. Text that appears between angled brackets, < >, is an HTML tag. A tag with a slash, </ >, indicates an end tag, which represents the

Trang 7

<head>
</head>
<body>
<H1>List of company projects and the employees in each project</H1>
<H2>The ProductX project:</H2>
<TR>
<TD width="50%"><font size="2" face="Arial">John Smith:</font></TD>
<TD>32.5 hours per week</TD>
</TR>
<TR>
<TD width="50%"><font size="2" face="Arial">Joyce English:</font></TD>
<TD>20.0 hours per week</TD>
</TR>
</table>
<H2>The ProductY project:</H2>
<TR>
<TD width="50%"><font size="2" face="Arial">John Smith:</font></TD>
</TR>
<TR>
<TD width="50%"><font size="2" face="Arial">Joyce English:</font></TD>
</TR>
<TR>
<TD width="50%"><font size="2" face="Arial">Franklin Wong:</font></TD>
</TR>
</table>
</body>
</html>

FIGURE 26.2 Part of an HTML document representing unstructured data

ending of the effect of a matching start tag. The tags mark up the document¹ in order to instruct an HTML processor how to display the text between a start tag and a matching end tag. Hence, the tags specify document formatting rather than the meaning of the various data elements in the document. HTML tags specify information, such as font size and style (boldface, italics, and so on), color, heading levels in documents, and so on. Some tags provide text structuring in documents, such as specifying a numbered or unnumbered list or a table. Even these structuring tags specify that the embedded textual data is to be displayed in a certain manner, rather than indicating the type of data represented in the table.

1. That is why it is known as Hypertext Markup Language.

HTML uses a large number of predefined tags, which are used to specify a variety of commands for formatting Web documents for display. The start and end tags specify the range of text to be formatted by each command. A few examples of the tags shown in Figure 26.2 follow:

• The <html> </html> tags specify the boundaries of the document.

• The document header information, within the <head> </head> tags, specifies various commands that will be used elsewhere in the document. For example, it may specify various script functions in a language such as JavaScript or PERL, or certain formatting styles (fonts, paragraph styles, header styles, and so on) that can be used in the document. It can also specify a title to indicate what the HTML file is for, and other similar information that will not be displayed as part of the document.

• The body of the document, specified within the <body> </body> tags, includes the document text and the markup tags that specify how the text is to be formatted and displayed. It can also include references to other objects, such as images, videos, voice messages, and other documents.

• The <H1> </H1> tags specify that the text is to be displayed as a level 1 heading. There are many heading levels (<H2>, <H3>, and so on), each displaying text in a less prominent heading format.

• The <table> </table> tags specify that the following text is to be displayed as a table. Each row in the table is enclosed within <TR> </TR> tags, and the actual text data in a row is displayed within <TD> </TD> tags.²

• Some tags may have attributes, which appear within the start tag and describe additional properties of the tag.³ In Figure 26.2, the <table> start tag has four attributes describing various characteristics of the table. The following <TD> and <font> start tags have one and two attributes, respectively.

HTML has a very large number of predefined tags, and whole books are devoted to describing how to use these tags. If designed properly, HTML documents can be formatted so that humans are able to easily understand the document contents, and are able to navigate through the resulting Web documents. However, the source HTML text documents are very difficult to interpret automatically by computer programs because they do not include schema information about the type of data in the documents. As e-commerce and other Internet applications become increasingly automated, it is becoming crucial to be able to exchange Web documents among various computer sites and to interpret their contents automatically. This need was one of the reasons that led to the development of XML, which we discuss in the next section.
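To make the difficulty concrete, the following sketch uses Python's standard html.parser module on one table row modeled on Figure 26.2. The class name CellCollector is our own, chosen for illustration. The program can recover the text of each cell, but nothing in the markup tells it what the cells mean.

```python
from html.parser import HTMLParser

# One table row modeled on Figure 26.2.
HTML_FRAGMENT = """
<TR>
<TD width="50%"><font size="2" face="Arial">John Smith:</font></TD>
<TD>32.5 hours per week</TD>
</TR>
"""


class CellCollector(HTMLParser):
    """Collect the text found inside each <TD> cell."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag names in lower case.
        if tag == "td":
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data.strip()


collector = CellCollector()
collector.feed(HTML_FRAGMENT)
print(collector.cells)
```

The tags seen by the program are td and font, which describe layout only. That the first cell is an employee name and the second a number of hours is a convention a human reader infers; a program would have to rely on cell position or on pattern matching over the text.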

2. <TR> stands for table row, and <TD> for table data.

3. This is how the term attribute is used in document markup languages, which differs from how it is used in database models.


26.2 XML HIERARCHICAL (TREE) DATA MODEL

We now introduce the data model used in XML. The basic object in XML is the XML document. Two main structuring concepts are used to construct an XML document: elements and attributes. It is important to note right away that the term attribute in XML is not used in the same manner as is customary in database terminology, but rather as it is used in document description languages such as HTML and SGML.⁴ Attributes in XML provide additional information that describes elements, as we shall see. There are additional concepts in XML, such as entities, identifiers, and references, but we first concentrate on describing elements and attributes to show the essence of the XML model.

Figure 26.3 shows an example of an XML element called <projects>. As in HTML, elements are identified in a document by their start tag and end tag. The tag names are enclosed between angled brackets < >, and end tags are further identified by a slash, </ >.⁵ Complex elements are constructed from other elements hierarchically, whereas simple elements contain data values. A major difference between XML and HTML is that XML tag names are defined to describe the meaning of the data elements in the document, rather than to describe how the text is to be displayed. This makes it possible to process the data elements in the XML document automatically by computer programs.

It is straightforward to see the correspondence between the XML textual representation shown in Figure 26.3 and the tree structure shown in Figure 26.1. In the tree representation, internal nodes represent complex elements, whereas leaf nodes represent simple elements. That is why the XML model is called a tree model or a hierarchical model. In Figure 26.3, the simple elements are the ones with the tag names <Name>, <Number>, <Location>, <DeptNo>, <SSN>, <LastName>, <FirstName>, and <hours>. The complex elements are the ones with the tag names <projects>, <project>, and <Worker>. In general, there is no limit on the levels of nesting of elements.
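The distinction between simple and complex elements can be checked mechanically. The sketch below uses Python's standard xml.etree.ElementTree module on a small document that borrows the tag names listed above. The document text is an assumption made for illustration, not a reproduction of Figure 26.3, and all values (including the SSN) are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical document using the tag names listed above; the values are
# invented for this sketch.
XML_TEXT = """<?xml version="1.0" standalone="yes"?>
<projects>
  <project>
    <Name>ProductX</Name>
    <Number>1</Number>
    <Location>Bellaire</Location>
    <DeptNo>5</DeptNo>
    <Worker>
      <SSN>000-00-0001</SSN>
      <LastName>Smith</LastName>
      <FirstName>John</FirstName>
      <hours>32.5</hours>
    </Worker>
  </project>
</projects>
"""

root = ET.fromstring(XML_TEXT)

simple, complex_ = set(), set()
for element in root.iter():
    # An element with child elements is complex (an internal node);
    # an element that holds only a data value is simple (a leaf node).
    if len(element) > 0:
        complex_.add(element.tag)
    else:
        simple.add(element.tag)

print("complex:", sorted(complex_))
print("simple:", sorted(simple))
```

Because the tag names describe the data rather than its appearance, the program can ask for hours or SSN by name, something that was not possible with the HTML tags of Figure 26.2.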

In general, it is possible to characterize three main types of XML documents:

• Data-centric XML documents: These documents have many small data items that follow a specific structure and hence may be extracted from a structured database. They are formatted as XML documents in order to exchange them or display them over the Web.

• Document-centric XML documents: These are documents with large amounts of text, such as news articles or books. There are few or no structured data elements in these documents.

• Hybrid XML documents: These documents may have parts that contain structured data and other parts that are predominantly textual or unstructured.

It is important to note that data-centric XML documents can be considered either as semistructured data or as structured data. If an XML document conforms to a predefined XML schema or DTD (see Section 26.3), then the document can be considered as structured data. On the other hand, XML allows documents that do not conform to any schema; these would be considered as semistructured data. The latter are also known as schemaless XML documents. When the value of the standalone attribute in an XML document is "yes", as in the first line of Figure 26.3, the document is standalone and schemaless.

FIGURE 26.3 A complex XML element called <projects>

4. SGML (Standard Generalized Markup Language) is a more general language for describing documents and provides capabilities for specifying new tags. However, it is more complex than HTML and XML.

5. The left and right angled bracket characters (< and >) are reserved characters, as are the ampersand (&), apostrophe ('), and double quotation mark ("). To include them within the text of a document, they must be encoded as &lt;, &gt;, &amp;, &apos;, and &quot;, respectively.

XML attributes are generally used in a manner similar to how they are used in HTML (see Figure 26.2), namely, to describe properties and characteristics of the elements (tags) within which they appear. It is also possible to use XML attributes to hold the values of simple data elements; however, this is definitely not recommended. We discuss XML attributes further in Section 26.3 when we discuss XML schema and DTD.

26.3 XML DOCUMENTS, DTD, AND XML SCHEMA

26.3.1 Well-Formed and Valid XML Documents and XML DTD

In Figure 26.3, we saw what a simple XML document may look like. An XML document is well formed if it follows a few conditions. In particular, it should start with an XML declaration to indicate the version of XML being used as well as any other relevant attributes, as shown in the first line of Figure 26.3. It must also follow the syntactic guidelines of the tree model. This means that there should be a single root element, and every element must include a matching pair of start and end tags within the start and end tags of the parent element. This ensures that the nested elements specify a well-formed tree structure.
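The two structural conditions (a single root element and properly nested start and end tags) are what a generic parser checks. As a minimal sketch, assuming Python's standard xml.etree.ElementTree parser as the generic processor, a well-formedness test can be written as follows; the helper name is_well_formed and the sample strings are ours.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts text as a well-formed XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


samples = {
    "single root, properly nested":
        "<projects><project><Name>ProductX</Name></project></projects>",
    "end tags in the wrong order":
        "<projects><project></projects></project>",
    "two root elements":
        "<project></project><project></project>",
}
results = {label: is_well_formed(text) for label, text in samples.items()}
for label, ok in results.items():
    print(label, "->", ok)
```

Note that this parser accepts the first sample even though it carries no XML declaration, so the declaration is not what the check above depends on.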

A well-formed XML document is syntactically correct. This allows it to be processed by generic processors that traverse the document and create an internal tree representation. A standard set of API (application programming interface) functions called DOM (Document Object Model) allows programs to manipulate the resulting tree representation corresponding to a well-formed XML document. However, the whole document must be parsed beforehand when using DOM. Another API called SAX (Simple API for XML) allows processing of XML documents on the fly by notifying the processing program whenever a start or end tag is encountered. This makes it easier to process large documents and allows for processing of so-called streaming XML documents, where the processing program can process the tags as they are encountered.
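The contrast between the two APIs can be seen in a short sketch that uses the DOM and SAX implementations shipped with Python (xml.dom.minidom and xml.sax). The document text and the handler class NameHandler are assumptions made for illustration; both halves extract the same project names.

```python
import xml.dom.minidom
import xml.sax

XML_TEXT = (
    "<projects>"
    "<project><Name>ProductX</Name></project>"
    "<project><Name>ProductY</Name></project>"
    "</projects>"
)

# DOM: the whole document is parsed into a tree before it can be queried.
dom = xml.dom.minidom.parseString(XML_TEXT)
dom_names = [node.firstChild.data for node in dom.getElementsByTagName("Name")]


# SAX: the parser calls back as each start tag, text run, and end tag is seen,
# so no tree of the whole document is ever built.
class NameHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.names = []
        self._buffer = None  # collects text while inside a <Name> element

    def startElement(self, name, attrs):
        if name == "Name":
            self._buffer = []

    def characters(self, content):
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "Name":
            self.names.append("".join(self._buffer))
            self._buffer = None


handler = NameHandler()
xml.sax.parseString(XML_TEXT.encode("utf-8"), handler)

print(dom_names)
print(handler.names)
```

The SAX handler keeps only the state it needs (one small buffer), which is why this style suits large or streaming documents; the DOM version is shorter but holds the entire tree in memory.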

A well-formed XML document can have any tag names for the elements within the document. There is no predefined set of elements (tag names) that a program processing the document knows to expect. This gives the document creator the freedom to specify new elements, but limits the possibilities for automatically interpreting the elements within the document.

<!DOCTYPE projects [
    <!ELEMENT projects (project+)>
    <!ELEMENT project (Name, Number, Location, DeptNo?, Workers)>
    <!ELEMENT Name (#PCDATA)>
    <!ELEMENT Number (#PCDATA)>
    <!ELEMENT Location (#PCDATA)>
    <!ELEMENT DeptNo (#PCDATA)>
    <!ELEMENT Workers (Worker*)>
    <!ELEMENT Worker (SSN, LastName?, FirstName?, hours)>
    <!ELEMENT SSN (#PCDATA)>
    <!ELEMENT LastName (#PCDATA)>
    <!ELEMENT FirstName (#PCDATA)>
    <!ELEMENT hours (#PCDATA)>
]>

FIGURE 26.4 An XML DTD file called projects


A stronger criterion is for an XML document to be valid. In this case, the document must be well formed, and in addition the element names used in the start and end tag pairs must follow the structure specified in a separate XML DTD (Document Type Definition) file or XML schema file. We first discuss XML DTD here, then give an overview of XML schema in Section 26.3.2. Figure 26.4 shows a simple XML DTD file, which specifies the elements (tag names) and their nested structures. Any valid documents conforming to this DTD should follow the specified structure. A special syntax exists for specifying DTD files, as illustrated in Figure 26.4. First, a name is given to the root tag of the document, which is called projects in the first line of Figure 26.4. Then the elements and their nested structure are specified.

When specifying elements, the following notation is used:

• A * following the element name means that the element can be repeated zero or more times in the document. This kind of element is known as an optional multivalued (repeating) element.

• A + following the element name means that the element can be repeated one or more times in the document. This kind of element is a required multivalued (repeating) element.

• A ? following the element name means that the element can be repeated zero or one times. This kind is an optional single-valued (nonrepeating) element.

• An element appearing without any of the preceding three symbols must appear exactly once in the document. This kind is a required single-valued (nonrepeating) element.

• The type of the element is specified via parentheses following the element. If the parentheses include names of other elements, these latter elements are the children of the element in the tree structure. If the parentheses include the keyword #PCDATA or one of the other data types available in XML DTD, the element is a leaf node. PCDATA stands for parsed character data, which is roughly similar to a string data type.

• Parentheses can be nested when specifying elements.

• A bar symbol (e1 | e2) specifies that either e1 or e2 can appear in the document.
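The notation above closely resembles regular-expression syntax, and one way to get a feel for it is to translate a content model into a regular expression over the sequence of child element names. The sketch below does this in Python. It is a simplification of our own and not how a validating parser is necessarily implemented: it handles only element content (names, commas, bars, parentheses, and the symbols ?, *, +), not #PCDATA or attributes, and the function names are hypothetical.

```python
import re


def content_model_to_regex(model):
    """Translate a DTD content model such as '(SSN, LastName?, hours)' into a
    compiled regular expression over space-terminated child element names."""

    def translate(match):
        token = match.group()
        if token[0].isalpha() or token[0] == "_":
            # An element name becomes a group matching that name plus a space.
            return "(?:" + re.escape(token) + " )"
        return ""  # commas and whitespace only separate items in a sequence

    # Parentheses, |, ?, * and + are left untouched: they mean the same thing
    # in a regular expression as in the DTD notation.
    pattern = re.sub(r"[A-Za-z_][\w.\-]*|[,\s]+", translate, model)
    return re.compile(pattern)


def children_conform(model, child_names):
    """Check whether a list of child element names matches the content model."""
    encoded = "".join(name + " " for name in child_names)
    return content_model_to_regex(model).fullmatch(encoded) is not None


# Content models copied from Figure 26.4.
PROJECT_MODEL = "(Name, Number, Location, DeptNo?, Workers)"
print(children_conform(PROJECT_MODEL, ["Name", "Number", "Location", "Workers"]))
print(children_conform(PROJECT_MODEL, ["Number", "Name", "Location", "Workers"]))
print(children_conform("(Worker*)", []))
print(children_conform("(project+)", []))
```

The first call succeeds because DeptNo is optional; the second fails because a DTD sequence fixes the order of the children; the last two show the difference between * (zero or more) and + (one or more).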

We can see that the tree structure in Figure 26.1 and the XML document in Figure 26.3 conform to the XML DTD in Figure 26.4. To require that an XML document be checked for conformance to a DTD, we must specify this in the declaration of the document. For example, we could change the first line in Figure 26.3 to the following:

<?xml version="1.0" standalone="no"?>

<!DOCTYPE projects SYSTEM "proj.dtd">

When the value of the standalone attribute in an XML document is "no", the document needs to be checked against a separate DTD document. The DTD file shown in Figure 26.4 should be stored in the same file system as the XML document, and should be given the file name "proj.dtd". Alternatively, we could include the DTD document text at the beginning of the XML document itself to allow the checking.

Although XML DTD is quite adequate for specifying tree structures with required, optional, and repeating elements, it has several limitations. First, the data types in DTD are not very general. Second, DTD has its own special syntax and thus requires specialized processors. It would be advantageous to specify XML schema documents using the syntax rules of XML itself so that the same processors used for XML documents could process XML schema descriptions. Third, all DTD elements are always forced to follow the specified ordering of the document, so unordered elements are not permitted. These drawbacks led to the development of XML schema, a more general language for specifying the structure and elements of XML documents.

26.3.2 XML Schema

The XML schema language is a standard for specifying the structure of XML documents. It uses the same syntax rules as regular XML documents, so that the same processors can be used on both. To distinguish the two types of documents, we will use the term XML instance document or XML document for a regular XML document, and XML schema document for a document that specifies an XML schema. Figure 26.5 shows an XML schema document corresponding to the COMPANY database shown in Figures 3.2 and 5.5. Although it is unlikely that we would want to display the whole database as a single document, there have been proposals to store data in native XML format as an alternative to storing the data in relational databases. The schema in Figure 26.5 would serve the purpose of specifying the structure of the COMPANY database if it were stored in a native XML system. We discuss this topic further in Section 26.4.

As with XML DTD, XML schema is based on the tree data model, with elements and attributes as the main structuring concepts. However, it borrows additional concepts from

<?xml version="1.0" encoding="UTF-8" ?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <xsd:annotation>
        <xsd:documentation xml:lang="en">Company Schema (Element Approach) - Prepared by Babak Hojabri</xsd:documentation>


    <xsd:complexType name="Department">
        <xsd:sequence>
            <xsd:element name="departmentName" type="xsd:string" />
            <xsd:element name="departmentNumber" type="xsd:string" />
            <xsd:element name="departmentManagerSSN" type="xsd:string" />
            <xsd:element name="departmentManagerStartDate" type="xsd:date" />
            <xsd:element name="departmentLocation" type="xsd:string"
            <xsd:element name="employeeName" type="Name" />
            <xsd:element name="employeeSSN" type="xsd:string" />
            <xsd:element name="employeeSex" type="xsd:string" />
            <xsd:element name="employeeSalary" type="xsd:unsignedInt" />
            <xsd:element name="employeeBirthDate" type="xsd:date" />
            <xsd:element name="employeeDepartmentNumber" type="xsd:string" />
            <xsd:element name="employeeSupervisorSSN" type="xsd:string" />
            <xsd:element name="employeeAddress" type="Address" />
            <xsd:element name="employeeWorksOn" type="WorksOn" minOccurs="1" maxOccurs="unbounded" />
            <xsd:element name="employeeDependent" type="Dependent" minOccurs="0" maxOccurs="unbounded" />
        </xsd:sequence>
    </xsd:complexType>
    <xsd:complexType name="Project">
        <xsd:sequence>
            <xsd:element name="projectName" type="xsd:string" />
            <xsd:element name="projectNumber" type="xsd:string" />
            <xsd:element name="projectLocation" type="xsd:string" />
            <xsd:element name="projectDepartmentNumber" type="xsd:string" />
            <xsd:element name="projectWorker" type="Worker" minOccurs="1" maxOccurs="unbounded" />
        </xsd:sequence>
    </xsd:complexType>
    <xsd:complexType name="Dependent">
        <xsd:sequence>
            <xsd:element name="dependentName" type="xsd:string" />
            <xsd:element name="dependentSex" type="xsd:string" />
            <xsd:element name="dependentBirthDate" type="xsd:date" />
            <xsd:element name="dependentRelationship" type="xsd:string" />
        </xsd:sequence>
    </xsd:complexType>
    <xsd:complexType name="Address">
        <xsd:sequence>
            <xsd:element name="number" type="xsd:string" />
            <xsd:element name="street" type="xsd:string" />
            <xsd:element name="city" type="xsd:string" />
            <xsd:element name="state" type="xsd:string" />
        </xsd:sequence>
    </xsd:complexType>
    <xsd:complexType name="Name">
        <xsd:sequence>
            <xsd:element name="firstName" type="xsd:string" />
            <xsd:element name="middleName" type="xsd:string" />
            <xsd:element name="lastName" type="xsd:string" />
        </xsd:sequence>
    </xsd:complexType>
    <xsd:complexType name="Worker">
        <xsd:sequence>
            <xsd:element name="SSN" type="xsd:string" />
            <xsd:element name="hours" type="xsd:float" />
        </xsd:sequence>
    </xsd:complexType>
    <xsd:complexType name="WorksOn">
        <xsd:sequence>
            <xsd:element name="projectNumber" type="xsd:string" />
            <xsd:element name="hours" type="xsd:float" />
        </xsd:sequence>
    </xsd:complexType>
</xsd:schema>

FIGURE 26.5 An XML schema file called company

database and object models, such as keys, references, and identifiers. We here describe the features of XML schema in a step-by-step manner, referring to the example XML schema document of Figure 26.5 for illustration. We introduce and describe some of the schema concepts in the order in which they are used in Figure 26.5.

1. Schema descriptions and XML namespaces: It is necessary to identify the specific set of XML schema language elements (tags) being used by specifying a file stored at a Web site location. The second line in Figure 26.5 specifies the file used in this example, which is "http://www.w3.org/2001/XMLSchema". This is the most commonly used standard for XML schema commands. Each such definition is called an XML namespace, because it defines the set of commands (names) that can be used. The file name is assigned to the variable xsd (XML schema description) using the attribute xmlns (XML namespace), and this variable is used as a prefix to all XML schema commands (tag names). For example, in Figure 26.5, when we write xsd:element or xsd:sequence, we are referring to the definitions of the element and sequence tags as defined in the file "http://www.w3.org/2001/XMLSchema".

2. Annotations, documentation, and language used: The next couple of lines in Figure 26.5 illustrate the XML schema elements (tags) xsd:annotation and xsd:documentation, which are used for providing comments and other descriptions in the XML document. The attribute xml:lang of the xsd:documentation element specifies the language being used, where "en" stands for the English language.


3. Elements and types: Next, we specify the root element of our XML schema. In XML schema, the name attribute of the xsd:element tag specifies the element name, which is called company for the root element in our example (see Figure 26.5). The structure of the company root element can then be specified, which in our example is xsd:complexType. This is further specified to be a sequence of departments, employees, and projects using the xsd:sequence structure of XML schema. It is important to note here that this is not the only way to specify an XML schema for the COMPANY database. We will discuss other options in Section 26.4.

4. First-level elements in the COMPANY database: Next, we specify the three first-level elements under the company root element in Figure 26.5. These elements are named employee, department, and project, and each is specified in an xsd:element tag. Notice that if a tag has only attributes and no further subelements or data within it, it can be ended with the slash symbol (/>) directly instead of having a separate matching end tag. These are called empty elements; examples are the xsd:element elements named department and project in Figure 26.5.

5. Specifying element type and minimum and maximum occurrences: In XML schema, the attributes type, minOccurs, and maxOccurs in the xsd:element tag specify the type and multiplicity of each element in any document that conforms to the schema specifications. If we specify a type attribute in an xsd:element, the structure of the element must be described separately, typically using the xsd:complexType element of XML schema. This is illustrated by the employee, department, and project elements in Figure 26.5. On the other hand, if no type attribute is specified, the element structure can be defined directly following the tag, as illustrated by the company root element in Figure 26.5. The minOccurs and maxOccurs attributes are used for specifying lower and upper bounds on the number of occurrences of an element in any document that conforms to the schema specifications. If they are not specified, the default is exactly one occurrence. These serve a similar role to the *, +, and ? symbols of XML DTD, and to the (min, max) constraints of the ER model (see Section 3.7.4).

6. Specifying keys: In XML schema, it is possible to specify constraints that correspond to unique and primary key constraints in a relational database (see Section 5.2.2), as well as foreign key (or referential integrity) constraints (see Section 5.2.4). The xsd:unique tag specifies elements that correspond to unique attributes in a relational database that are not primary keys. We can give each such uniqueness constraint a name, and we must specify xsd:selector and xsd:field tags for it to identify, via the xpath attribute, the element type that contains the unique element and the element name within it that is unique. This is illustrated by the departmentNameUnique and projectNameUnique elements in Figure 26.5. For specifying primary keys, the tag xsd:key is used instead of xsd:unique, as illustrated by the projectNumberKey, departmentNumberKey, and employeeSSNKey elements in Figure 26.5. For specifying foreign keys, the tag xsd:keyref is used, as illustrated by the six xsd:keyref elements in Figure 26.5. When specifying a foreign key, the attribute refer of the xsd:keyref tag specifies the referenced primary key, whereas the tags xsd:selector and xsd:field specify the referencing element type and foreign key (see Figure 26.5).


7. Specifying the structures of complex elements via complex types: The next part of our example specifies the structures of the complex elements Department, Employee, Project, and Dependent, using the tag xsd:complexType (see Figure 26.5). We specify each of these as a sequence of subelements corresponding to the database attributes of each entity type (see Figures 3.2 and 5.7) by using the xsd:sequence and xsd:element tags of XML schema. Each element is given a name and type via the attributes name and type of xsd:element. We can also specify minOccurs and maxOccurs attributes if we need to change the default of exactly one occurrence. For (optional) database attributes where null is allowed, we need to specify minOccurs="0", whereas for multivalued database attributes we need to specify maxOccurs="unbounded" on the corresponding element. Notice that if we were not going to specify any key constraints, we could have embedded the subelements within the parent element definitions directly without having to specify complex types. However, when unique, primary key, and foreign key constraints need to be specified, we must define complex types to specify the element structures.

8. Composite (compound) attributes: Composite attributes from Figure 3.2 are also specified as complex types in Figure 26.5, as illustrated by the Address, Name, Worker, and WorksOn complex types. These could have been directly embedded within their parent elements.
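The eight points above can be seen together in a much smaller schema. The following is a minimal sketch, not a reproduction of Figure 26.5: the names miniCompany, MiniDepartment, MiniEmployee, MiniName, and every element and constraint name inside them are hypothetical and chosen only for illustration. It also writes the namespace declaration in full as xmlns:xsd, which is how the xsd prefix is bound in practice.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative miniature schema; all names are assumptions, not those of Figure 26.5. -->
<!-- Point 1: bind the prefix xsd to the XML schema namespace. -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- Point 2: a comment for human readers, marked as English. -->
  <xsd:annotation>
    <xsd:documentation xml:lang="en">Miniature company schema (illustrative only)</xsd:documentation>
  </xsd:annotation>

  <!-- Point 3: the root element has no type attribute, so its structure follows inline. -->
  <xsd:element name="miniCompany">
    <xsd:complexType>
      <xsd:sequence>
        <!-- Points 4 and 5: empty elements that name a type and set occurrence bounds. -->
        <xsd:element name="department" type="MiniDepartment" minOccurs="0" maxOccurs="unbounded" />
        <xsd:element name="employee" type="MiniEmployee" minOccurs="0" maxOccurs="unbounded" />
      </xsd:sequence>
    </xsd:complexType>

    <!-- Point 6: a uniqueness constraint, a primary key, and a foreign key. -->
    <xsd:unique name="miniDepartmentNameUnique">
      <xsd:selector xpath="department" />
      <xsd:field xpath="departmentName" />
    </xsd:unique>
    <xsd:key name="miniDepartmentNumberKey">
      <xsd:selector xpath="department" />
      <xsd:field xpath="departmentNumber" />
    </xsd:key>
    <xsd:keyref name="miniEmployeeDepartmentKeyRef" refer="miniDepartmentNumberKey">
      <xsd:selector xpath="employee" />
      <xsd:field xpath="employeeDepartmentNumber" />
    </xsd:keyref>
  </xsd:element>

  <!-- Point 7: complex types describe the structure of each entity type. -->
  <xsd:complexType name="MiniDepartment">
    <xsd:sequence>
      <xsd:element name="departmentName" type="xsd:string" />
      <xsd:element name="departmentNumber" type="xsd:string" />
      <!-- Multivalued attribute: zero or more locations. -->
      <xsd:element name="location" type="xsd:string" minOccurs="0" maxOccurs="unbounded" />
    </xsd:sequence>
  </xsd:complexType>

  <xsd:complexType name="MiniEmployee">
    <xsd:sequence>
      <xsd:element name="employeeName" type="MiniName" />
      <xsd:element name="employeeSSN" type="xsd:string" />
      <!-- Optional attribute (null allowed): may be omitted. -->
      <xsd:element name="employeeBirthDate" type="xsd:date" minOccurs="0" />
      <xsd:element name="employeeDepartmentNumber" type="xsd:string" />
    </xsd:sequence>
  </xsd:complexType>

  <!-- Point 8: a composite attribute modeled as its own complex type. -->
  <xsd:complexType name="MiniName">
    <xsd:sequence>
      <xsd:element name="firstName" type="xsd:string" />
      <xsd:element name="lastName" type="xsd:string" />
    </xsd:sequence>
  </xsd:complexType>

</xsd:schema>
```

In this sketch the identity constraints are declared on the root element, after its complex type, so that the selector paths department and employee are evaluated relative to miniCompany. A document in which an employee names a department number that no department carries would violate the keyref constraint.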

This example illustrates some of the main features of XML schema. There are other features, but they are beyond the scope of our presentation. In the next section, we discuss the different approaches to creating XML documents from relational databases and storing XML documents.

26.4 XML DOCUMENTS AND DATABASES

We now discuss how various types of XML documents can be stored and retrieved. Section 26.4.1 gives an overview of the various approaches for storing XML documents. Section 26.4.2 discusses one of these approaches, in which data-centric XML documents are extracted from existing databases, in more detail. In particular, we show how tree-structured documents can be created from graph-structured databases. Section 26.4.3 discusses the problem of cycles and how it can be dealt with.

26.4.1 Approaches to Storing XML Documents

Several approaches to organizing the contents of XML documents to facilitate their subsequent querying and retrieval have been proposed. The following are the most common approaches:

1. Using a DBMS to store the documents as text: A relational or object DBMS can be used to store whole XML documents as text fields within the DBMS records or objects. This approach can be used if the DBMS has a special module for document processing, and would work for storing schemaless and document-centric XML documents. The keyword indexing functions of the document processing module (see Chapter 22) can be used to index and speed up search and retrieval of the documents.

2. Using a DBMS to store the document contents as data elements: This approach would work for storing a collection of documents that follow a specific XML DTD or XML schema. Because all the documents have the same structure, one can design a relational (or object) database to store the leaf-level data elements within the XML documents. This approach would require mapping algorithms to design a database schema that is compatible with the XML document structure as specified in the XML schema or DTD and to recreate the XML documents from the stored data. These algorithms can be implemented either as an internal DBMS module or as separate middleware that is not part of the DBMS.

3. Designing a specialized system for storing native XML data: A new type of database system based on the hierarchical (tree) model could be designed and implemented. The system would include specialized indexing and querying techniques, and would work for all types of XML documents. It could also include data compression techniques to reduce the size of the documents for storage.

4. Creating or publishing customized XML documents from preexisting relational databases: Because there are enormous amounts of data already stored in relational databases, parts of this data may need to be formatted as documents for exchanging or displaying over the Web. This approach would use a separate middleware software layer to handle the conversions needed between the XML documents and the relational database.

All four of these approaches have received considerable attention over the past few years. We focus on approach 4 in the next subsection, because it gives a good conceptual understanding of the differences between the XML tree data model and the traditional database models based on flat files (relational model) and graph representations (ER model).
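To make approach 2 more concrete, the fragment below is a minimal sketch of a data-centric document whose leaf-level elements line up with the columns of a single table. The element names, the table PROJECT_DATA, and the sample values are hypothetical and serve only to illustrate the idea; a real mapping algorithm would derive the table design from the XML schema or DTD.

```xml
<!-- Illustrative data-centric document; element names and values are assumptions. -->
<projects>
  <project>
    <projectNumber>1</projectNumber>
    <projectName>ProductX</projectName>
    <projectLocation>Bellaire</projectLocation>
  </project>
  <project>
    <projectNumber>2</projectNumber>
    <projectName>ProductY</projectName>
    <projectLocation>Sugarland</projectLocation>
  </project>
</projects>
<!--
  One possible relational layout for the leaf-level data (hypothetical table):

    PROJECT_DATA(projectNumber, projectName, projectLocation)
      (1, 'ProductX', 'Bellaire')
      (2, 'ProductY', 'Sugarland')

  Each project element becomes one row and each leaf element one column value.
  The inverse mapping rebuilds the document by wrapping every row in a project
  element and every column value in the leaf element of the same name.
-->
```

Because every project element has the same three leaves, the mapping is straightforward here. Documents with optional or repeating subelements would typically need nullable columns or additional tables, which is the kind of decision the mapping algorithms mentioned in approach 2 have to make.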


We will use the simplified UNIVERSITY ER schema shown in Figure 26.6 to illustrate our discussion. Suppose that an application needs to extract XML documents for student, course, and grade information from the UNIVERSITY database. The data needed for these documents is contained in the database attributes of the entity types COURSE, SECTION, and STUDENT from Figure 26.6, and the relationships S-S and C-S between them. In general, most documents extracted from a database will use only a subset of the attributes, entity types, and relationships in the database. In this example, the subset of the database that is needed is shown in Figure 26.7.
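As a rough illustration of what such an extracted document might look like, the sketch below arranges the three entity types as a tree with course at the top. This is only one possible hierarchy, and the element names and sample values are hypothetical, since the attributes of Figures 26.6 and 26.7 are not reproduced here.

```xml
<!-- Illustrative extracted document; element names and values are assumptions. -->
<courses>
  <course>
    <courseNumber>CS101</courseNumber>
    <courseName>Introduction to Databases</courseName>
    <!-- C-S relationship: the sections of this course are nested inside it. -->
    <section>
      <sectionNumber>1</sectionNumber>
      <year>2004</year>
      <quarter>Fall</quarter>
      <!-- S-S relationship: the students of this section, each with a grade. -->
      <student>
        <ssn>123-45-6789</ssn>
        <name>Sample Student A</name>
        <grade>A</grade>
      </student>
      <student>
        <ssn>987-65-4321</ssn>
        <name>Sample Student B</name>
        <grade>B</grade>
      </student>
    </section>
  </course>
</courses>
```

In a hierarchy like this, the grade is placed with the student inside a particular section, because it belongs to that student and section pair. A student who attends several sections would appear once under each of them, which is one consequence of presenting graph-structured data as a tree.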

FIGURE 26.6 An ER schema diagram for a simplified UNIVERSITY database (diagram labels include "Sections taught" and "Students attended").
