Distributed database design has been addressed in terms of horizontal and vertical fragmentation, allocation, and replication. Ceri et al. (1982) defined the concept of minterm horizontal fragments. Ceri et al. (1983) developed an integer programming based optimization model for horizontal fragmentation and allocation. Navathe et al. (1984) developed algorithms for vertical fragmentation based on attribute affinity and showed a variety of contexts for vertical fragment allocation. Wilson and Navathe (1986) present an analytical model for optimal allocation of fragments. Elmasri et al. (1987) discuss fragmentation for the ECR model; Karlapalem et al. (1994) discuss issues for distributed design of object databases. Navathe et al. (1996) discuss mixed fragmentation by combining horizontal and vertical fragmentation; Karlapalem et al. (1996) present a model for redesign of distributed databases.
Distributed query processing, optimization, and decomposition are discussed in Hevner and Yao (1979), Kerschberg et al. (1982), Apers et al. (1983), Ceri and Pelagatti (1984), and Bodorick et al. (1992). Bernstein and Goodman (1981) discuss the theory behind semijoin processing. Wong (1983) discusses the use of relationships in relation fragmentation. Concurrency control and recovery schemes are discussed in Bernstein and Goodman (1981a). Kumar and Hsu (1998) have some articles related to recovery in distributed databases. Elections in distributed systems are discussed in Garcia-Molina (1982). Lamport (1978) discusses problems with generating unique timestamps in a distributed system.
A concurrency control technique for replicated data that is based on voting is presented by Thomas (1979). Gifford (1979) proposes the use of weighted voting, and Paris (1986) describes a method called voting with witnesses. Jajodia and Mutchler (1990) discuss dynamic voting. A technique called available copy is proposed by Bernstein and Goodman (1984), and one that uses the idea of a group is presented in El Abbadi and Toueg (1988). Other recent work that discusses replicated data includes Gladney (1989), Agrawal and El Abbadi (1990), El Abbadi and Toueg (1990), Kumar and Segev (1993), Mukkamala (1989), and Wolfson and Milo (1991). Bassiouni (1988) discusses optimistic protocols for DDB concurrency control. Garcia-Molina (1983) and Kumar and Stonebraker (1987) discuss techniques that use the semantics of the transactions. Distributed concurrency control techniques based on locking and distinguished copies are presented by Menasce et al. (1980) and Minoura and Wiederhold (1982). Obermark (1982) presents algorithms for distributed deadlock detection.
A survey of recovery techniques in distributed systems is given by Kohler (1981). Reed (1983) discusses atomic actions on distributed data. A book edited by Bhargava (1987) presents various approaches and techniques for concurrency and reliability in distributed systems.
Federated database systems were first defined in McLeod and Heimbigner (1985). Techniques for schema integration in federated databases are presented by Elmasri et al. (1986), Batini et al. (1986), Hayne and Ram (1990), and Motro (1987). Elmagarmid and Helal (1988) and Gamal-Eldin et al. (1988) discuss the update problem in heterogeneous DDBSs. Heterogeneous distributed database issues are discussed in Hsiao and Kamel (1989). Sheth and Larson (1990) present an exhaustive survey of federated database management.
Recently, multidatabase systems and interoperability have become important topics. Techniques for dealing with semantic incompatibilities among multiple databases are examined in DeMichiel (1989), Siegel and Madnick (1991), Krishnamurthy et al. (1991), and Wang and Madnick (1989). Castano et al. (1998) present an excellent survey of techniques for analysis of schemas. Pitoura et al. (1995) discuss object orientation in multidatabase systems.
Transaction processing in multidatabases is discussed in Mehrotra et al. (1992), Georgakopoulos et al. (1991), Elmagarmid et al. (1990), and Brietbart et al. (1990), among others. Elmagarmid et al. (1992) discuss transaction processing for advanced applications, including engineering applications discussed in Heiler et al. (1992).
The workflow systems, which are becoming popular to manage information in complex organizations, use multilevel and nested transactions in conjunction with distributed databases. Weikum (1991) discusses multilevel transaction management. Alonso et al. (1997) discuss limitations of current workflow systems.
A number of experimental distributed DBMSs have been implemented. These include distributed INGRES (Epstein et al., 1978), DDTS (Devor and Weeldreyer, 1980), SDD-1 (Rothnie et al., 1980), System R* (Lindsay et al., 1984), SIRIUS-DELTA (Ferrier and Stangret, 1982), and MULTIBASE (Smith et al., 1981). The OMNIBASE system (Rusinkiewicz et al., 1988) and the Federated Information Base developed using the Candide data model (Navathe et al., 1994) are examples of federated DDBMS. Pitoura et al. (1995) present a comparative survey of the federated database system prototypes. Most commercial DBMS vendors have products using the client-server approach and offer distributed versions of their systems. Some system issues concerning client-server DBMS architectures are discussed in Carey et al. (1991), DeWitt et al. (1990), and Wang and Rowe (1991). Khoshafian et al. (1992) discuss design issues for relational DBMSs in the client-server environment. Client-server management issues are discussed in many books, such as Zantinge and Adriaans (1996).
EMERGING TECHNOLOGIES
XML and Internet Databases
We now turn our attention to how databases are used and accessed from the Internet. Many electronic commerce (e-commerce) and other Internet applications provide Web interfaces to access information stored in one or more databases. These databases are often referred to as data sources. It is common to use two-tier and three-tier client-server architectures for Internet applications (see Section 2.5). In some cases, other variations of the client-server model are used. E-commerce and other Internet database applications are designed to interact with the user through Web interfaces that display Web pages. The common method of specifying the contents and formatting of Web pages is through the use of hyperlink documents. There are various languages for writing these documents, the most common being HTML (Hypertext Markup Language). Although HTML is widely used for formatting and structuring Web documents, it is not suitable for specifying structured data that is extracted from databases. Recently, a new language, namely XML (Extensible Markup Language), has emerged as the standard for structuring and exchanging data over the Web. XML can be used to provide information about the structure and meaning of the data in the Web pages rather than just specifying how the Web pages are formatted for display on the screen. The formatting aspects are specified separately, for example, by using a formatting language such as XSL (Extensible Stylesheet Language).
This chapter describes the basics of accessing and exchanging information over the Internet. We start in Section 26.1 by discussing how traditional Web pages differ from structured databases, and discuss the differences between structured, semistructured, and unstructured data. Then in Section 26.2 we turn our attention to the XML standard and its tree-structured (hierarchical) data model. Section 26.3 discusses XML documents and the languages for specifying the structure of these documents, namely, XML DTD (Document Type Definition) and XML schema. Section 26.4 presents the various approaches for storing XML documents, whether in their native (text) format, in a compressed form, or in relational and other types of databases. Section 26.5 gives an overview of the languages proposed for querying XML data. Section 26.6 summarizes the chapter.
26.1 STRUCTURED, SEMISTRUCTURED, AND
UNSTRUCTURED DATA
The information stored in databases is known as structured data because it is represented in a strict format. For example, each record in a relational database table, such as the EMPLOYEE table in Figure 5.6, follows the same format as the other records in that table. For structured data, it is common to carefully design the database using techniques such as those described in Chapters 3, 4, 7, 10, and 11 in order to create the database schema. The DBMS then checks to ensure that all data follows the structures and constraints specified in the schema.
However, not all data is collected and inserted into carefully designed structured databases. In some applications, data is collected in an ad-hoc manner before it is known how it will be stored and managed. This data may have a certain structure, but not all the information collected will have identical structure. Some attributes may be shared among the various entities, but other attributes may exist only in a few entities. Moreover, additional attributes can be introduced in some of the newer data items at any time, and there is no predefined schema. This type of data is known as semistructured data. A number of data models have been introduced for representing semistructured data, often based on using tree or graph data structures rather than the flat relational model structures.
A key difference between structured and semistructured data concerns how the schema constructs (such as the names of attributes, relationships, and entity types) are handled. In semistructured data, the schema information is mixed in with the data values, since each data object can have different attributes that are not known in advance. Hence, this type of data is sometimes referred to as self-describing data. Consider the following example. We want to collect a list of bibliographic references related to a certain research project. Some of these may be books or technical reports, others may be research articles in journals or conference proceedings, and still others may refer to complete journal issues or conference proceedings. Clearly, each of these may have different attributes and different types of information. Even for the same type of reference, say, conference articles, we may have different information. For example, one article citation may be quite complete, with full information about author names, title, proceedings, page numbers, and so on, whereas another citation may not have all the information available. New types of bibliographic sources may appear in the future, for example, references to Web pages or to conference tutorials, and these may have new attributes that describe them.
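The bibliographic example can be made concrete with a small Python sketch. It is only an illustration: the field names, reference types, and values are all invented here, and a list of dictionaries is just one convenient way to hold self-describing records. The point it tries to show is that the attribute names travel with each record, so a program can only learn the "schema" by inspecting the data.

```python
# Hypothetical bibliographic references; every field name and value here is
# invented for illustration. Each record carries its own attribute names.
references = [
    {"type": "book", "authors": ["A. Author"], "title": "Some Book",
     "publisher": "Some Press", "year": 2000},
    {"type": "conference_article", "authors": ["B. Writer", "C. Coauthor"],
     "title": "Some Paper", "proceedings": "Some Conference", "pages": "10-20"},
    # An incomplete citation of the same type: fewer attributes, still accepted.
    {"type": "conference_article", "title": "Another Paper"},
    # A source type that appeared later, with an attribute nobody planned for.
    {"type": "web_page", "title": "Some Page", "url": "http://example.org/page"},
]


def discovered_attributes(records):
    """Collect attribute names per reference type by inspecting the data."""
    schema = {}
    for record in records:
        names = schema.setdefault(record["type"], set())
        names.update(key for key in record if key != "type")
    return schema


for ref_type, names in sorted(discovered_attributes(references).items()):
    print(ref_type, sorted(names))
```

Nothing rejects the incomplete conference article or the new web_page type, which is exactly the flexibility (and the lack of checking) that distinguishes this kind of data from records in a relational table.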
FIGURE 26.1 Representing semistructured data as a graph
Semistructured data may be displayed as a directed graph, as shown in Figure 26.1. The information shown in Figure 26.1 corresponds to some of the structured data shown in Figure 5.6. As we can see, this model somewhat resembles the object model (see Figure 20.1) in its ability to represent complex objects and nested structures. In Figure 26.1, the labels or tags on the directed edges represent the schema names: the names of attributes, object types (or entity types or classes), and relationships. The internal nodes represent individual objects or composite attributes. The leaf nodes represent actual data values of simple (atomic) attributes.
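As a rough sketch of this representation, the following Python fragment encodes a small tree-shaped graph in which dictionary keys play the role of edge labels, dictionaries are internal nodes, and plain values are leaf nodes. The labels and values are illustrative assumptions and are not copied from Figure 26.1; the helper name leaves is also hypothetical.

```python
# A tiny stand-in for the kind of graph described above. Edge labels are dict
# keys; internal nodes are dicts; leaf nodes are plain values. The labels and
# values are illustrative assumptions, not a copy of Figure 26.1.
company_projects = {
    "project": [
        {
            "Name": "ProductX",
            "Number": 1,
            "Location": "Bellaire",
            "worker": [
                {"SSN": "123456789", "hours": 32.5},
                {"SSN": "453453453", "FirstName": "Joyce", "hours": 20.0},
            ],
        },
    ],
}


def leaves(node, path=()):
    """Yield (edge-label path, value) for every leaf reachable from node."""
    if isinstance(node, dict):
        for label, child in node.items():
            yield from leaves(child, path + (label,))
    elif isinstance(node, list):
        for child in node:          # several edges that share one label
            yield from leaves(child, path)
    else:
        yield path, node


for path, value in leaves(company_projects):
    print("/".join(path), "=", value)
```

Note that the second worker has a FirstName edge and the first does not; the structure simply records whatever edges exist for each object.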
There are two main differences between the semistructured model and the object model that we discussed in Chapter 20:
1. The schema information, that is, the names of attributes, relationships, and classes (object types), in the semistructured model is intermixed with the objects and their data values in the same data structure.
2. In the semistructured model, there is no requirement for a predefined schema to which the data objects must conform.
In addition to structured and semistructured data, a third category exists, known as unstructured data because there is very limited indication of the type of data. A typical example is a text document that contains information embedded within it. Web pages in HTML that contain some data are considered to be unstructured data. Consider part of an HTML file, shown in Figure 26.2. Text that appears between angled brackets, <...>, is an HTML tag. A tag with a slash, </...>, indicates an end tag, which represents the ending of the effect of a matching start tag. The tags mark up the document¹ in order to instruct an HTML processor how to display the text between a start tag and a matching end tag. Hence, the tags specify document formatting rather than the meaning of the various data elements in the document. HTML tags specify information, such as font size and style (boldface, italics, and so on), color, heading levels in documents, and so on. Some tags provide text structuring in documents, such as specifying a numbered or unnumbered list or a table. Even these structuring tags specify that the embedded textual data is to be displayed in a certain manner, rather than indicating the type of data represented in the table.
<html>
<head>
</head>
<body>
<H1>List of company projects and the employees in each project</H1>
<H2>The ProductX project:</H2>
<table width="100%" border=0 cellpadding=0 cellspacing=0>
<TR>
<TD width="50%"><font size="2" face="Arial">John Smith:</font></TD>
<TD>32.5 hours per week</TD>
</TR>
<TR>
<TD width="50%"><font size="2" face="Arial">Joyce English:</font></TD>
<TD>20.0 hours per week</TD>
</TR>
</table>
<H2>The ProductY project:</H2>
<table width="100%" border=0 cellpadding=0 cellspacing=0>
<TR>
<TD width="50%"><font size="2" face="Arial">John Smith:</font></TD>
<TD>7.5 hours per week</TD>
</TR>
<TR>
<TD width="50%"><font size="2" face="Arial">Joyce English:</font></TD>
<TD>20.0 hours per week</TD>
</TR>
<TR>
<TD width="50%"><font size="2" face="Arial">Franklin Wong:</font></TD>
<TD>10.0 hours per week</TD>
</TR>
</table>
</body>
</html>
FIGURE 26.2 Part of an HTML document representing unstructured data
1. That is why it is known as Hypertext Markup Language.
HTML uses a large number of predefined tags, which are used to specify a variety of commands for formatting Web documents for display. The start and end tags specify the range of text to be formatted by each command. A few examples of the tags shown in Figure 26.2 follow:
• The <html> ... </html> tags specify the boundaries of the document.
• The document header information, within the <head> ... </head> tags, specifies various commands that will be used elsewhere in the document. For example, it may specify various script functions in a language such as JavaScript or Perl, or certain formatting styles (fonts, paragraph styles, header styles, and so on) that can be used in the document. It can also specify a title to indicate what the HTML file is for, and other similar information that will not be displayed as part of the document.
• The body of the document, specified within the <body> ... </body> tags, includes the document text and the markup tags that specify how the text is to be formatted and displayed. It can also include references to other objects, such as images, videos, voice messages, and other documents.
• The <H1> ... </H1> tags specify that the text is to be displayed as a level 1 heading. There are many heading levels (<H2>, <H3>, and so on), each displaying text in a less prominent heading format.
• The <table> ... </table> tags specify that the following text is to be displayed as a table. Each row in the table is enclosed within <TR> ... </TR> tags, and the actual text data in a row is displayed within <TD> ... </TD> tags.²
• Some tags may have attributes, which appear within the start tag and describe additional properties of the tag.³ In Figure 26.2, the <table> start tag has four attributes describing various characteristics of the table. The following <TD> and <font> start tags have one and two attributes, respectively.
HTML has a very large number of predefined tags, and whole books are devoted to describing how to use these tags. If designed properly, HTML documents can be formatted so that humans are able to easily understand the document contents, and are able to navigate through the resulting Web documents. However, the source HTML text documents are very difficult to interpret automatically by computer programs because they do not include schema information about the type of data in the documents. As e-commerce and other Internet applications become increasingly automated, it is becoming crucial to be able to exchange Web documents among various computer sites and to interpret their contents automatically. This need was one of the reasons that led to the development of XML, which we discuss in the next section.
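To illustrate why automatic interpretation is awkward, here is a minimal Python sketch that pulls the table cells out of a fragment written in the spirit of Figure 26.2, using the standard library's html.parser module. The class name CellCollector and the shortened fragment are assumptions made for this example. The parser recovers the text easily enough, but the meaning of each cell has to be guessed from its position.

```python
from html.parser import HTMLParser

# A fragment in the spirit of Figure 26.2 (shortened for the example).
PAGE = """
<h2>The ProductX project:</h2>
<table>
<tr><td><font size="2">John Smith:</font></td><td>32.5 hours per week</td></tr>
<tr><td><font size="2">Joyce English:</font></td><td>20.0 hours per week</td></tr>
</table>
"""


class CellCollector(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self.in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data


collector = CellCollector()
collector.feed(PAGE)
rows = collector.rows

# Nothing in the markup says "employee name" or "weekly hours"; the program
# has to assume column 0 is a name and parse a number out of column 1.
for name_cell, hours_cell in rows:
    print(name_cell.rstrip(":"), float(hours_cell.split()[0]))
```

If the page author reordered the columns or reworded "hours per week", this program would silently produce wrong results, since the tags carry no information about what the cells mean.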
2. <TR> stands for table row, and <TD> for table data.
3. This is how the term attribute is used in document markup languages, which differs from how it is used in database models.
26.2 XML HIERARCHICAL (TREE) DATA MODEL
We now introduce the data model used in XML. The basic object in XML is the XML document. Two main structuring concepts are used to construct an XML document: elements and attributes. It is important to note right away that the term attribute in XML is not used in the same manner as is customary in database terminology, but rather as it is used in document description languages such as HTML and SGML.⁴ Attributes in XML provide additional information that describes elements, as we shall see. There are additional concepts in XML, such as entities, identifiers, and references, but we first concentrate on describing elements and attributes to show the essence of the XML model.
Figure 26.3 shows an example of an XML element called <projects>. As in HTML, elements are identified in a document by their start tag and end tag. The tag names are enclosed between angled brackets <...>, and end tags are further identified by a slash, </...>.⁵ Complex elements are constructed from other elements hierarchically, whereas simple elements contain data values. A major difference between XML and HTML is that XML tag names are defined to describe the meaning of the data elements in the document, rather than to describe how the text is to be displayed. This makes it possible to process the data elements in the XML document automatically by computer programs.
It is straightforward to see the correspondence between the XML textual representation shown in Figure 26.3 and the tree structure shown in Figure 26.1. In the tree representation, internal nodes represent complex elements, whereas leaf nodes represent simple elements. That is why the XML model is called a tree model or a hierarchical model. In Figure 26.3, the simple elements are the ones with the tag names <Name>, <Number>, <Location>, <DeptNo>, <SSN>, <LastName>, <FirstName>, and <hours>. The complex elements are the ones with the tag names <projects>, <project>, and <Worker>. In general, there is no limit on the levels of nesting of elements.
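The distinction between complex and simple elements can be checked mechanically. The sketch below uses Python's xml.etree.ElementTree on a small document that reuses the tag names listed above; the document itself and its data values are assumptions written for this example and are not a copy of Figure 26.3.

```python
import xml.etree.ElementTree as ET

# A small document using the tag names listed above; the data values are
# illustrative assumptions.
DOC = """<?xml version="1.0" standalone="yes"?>
<projects>
  <project>
    <Name>ProductX</Name>
    <Number>1</Number>
    <Location>Bellaire</Location>
    <DeptNo>5</DeptNo>
    <Worker>
      <SSN>123456789</SSN>
      <LastName>Smith</LastName>
      <hours>32.5</hours>
    </Worker>
  </project>
</projects>
"""

root = ET.fromstring(DOC)

complex_tags, simple_tags = set(), set()
for element in root.iter():
    # len(element) is the number of child elements: internal node vs. leaf.
    (complex_tags if len(element) else simple_tags).add(element.tag)

print("complex:", sorted(complex_tags))
print("simple:", sorted(simple_tags))
```

Because the tag names describe the data, the program can look up a value such as hours by name instead of relying on its position on a page.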
In general, it is possible to characterize three main types of XML documents:
• Data-centric XML documents: These documents have many small data items that follow a specific structure and hence may be extracted from a structured database. They are formatted as XML documents in order to exchange them or display them over the Web.
• Document-centric XML documents: These are documents with large amounts of text, such as news articles or books. There are few or no structured data elements in these documents.
• Hybrid XML documents: These documents may have parts that contain structured data and other parts that are predominantly textual or unstructured.
It is important to note that data-centric XML documents can be considered either as semistructured data or as structured data. If an XML document conforms to a predefined XML schema or DTD (see Section 26.3), then the document can be considered as structured data. On the other hand, XML allows documents that do not conform to any schema; and these would be considered as semistructured data. The latter are also known as schemaless XML documents. When the value of the standalone attribute in an XML document is "yes", as in the first line of Figure 26.3, the document is standalone and schemaless.
FIGURE 26.3 A complex XML element called <projects>
4. SGML (Standard Generalized Markup Language) is a more general language for describing documents and provides capabilities for specifying new tags. However, it is more complex than HTML and XML.
5. The left and right angled bracket characters (< and >) are reserved characters, as are the ampersand (&), apostrophe ('), and double quotation mark ("). To include them within the text of a document, they must be encoded as &lt;, &gt;, &amp;, &apos;, and &quot;, respectively.
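The encoding of reserved characters mentioned in footnote 5 can be tried out with Python's xml.sax.saxutils.escape. The sample string below is an assumption chosen to contain all five reserved characters; the extra mapping for the two quotation characters is needed because escape handles only &, <, and > by default.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

raw = 'if a < b && name == "O\'Neil"'

# escape() always handles &, <, and >; the extra mapping adds the two quotes.
quotes = {"'": "&apos;", '"': "&quot;"}
encoded = escape(raw, quotes)
print(encoded)

# A parser turns the entity references back into the original characters.
element = ET.fromstring("<note>" + encoded + "</note>")
print(element.text == raw)
```

Without the encoding step, the < in the sample text would be read as the start of a tag and the document would be rejected by the parser.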
XML attributes are generally used in a manner similar to how they are used in HTML (see Figure 26.2), namely, to describe properties and characteristics of the elements (tags) within which they appear. It is also possible to use XML attributes to hold the values of simple data elements; however, this is definitely not recommended. We discuss XML attributes further in Section 26.3 when we discuss XML schema and DTD.
26.3 XML DOCUMENTS, DTD, AND XML SCHEMA
26.3.1 Well-Formed and Valid XML Documents and XML DTD
In Figure 26.3, we saw what a simple XML document may look like. An XML document is well formed if it follows a few conditions. In particular, it must start with an XML declaration to indicate the version of XML being used as well as any other relevant attributes, as shown in the first line of Figure 26.3. It must also follow the syntactic guidelines of the tree model. This means that there should be a single root element, and every element must include a matching pair of start and end tags within the start and end tags of the parent element. This ensures that the nested elements specify a well-formed tree structure.
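The tree conditions can be tested with any XML parser. The following sketch wraps Python's xml.etree.ElementTree in a hypothetical helper, is_well_formed, and feeds it one good document and two that break the rules above. It covers only the tree conditions: this particular parser also accepts documents that omit the XML declaration, so the declaration is not checked here.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


good = "<projects><project><Name>ProductX</Name></project></projects>"
# The end tags of <Name> and <project> are crossed, so nesting is broken.
crossed = "<projects><project><Name>ProductX</project></Name></projects>"
# Two top-level elements, so there is no single root element.
two_roots = "<project/><project/>"

for label, text in [("good", good), ("crossed", crossed), ("two_roots", two_roots)]:
    print(label, is_well_formed(text))
```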
A well-formed XML document is syntactically correct. This allows it to be processed by generic processors that traverse the document and create an internal tree representation. A standard set of API (application programming interface) functions called DOM (Document Object Model) allows programs to manipulate the resulting tree representation corresponding to a well-formed XML document. However, the whole document must be parsed beforehand when using DOM. Another API called SAX allows processing of XML documents on the fly by notifying the processing program whenever a start or end tag is encountered. This makes it easier to process large documents and allows for processing of so-called streaming XML documents, where the processing program can process the tags as they are encountered.
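The contrast between the two APIs can be seen in a short sketch that extracts the same values both ways, using the xml.dom.minidom and xml.sax modules of the Python standard library. The sample document and the handler class NameHandler are assumptions made for this example.

```python
import xml.dom.minidom
import xml.sax

DOC = ("<projects>"
       "<project><Name>ProductX</Name></project>"
       "<project><Name>ProductY</Name></project>"
       "</projects>")

# DOM style: parse everything first, then navigate the in-memory tree.
dom = xml.dom.minidom.parseString(DOC)
dom_names = [node.firstChild.data for node in dom.getElementsByTagName("Name")]


class NameHandler(xml.sax.ContentHandler):
    """SAX style: react to start tags, end tags, and text as they arrive."""

    def __init__(self):
        super().__init__()
        self.names = []
        self.in_name = False

    def startElement(self, name, attrs):
        if name == "Name":
            self.in_name = True
            self.names.append("")

    def endElement(self, name):
        if name == "Name":
            self.in_name = False

    def characters(self, content):
        if self.in_name:
            self.names[-1] += content


handler = NameHandler()
xml.sax.parseString(DOC.encode("utf-8"), handler)
sax_names = handler.names

print(dom_names, sax_names)
```

The DOM version is shorter because the tree is already built when the query runs; the SAX version has to keep its own state (in_name) but never holds more than the current piece of the document, which is why it suits large or streaming inputs.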
A well-formed XML document can have any tag names for the elements within the document. There is no predefined set of elements (tag names) that a program processing the document knows to expect. This gives the document creator the freedom to specify new elements, but limits the possibilities for automatically interpreting the elements within the document.
<!DOCTYPE projects [
<!ELEMENT projects (project+)>
<!ELEMENT project (Name, Number, Location, DeptNo?, Workers)>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Number (#PCDATA)>
<!ELEMENT Location (#PCDATA)>
<!ELEMENT DeptNo (#PCDATA)>
<!ELEMENT Workers (Worker*)>
<!ELEMENT Worker (SSN, LastName?, FirstName?, hours)>
<!ELEMENT SSN (#PCDATA)>
<!ELEMENT LastName (#PCDATA)>
<!ELEMENT FirstName (#PCDATA)>
<!ELEMENT hours (#PCDATA)>
]>
FIGURE 26.4 An XML DTD file called projects
A stronger criterion is for an XML document to be valid. In this case, the document must be well formed, and in addition the element names used in the start and end tag pairs must follow the structure specified in a separate XML DTD (Document Type Definition) file or XML schema file. We first discuss XML DTD here, then give an overview of XML schema in Section 26.3.2. Figure 26.4 shows a simple XML DTD file, which specifies the elements (tag names) and their nested structures. Any valid documents conforming to this DTD should follow the specified structure. A special syntax exists for specifying DTD files, as illustrated in Figure 26.4. First, a name is given to the root tag of the document, which is called projects in the first line of Figure 26.4. Then the elements and their nested structure are specified.
When specifying elements, the following notation is used:
• A * following the element name means that the element can be repeated zero or more times in the document. This kind of element is known as an optional multivalued (repeating) element.
• A + following the element name means that the element can be repeated one or more times in the document. This kind of element is a required multivalued (repeating) element.
• A ? following the element name means that the element can be repeated zero or one times. This kind is an optional single-valued (nonrepeating) element.
• An element appearing without any of the preceding three symbols must appear exactly once in the document. This kind is a required single-valued (nonrepeating) element.
• The type of the element is specified via parentheses following the element. If the parentheses include names of other elements, these latter elements are the children of the element in the tree structure. If the parentheses include the keyword #PCDATA or one of the other data types available in XML DTD, the element is a leaf node. PCDATA stands for parsed character data, which is roughly similar to a string data type.
• Parentheses can be nested when specifying elements.
• A bar symbol (e1 | e2) specifies that either e1 or e2 can appear in the document.
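The notation above is close to regular-expression syntax, which suggests a simple way to experiment with it. The sketch below is not a DTD validator and is not how real validating parsers work; it is a hedged illustration, with hypothetical helper names (content_model_to_regex, children_match), that translates one element content model into a Python regular expression and matches it against the sequence of child tags. It handles only models made of element names, commas, bars, parentheses, and the *, +, and ? symbols; #PCDATA and attribute declarations are outside its scope. The xml.etree.ElementTree parser used here does not validate against a DTD on its own.

```python
import re
import xml.etree.ElementTree as ET


def content_model_to_regex(model):
    """Translate a DTD element content model into a compiled regex.

    Each child tag is rendered as "tag " (name plus one space), so a sequence
    of children becomes a string the regex can match in full.
    """
    parts = []
    for token in re.findall(r"[A-Za-z_][\w.-]*|[(),|*+?]", model):
        if token == "(":
            parts.append("(?:")
        elif token == ",":
            continue                      # a sequence is plain concatenation
        elif token in {")", "|", "*", "+", "?"}:
            parts.append(token)           # same meaning in regex syntax
        else:
            parts.append("(?:" + re.escape(token) + " )")
    return re.compile("".join(parts))


def children_match(element, model):
    """Check the element's child tags against the content model."""
    child_string = "".join(child.tag + " " for child in element)
    return content_model_to_regex(model).fullmatch(child_string) is not None


PROJECT_MODEL = "(Name, Number, Location, DeptNo?, Workers)"
ok = ET.fromstring("<project><Name/><Number/><Location/><Workers/></project>")
bad = ET.fromstring("<project><Name/><Location/><Number/><Workers/></project>")

print(children_match(ok, PROJECT_MODEL))     # DeptNo omitted, order respected
print(children_match(bad, PROJECT_MODEL))    # Number and Location swapped
print(children_match(ET.fromstring("<Workers/>"), "(Worker*)"))
print(children_match(ET.fromstring("<projects/>"), "(project+)"))
```

The second document fails only because two children are out of order, which shows that a DTD sequence fixes the ordering of child elements as well as their presence.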
We can see that the tree structure in Figure 26.1 and the XML document in Figure 26.3 conform to the XML DTD in Figure 26.4. To require that an XML document be checked for conformance to a DTD, we must specify this in the declaration of the document. For example, we could change the first line in Figure 26.3 to the following:
<?xml version="1.0" standalone="no"?>
<!DOCTYPE projects SYSTEM "proj.dtd">
When the value of the standalone attribute in an XML document is "no", the document needs to be checked against a separate DTD document. The DTD file shown in Figure 26.4 should be stored in the same file system as the XML document, and should be given the file name "proj.dtd". Alternatively, we could include the DTD document text at the beginning of the XML document itself to allow the checking.
Although XML DTD is quite adequate for specifying tree structures with required, optional, and repeating elements, it has several limitations. First, the data types in DTD are not very general. Second, DTD has its own special syntax and thus requires specialized processors. It would be advantageous to specify XML schema documents using the syntax rules of XML itself so that the same processors used for XML documents could process XML schema descriptions. Third, all DTD elements are always forced to follow the specified ordering of the document, so unordered elements are not permitted. These drawbacks led to the development of XML schema, a more general language for specifying the structure and elements of XML documents.
26.3.2 XML Schema
The XML schema language is a standard for specifying the structure of XML documents. It uses the same syntax rules as regular XML documents, so that the same processors can be used on both. To distinguish the two types of documents, we will use the term XML instance document or XML document for a regular XML document, and XML schema document for a document that specifies an XML schema. Figure 26.5 shows an XML schema document corresponding to the COMPANY database shown in Figures 3.2 and 5.5. Although it is unlikely that we would want to display the whole database as a single document, there have been proposals to store data in native XML format as an alternative to storing the data in relational databases. The schema in Figure 26.5 would serve the purpose of specifying the structure of the COMPANY database if it were stored in a native XML system. We discuss this topic further in Section 26.4.
As with XML DTD, XML schema is based on the tree data model, with elements and attributes as the main structuring concepts. However, it borrows additional concepts from
<?xml version="1.0" encoding="UTF-8" ?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:annotation>
<xsd:documentation xml:lang="en">Company Schema (Element Approach)
-Prepared by Babak Hojabri</xsd:documentation>
<xsd:complexType name="Department">
<xsd:sequence>
<xsd:element name="departmentName" type="xsd:string" />
<xsd:element name="departmentNumber" type="xsd:string" />
<xsd:element name="departmentManagerSSN" type="xsd:string" />
<xsd:element name="departmentManagerStartDate" type="xsd:date" />
<xsd:element name="departmentLocation" type="xsd:string"
<xsd:element name="employeeName" type="Name" />
<xsd:element name="employeeSSN" type="xsd:string" />
<xsd:element name="employeeSex" type="xsd:string" />
<xsd:element name="employeeSalary" type="xsd:unsignedInt" />
<xsd:element name="employeeBirthDate" type="xsd:date" />
<xsd:element name="employeeDepartmentNumber" type="xsd:string" />
<xsd:element name="employeeSupervisorSSN" type="xsd:string" />
<xsd:element name="employeeAddress" type="Address" />
<xsd:element name="employeeWorksOn" type="WorksOn" minOccurs="1" maxOccurs="unbounded" />
<xsd:element name="employeeDependent" type="Dependent" minOccurs="0" maxOccurs="unbounded" />
</xsd:sequence>
</xsd:complexType>
<xsd:complexType name="Project">
<xsd:sequence>
<xsd:element name="projectName" type="xsd:string" />
<xsd:element name="projectNumber" type="xsd:string" />
<xsd:element name="projectLocation" type="xsd:string" />
<xsd:element name="projectDepartmentNumber" type="xsd:string" />
<xsd:element name="projectWorker" type="Worker" minOccurs="1" maxOccurs="unbounded" />
</xsd:sequence>
</xsd:complexType>
<xsd:complexType name="Dependent">
<xsd:sequence>
<xsd:element name="dependentName" type="xsd:string" />
<xsd:element name="dependentSex" type="xsd:string" />
<xsd:element name="dependentBirthDate" type="xsd:date" />
<xsd:element name="dependentRelationship" type="xsd:string" />
</xsd:sequence>
</xsd:complexType>
<xsd:complexType name="Address">
<xsd:sequence>
<xsd:element name="number" type="xsd:string" />
<xsd:element name="street" type="xsd:string" />
<xsd:element name="city" type="xsd:string" />
<xsd:element name="state" type="xsd:string" />
</xsd:sequence>
</xsd:complexType>
<xsd:complexType name="Name">
<xsd:sequence>
<xsd:element name="firstName" type="xsd:string" />
<xsd:element name="middleName" type="xsd:string" />
<xsd:element name="lastName" type="xsd:string" />
</xsd:sequence>
</xsd:complexType>
<xsd:complexType name="Worker">
<xsd:sequence>
<xsd:element name="SSN" type="xsd:string" />
<xsd:element name="hours" type="xsd:float" />
</xsd:sequence>
</xsd:complexType>
<xsd:complexType name="WorksOn">
<xsd:sequence>
<xsd:element name="projectNumber" type="xsd:string" />
<xsd:element name="hours" type="xsd:float" />
</xsd:sequence>
</xsd:complexType>
</xsd:schema>
FIGURE 26.5 (continued) An XML schema file called company
database and object models, such as keys, references, and identifiers. We here describe the features of XML schema in a step-by-step manner, referring to the example XML schema document of Figure 26.5 for illustration. We introduce and describe some of the schema concepts in the order in which they are used in Figure 26.5.
1. Schema descriptions and XML namespaces: It is necessary to identify the specific set
of XML schema language elements (tags) being used by specifying a file stored at a
Web site location. The second line in Figure 26.5 specifies the file used in this
example, which is "http://www.w3.org/2001/XMLSchema". This is the most
commonly used standard for XML schema commands. Each such definition is
called an XML namespace, because it defines the set of commands (names) that
can be used. The file name is assigned to the variable xsd (XML schema
description) using the attribute xmlns (XML namespace), and this variable is used as a
prefix to all XML schema commands (tag names). For example, in Figure 26.5,
when we write xsd:element or xsd:sequence, we are referring to the definitions
of the element and sequence tags as defined in the file
"http://www.w3.org/2001/XMLSchema".
2. Annotations, documentation, and language used: The next couple of lines in Figure
26.5 illustrate the XML schema elements (tags) xsd:annotation and
xsd:documentation, which are used for providing comments and other
descriptions in the XML document. The attribute xml:lang of the xsd:documentation
element specifies the language being used, where "en" stands for the English
language.
3. Elements and types: Next, we specify the root element of our XML schema. In XML
schema, the name attribute of the xsd:element tag specifies the element name,
which is called company for the root element in our example (see Figure 26.5).
The structure of the company root element can then be specified, which in our
example is xsd:complexType. This is further specified to be a sequence of
departments, employees, and projects using the xsd:sequence structure of XML schema.
It is important to note here that this is not the only way to specify an XML schema
for the COMPANY database. We will discuss other options in Section 26.4.
4. First-level elements in the COMPANY database: Next, we specify the three first-level
elements under the company root element in Figure 26.5. These elements are named
employee, department, and project, and each is specified in an xsd:element tag.
Notice that if a tag has only attributes and no further subelements or data within
it, it can be ended with the slash symbol (/>) directly instead of having a
separate matching end tag. These are called empty elements; examples are the
xsd:element elements named department and project in Figure 26.5.
5. Specifying element type and minimum and maximum occurrences: In XML schema, the
attributes type, minOccurs, and maxOccurs in the xsd:element tag specify the
type and multiplicity of each element in any document that conforms to the
schema specifications. If we specify a type attribute in an xsd:element, the
structure of the element must be described separately, typically using the
xsd:complexType element of XML schema. This is illustrated by the employee,
department, and project elements in Figure 26.5. On the other hand, if no type
attribute is specified, the element structure can be defined directly following the
tag, as illustrated by the company root element in Figure 26.5. The minOccurs and
maxOccurs attributes are used for specifying lower and upper bounds on the number of
occurrences of an element in any document that conforms to the schema
specifications. If they are not specified, the default is exactly one occurrence. These
serve a similar role to the *, +, and ? symbols of XML DTD, and to the (min, max)
constraints of the ER model (see Section 3.7.4).
6. Specifying keys: In XML schema, it is possible to specify constraints that correspond
to unique and primary key constraints in a relational database (see Section 5.2.2),
as well as foreign keys (or referential integrity) constraints (see Section 5.2.4).
The xsd:unique tag specifies elements that correspond to unique attributes in a
relational database that are not primary keys. We can give each such uniqueness
constraint a name, and we must specify xsd:selector and xsd:field tags for it
to identify the element type that contains the unique element and the element
name within it that is unique via the xpath attribute. This is illustrated by the
departmentNameUnique and projectNameUnique elements in Figure 26.5. For
specifying primary keys, the tag xsd:key is used instead of xsd:unique, as
illustrated by the projectNumberKey, departmentNumberKey, and employeeSSNKey
elements in Figure 26.5. For specifying foreign keys, the tag xsd:keyref is used,
as illustrated by the six xsd:keyref elements in Figure 26.5. When specifying a
foreign key, the attribute refer of the xsd:keyref tag specifies the referenced
primary key, whereas the tags xsd:selector and xsd:field specify the
referencing element type and foreign key (see Figure 26.5).
7. Specifying the structures of complex elements via complex types: The next part of our
example specifies the structures of the complex elements Department, Employee,
Project, and Dependent, using the tag xsd:complexType (see Figure 26.5). We
specify each of these as a sequence of subelements corresponding to the database
attributes of each entity type (see Figures 3.2 and 5.7) by using the xsd:sequence
and xsd:element tags of XML schema. Each element is given a name and type via
the attributes name and type of xsd:element. We can also specify minOccurs and
maxOccurs attributes if we need to change the default of exactly one occurrence.
For (optional) database attributes where null is allowed, we need to specify
minOccurs = "0", whereas for multivalued database attributes we need to specify
maxOccurs = "unbounded" on the corresponding element. Notice that if we were
not going to specify any key constraints, we could have embedded the subelements
within the parent element definitions directly without having to specify complex
types. However, when unique, primary key, and foreign key constraints need to be
specified, we must define complex types to specify the element structures.
8. Composite (compound) attributes: Composite attributes from Figure 3.2 are also
specified as complex types in Figure 26.5, as illustrated by the Address, Name,
Worker, and WorksOn complex types. These could have been directly embedded
within their parent elements.
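The namespace prefix of item 1 and the occurrence defaults of items 5 and 7 can be checked mechanically. The following Python sketch uses only the standard library; the schema fragment is a hypothetical one written in the style of Figure 26.5 (its element names are assumptions, not a copy of the figure), and the helper occurrence_bounds is introduced here purely for illustration. It shows that the xsd prefix is shorthand for the namespace URI and that a missing minOccurs or maxOccurs means exactly one occurrence.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical fragment in the style of Figure 26.5 (names are assumptions).
SCHEMA_TEXT = """
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:complexType name="Employee">
    <xsd:sequence>
      <xsd:element name="employeeSSN" type="xsd:string" />
      <xsd:element name="employeeSupervisorSSN" type="xsd:string" minOccurs="0" />
      <xsd:element name="employeeDependent" type="Dependent"
                   minOccurs="0" maxOccurs="unbounded" />
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>
"""


def occurrence_bounds(schema_text):
    """Return {element name: (min, max)}, applying the default of 1 to each bound.

    A max of None stands for maxOccurs="unbounded".
    """
    root = ET.fromstring(schema_text)
    bounds = {}
    # The parser expands the xsd: prefix to the namespace URI it was bound to.
    for elem in root.iter(f"{{{XSD_NS}}}element"):
        low = int(elem.get("minOccurs", "1"))
        high = elem.get("maxOccurs", "1")
        bounds[elem.get("name")] = (low, None if high == "unbounded" else int(high))
    return bounds


root_tag = ET.fromstring(SCHEMA_TEXT).tag
bounds = occurrence_bounds(SCHEMA_TEXT)

print(root_tag)
for name, (low, high) in bounds.items():
    print(name, low, "unbounded" if high is None else high)
```

In this sketch the optional attribute (null allowed) has a lower bound of 0, and the multivalued one has no upper bound, which mirrors the two cases described in item 7.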
This example illustrates some of the main features of XML schema. There are other
features, but they are beyond the scope of our presentation. In the next section, we discuss
the different approaches to creating XML documents from relational databases and storing
XML documents.
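Before leaving XML schema, the key constraints of item 6 can be made concrete with a small sketch. The Python code below does not run an XML schema validator; it imitates, under simplifying assumptions, what an xsd:key and an xsd:keyref constraint require of a document. The instance document, its element names, and the helper functions are hypothetical, and the selector and field arguments are plain element paths rather than full XPath expressions.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance document; element names are illustrative only.
DOC = """
<company>
  <department>
    <departmentNumber>5</departmentNumber>
    <departmentName>Research</departmentName>
  </department>
  <department>
    <departmentNumber>4</departmentNumber>
    <departmentName>Administration</departmentName>
  </department>
  <employee>
    <employeeSSN>123456789</employeeSSN>
    <employeeDepartmentNumber>5</employeeDepartmentNumber>
  </employee>
  <employee>
    <employeeSSN>987654321</employeeSSN>
    <employeeDepartmentNumber>7</employeeDepartmentNumber>
  </employee>
</company>
"""


def field_values(root, selector, field):
    """Collect the text of `field` inside every element matched by `selector`."""
    return [node.findtext(field) for node in root.findall(selector)]


def key_violations(root, selector, field):
    """Key-like check: each selected element needs a value, and no value may repeat."""
    seen, problems = set(), []
    for value in field_values(root, selector, field):
        if value is None or value in seen:
            problems.append(value)
        seen.add(value)
    return problems


def keyref_violations(root, selector, field, key_selector, key_field):
    """Keyref-like check: each referencing value must equal some key value."""
    keys = set(field_values(root, key_selector, key_field))
    return [v for v in field_values(root, selector, field)
            if v is not None and v not in keys]


root = ET.fromstring(DOC)
dup_keys = key_violations(root, "department", "departmentNumber")
dangling = keyref_violations(root, "employee", "employeeDepartmentNumber",
                             "department", "departmentNumber")
print(dup_keys)
print(dangling)
```

Here the second employee refers to a department number that no department element carries, so the keyref-like check reports it, in the same way that a foreign key violation would be reported in a relational database.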
26.4 XML DOCUMENTS AND DATABASES
We now discuss how various types of XML documents can be stored and retrieved. Section
26.4.1 gives an overview of the various approaches for storing XML documents. Section
26.4.2 discusses one of these approaches, in which data-centric XML documents are
extracted from existing databases, in more detail. In particular, we show how
tree-structured documents can be created from graph-structured databases. Section 26.4.3 discusses
the problem of cycles and how it can be dealt with.
26.4.1 Approaches to Storing XML Documents
Several approaches to organizing the contents of XML documents to facilitate their
subsequent querying and retrieval have been proposed. The following are the most common
approaches:
1. Using a DBMS to store the documents as text: A relational or object DBMS can be
used to store whole XML documents as text fields within the DBMS records or
objects. This approach can be used if the DBMS has a special module for document
processing, and would work for storing schemaless and document-centric XML
documents. The keyword indexing functions of the document processing module
(see Chapter 22) can be used to index and speed up search and retrieval of the
documents.
2. Using a DBMS to store the document contents as data elements: This approach would
work for storing a collection of documents that follow a specific XML DTD or XML
schema. Because all the documents have the same structure, one can design a
relational (or object) database to store the leaf-level data elements within the
XML documents. This approach would require mapping algorithms to design a
database schema that is compatible with the XML document structure as specified
in the XML schema or DTD and to recreate the XML documents from the stored
data. These algorithms can be implemented either as an internal DBMS module or
as separate middleware that is not part of the DBMS.
3. Designing a specialized system for storing native XML data: A new type of database
system based on the hierarchical (tree) model could be designed and implemented.
The system would include specialized indexing and querying techniques,
and would work for all types of XML documents. It could also include data
compression techniques to reduce the size of the documents for storage.
4. Creating or publishing customized XML documents from preexisting relational databases:
Because there are enormous amounts of data already stored in relational databases,
parts of this data may need to be formatted as documents for exchanging or
displaying over the Web. This approach would use a separate middleware software
layer to handle the conversions needed between the XML documents and the
relational database.
All four of these approaches have received considerable attention over the past few
years. We focus on approach 4 in the next subsection, because it gives a good conceptual
understanding of the differences between the XML tree data model and the traditional
database models based on flat files (relational model) and graph representations (ER
model).
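Before turning to approach 4, the contrast between approaches 1 and 2 can be sketched in a few lines. The Python code below uses the standard sqlite3 module as a stand-in for a relational DBMS; the document, the table names, and the column names are all assumptions made for illustration, and the hand-written mapping stands in for the mapping algorithms that approach 2 would require.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical document and table layout, chosen only for illustration.
DOC = ("<project><projectName>ProductX</projectName>"
       "<projectNumber>1</projectNumber></project>")

conn = sqlite3.connect(":memory:")

# Approach 1: the whole document is kept as one text field of a record.
conn.execute("CREATE TABLE xml_docs (doc_id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO xml_docs (body) VALUES (?)", (DOC,))

# Approach 2: the leaf-level data elements are mapped to columns of a relation.
conn.execute(
    "CREATE TABLE project (project_number TEXT PRIMARY KEY, project_name TEXT)")
root = ET.fromstring(DOC)
conn.execute("INSERT INTO project VALUES (?, ?)",
             (root.findtext("projectNumber"), root.findtext("projectName")))

# Approach 1 returns the stored text unchanged.
stored_text = conn.execute("SELECT body FROM xml_docs").fetchone()[0]

# Approach 2 must recreate the document from the stored data elements.
number, name = conn.execute(
    "SELECT project_number, project_name FROM project").fetchone()
rebuilt = ET.Element("project")
ET.SubElement(rebuilt, "projectName").text = name
ET.SubElement(rebuilt, "projectNumber").text = number
rebuilt_text = ET.tostring(rebuilt, encoding="unicode")

print(stored_text == DOC, rebuilt_text == DOC)
```

The first approach needs no knowledge of the document structure, whereas the second depends on every document following the same structure, which is why it suits collections that conform to a specific DTD or XML schema.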
We will use the simplified UNIVERSITY ER schema shown in Figure 26.6 to illustrate our
discussion. Suppose that an application needs to extract XML documents for student,
course, and grade information from the UNIVERSITY database. The data needed for these
documents is contained in the database attributes of the entity types COURSE, SECTION, and
STUDENT from Figure 26.6, and the relationships S-S and C-S between them. In general,
most documents extracted from a database will only use a subset of the attributes, entity
types, and relationships in the database. In this example, the subset of the database that is
needed is shown in Figure 26.7.
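As a rough sketch of such an extraction, the Python code below builds one possible tree-structured document from a small relational version of this subset. Everything specific in it is an assumption: the table and column names are hypothetical (the attributes of Figures 26.6 and 26.7 are not reproduced here), sqlite3 stands in for the DBMS, and placing course at the root is only one of several hierarchies that could be chosen.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
# Hypothetical tables and columns standing in for COURSE, SECTION, STUDENT,
# and the relationship between students and the sections they attended.
conn.executescript("""
CREATE TABLE course   (course_no TEXT PRIMARY KEY, name TEXT);
CREATE TABLE section  (section_no INTEGER PRIMARY KEY, course_no TEXT, year INTEGER);
CREATE TABLE student  (ssn TEXT PRIMARY KEY, name TEXT);
CREATE TABLE attended (ssn TEXT, section_no INTEGER, grade TEXT);
INSERT INTO course   VALUES ('CS101', 'Databases');
INSERT INTO section  VALUES (1, 'CS101', 2004), (2, 'CS101', 2005);
INSERT INTO student  VALUES ('111', 'Smith'), ('222', 'Wong');
INSERT INTO attended VALUES ('111', 1, 'A'), ('222', 1, 'B'), ('111', 2, 'A');
""")


def course_document(conn):
    """Build a tree with courses at the top, sections below, students as leaves."""
    root = ET.Element("courses")
    courses = conn.execute(
        "SELECT course_no, name FROM course ORDER BY course_no").fetchall()
    for course_no, name in courses:
        course = ET.SubElement(root, "course", number=course_no)
        ET.SubElement(course, "name").text = name
        sections = conn.execute(
            "SELECT section_no, year FROM section "
            "WHERE course_no = ? ORDER BY section_no", (course_no,)).fetchall()
        for section_no, year in sections:
            section = ET.SubElement(course, "section",
                                    number=str(section_no), year=str(year))
            students = conn.execute(
                "SELECT s.ssn, s.name, a.grade FROM attended a "
                "JOIN student s ON s.ssn = a.ssn "
                "WHERE a.section_no = ? ORDER BY s.ssn", (section_no,)).fetchall()
            for ssn, student_name, grade in students:
                student = ET.SubElement(section, "student", ssn=ssn)
                ET.SubElement(student, "name").text = student_name
                ET.SubElement(student, "grade").text = grade
    return root


doc = course_document(conn)
print(ET.tostring(doc, encoding="unicode"))
```

Note that the student with ssn 111 attended two sections and therefore appears twice in the tree. A graph-structured database stores that student once, whereas a hierarchical document has to repeat the student under each parent section.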
FIGURE 26.6 An ER schema diagram for a simplified UNIVERSITY database (labels in the diagram include "Sections taught" and "Students attended").