
Querying XML: XQuery, XPath, and SQL/XML in Context

The Morgan Kaufmann Series in Data Management Systems

Series Editor: Jim Gray, Microsoft Research

Querying XML: XQuery, XPath, and SQL/XML in Context

Jim Melton and Stephen Buxton

Data Mining: Concepts and Techniques, Second Edition

Jiawei Han and Micheline Kamber

Database Modeling and Design: Logical Design,

Joe Celko's SQL for Smarties: Advanced SQL Programming, Third Edition

Joe Celko

Moving Objects Databases

Ralf Hartmut Güting and Markus Schneider

Joe Celko's SQL Programming Style

Joe Celko

Data Mining, Second Edition: Concepts and Techniques

Ian Witten and Eibe Frank

Fuzzy Modeling and Genetic Algorithms for Data Mining and Exploration

Earl Cox

Data Modeling Essentials, Third Edition

Graeme C Simsion and Graham C Witt

Transactional Information Systems: Theory, Algorithms, and Practice of Concurrency Control and Recovery

Gerhard Weikum and Gottfried Vossen

Spatial Databases: With Application to GIS

Philippe Rigaux, Michel Scholl, and Agnes Voisard

Information Modeling and Relational Databases: From Conceptual Analysis to Logical Design

Terry Halpin

Component Database Systems

Edited by Klaus R Dittrich and Andreas Geppert

Managing Reference Data in Enterprise Databases: Binding Corporate Data to the Wider World

Malcolm Chisholm

Understanding SQL and Java Together: A Guide to SQLJ, JDBC, and Related Technologies

Jim Melton and Andrew Eisenberg

Database: Principles, Programming, and Performance, Second Edition

Patrick and Elizabeth O'Neil

The Object Data Standard: ODMG 3.0

Edited by R G G Cattell and Douglas K Barry

Data on the Web: From Relations to Semistructured Data and XML

Serge Abiteboul, Peter Buneman, and Dan Suciu

Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations

Ian Witten and Eibe Frank

Understanding SQL's Stored Procedures: A Complete Guide to SQL/PSM

Clement T Yu and Weiyi Meng

Advanced Database Systems

Carlo Zaniolo, Stefano Ceri, Christos Faloutsos, Richard T Snodgrass, V S Subrahmanian, and Roberto Zicari

Principles of Transaction Processing

Philip A Bernstein and Eric Newcomer

Using the New DB2: IBM's Object-Relational Database System

Edited by Jennifer Widom and Stefano Ceri

Migrating Legacy Systems: Gateways, Interfaces, & the Incremental Approach

Michael L Brodie and Michael Stonebraker

Atomic Transactions

Nancy Lynch, Michael Merritt, William Weihl, and Alan Fekete

Location-Based Services

Jochen Schiller and Agnès Voisard

Database Modeling with Microsoft® Visio for Enterprise Architects

Terry Halpin, Ken Evans, Patrick Hallock, Bill Maclean

Designing Data-Intensive Web Applications

Stephano Ceri, Piero Fraternali, Aldo Bongio, Marco Brambilla, Sara Comai, and Maristella Matera

Mining the Web: Discovering Knowledge from Hypertext Data

Soumen Chakrabarti

Advanced SQL: 1999 Understanding Object-Relational and Other Advanced Features

Jim Melton

Database Tuning: Principles, Experiments, and Troubleshooting Techniques

Dennis Shasha and Philippe Bonnet

SQL:1999 Understanding Relational Language Components

Jim Melton and Alan R Simon

Information Visualization in Data Mining and Knowledge Discovery

Edited by Usama Fayyad, Georges G Grinstein, and Andreas Wierse

Joe Celko's SQL for Smarties: Advanced SQL Programming, Second Edition

Cynthia Maro Saracco

Readings in Database Systems, Third Edition

Edited by Michael Stonebraker and Joseph M

Query Processing for Advanced Database Systems

Edited by Johann Christoph Freytag, David Maier, and Gottfried Vossen

Transaction Processing: Concepts and Techniques

Jim Gray and Andreas Reuter

Building an Object-Oriented Database System: The Story of O2

Edited by François Bancilhon, Claude Delobel, and Paris Kanellakis

Database Transaction Models for Advanced Applications

Edited by Ahmed K Elmagarmid

A Guide to Developing Client/Server SQL Applications

Setrag Khoshafian, Arvola Chan, Anna Wong, and Harry K T Wong

The Benchmark Handbook for Database and Transaction Processing Systems, Second Edition

Edited by Jim Gray

Camelot and Avalon: A Distributed Transaction Facility

Edited by Jeffrey L Eppinger, Lily B Mummert, and Alfred Z Spector

Readings in Object-Oriented Database Systems

Edited by Stanley B Zdonik and David Maier


ELSEVIER

Amsterdam • Boston • Heidelberg • London • New York • Oxford • Paris • San Diego • San Francisco • Singapore • Sydney • Tokyo

Morgan Kaufmann Publishers

Dartmouth Publishing, Inc

Elliot Simon; Jacqui Brownstein; Northwind Editorial Services; Maple-Vail Book Manufacturing Group; Phoenix Color

Morgan Kaufmann Publishers is an imprint of Elsevier

500 Sansome Street, Suite 400, San Francisco, CA 94111

This book is printed on acid-free paper.

Designations used by companies to distinguish their products are often claimed as trademarks or registered trademarks. In all instances in which Morgan Kaufmann Publishers is aware of a claim, the product names appear in initial capital or all capital letters. Readers, however, should contact the appropriate companies for more complete information regarding trademarks and registration.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, scanning, or otherwise) without prior written permission of the publisher.

Permissions may be sought directly from Elsevier's Science & Technology Rights Department in Oxford, UK: phone: (+44) 1865 843830, fax: (+44) 1865 853333, e-mail: permissions@elsevier.co.uk. You may also complete your request on-line via the Elsevier homepage (http://elsevier.com) by selecting "Customer Support" and then "Obtaining Permissions."

Library of Congress Cataloging-in-Publication Data

Application submitted

ISBN 13: 978-1-55860-711-8

ISBN 10: 1-55860-711-0

For information on all Morgan Kaufmann publications, visit our Web site at www.mkp.com or www.books.elsevier.com.

Printed in the United States of America

06 07 08 09 10 5 4 3 2 1

To rescued Shelties, and Shelties in need of rescue, everywhere. Especially to senior Shelties who, after years of devotion to their owners, are cruelly discarded for the most pathetic of reasons: "We're thinking about moving", "She's just in the way", "He's too old to be fun any more", and the worst of all - "We're getting a puppy and, you know ..." And to the loving people who welcome these old dogs into their lives, knowing that older Shelties are calmer, settled, cuddly, and devoted - they selflessly deal with medical needs, arthritic limitations, and the piddles of old age. Wonderful karma accrues to these people for giving these seniors love and respect, allowing them to live out their lives in comfort and happiness.

Jim

To my Mum and Dad, for their long, long journey

Stephen


• Additional resources
• Type conventions
• Acknowledgements

Part I XML: Documents and Data

Chapter 1 XML
1.1 Introduction
1.2 Adding Markup to Data
  1.2.1 Raw Data
  1.2.2 Separating Fields
  1.2.3 Grouping Fields Together
  1.2.4 Naming Fields
  1.2.5 A Structural Map of the Data
  1.2.6 Markup and Meaning
  1.2.7 Why XML?
1.3 XML-Based Markup Languages
1.4 XML Data
  1.4.1 Structured Data
  1.4.2 Unstructured Data

  1.5.2 Presentation Languages - Presentation Only
  1.5.3 SGML
  1.5.4 HTML
Chapter Summary

Chapter 2 Querying
2.1 Introduction
  2.1.1 Definitions of Query
2.2 Querying Traditional Data
  2.2.1 The Relational Model and SQL
  2.2.2 Extensions to SQL
  2.2.3 Querying Traditional Data - Summary
2.3 Querying Nontraditional Data
  2.3.1 Metadata
  2.3.2 Objects
  2.3.3 Markup
  2.3.4 Querying Content
2.4 Chapter Summary

Chapter 3 Querying XML
3.1 Introduction
3.2 Navigating an XML Document
  3.2.1 Walking the XML Tree
  3.2.2 Some Additional Wrinkles
  3.2.3 Summary - Things to Consider
3.3 What Do You Know about Your Data?
3.4 Some Ways to Query XML Today
3.5 Chapter Summary

Part II Metadata and XML

Chapter 4 Metadata - An Overview
4.1 Introduction
4.2 Structural Metadata
4.3 Semantic Metadata
4.4 Catalog Metadata
4.5 Integration Metadata
4.6 Chapter Summary

Chapter 5 Structural Metadata
5.1 Introduction
5.2 DTDs
  5.2.1 SGML Heritage
  5.2.2 Relatively Simple, Easy to Write, and Easy to Read
  5.2.3 Limited Capabilities, Especially with Respect to Data Types
  5.2.4 An Example Document and DTD
5.3 XML Schema
  5.3.1 Exploring an XML Schema
  5.3.2 Simple Types (Primitive Types and Derived Types)
  5.3.3 Complex Types and Structures
5.4 Other Schema Languages for XML
  5.4.1 RELAX NG
  5.4.2 Schematron
  5.4.3 Decisions, Decisions, Decisions
5.5 Deriving an Implied Schema from a DTD
5.6 Chapter Summary

Chapter 6 The XML Information Set (Infoset) and Beyond
6.1 Introduction
6.2 What Is the Infoset?
6.3 The Infoset Information Items and Their Properties
6.4 The Infoset vs. the Document
6.5 The XPath 1.0 Data Model
6.6 The Post-Schema-Validation Infoset (PSVI)
  6.6.1 Infoset + Additional Properties and Information Items
  6.6.2 Additional Information in the PSVI
  6.6.3 Limitations of the PSVI
  6.6.4 Visualizing the PSVI
6.7 The Document Object Model (DOM) - An API
6.8 Introducing the XQuery Data Model
6.9 A Note Regarding Data Model Terminology
6.10 Chapter Summary and Further Reading

Part III Managing and Storing XML for Querying

Chapter 7 Managing XML: Transforming and Connecting
7.1 Introduction
7.2 Transforming, Formatting, and Displaying XML
  7.2.1 Extensible Stylesheet Language Transformations (XSLT)
  7.2.2 Extensible Stylesheet Language: Formatting Objects (XSL FO)
7.3 The Relationships between XML Documents
  7.3.1 XML Inclusions (XInclude)
  7.3.2 XML Pointer Language (XPointer)
  7.3.3 XML Linking Language (XLink)
7.4 Relationship Constraints: Enforcing Consistency
7.5 Chapter Summary

Chapter 8
8.1 Introduction
8.2 The Need for Persistence
  8.2.1 Databases
  8.2.2 Other Persistent Media
  8.2.3 Shredding Your Data
8.3 SQL/XML's XML Type
8.4 Accessing Persistent XML Data
8.5 XML on the Fly: Nonpersistent XML Data
8.6 Chapter Summary

Part IV Querying XML

Chapter 9 XPath 1.0 and XPath 2.0
9.1 Introduction
9.2 XPath 1.0
  9.2.1 Expressions
  9.2.2 Contexts
  9.2.3 Paths and Steps
  9.2.4 Axes and Shorthand Notations
  9.2.5 Node Tests
  9.2.6 Predicates
  9.2.7 XPath Functions
  9.2.8 Putting the Pieces Together
9.3 XPath 2.0 Components
  9.3.1 Expressions
  9.3.2 The for and return Expressions
9.4 XPath 2.0 and XQuery 1.0
9.5 Chapter Summary

Chapter 10 Introduction to XQuery 1.0
10.1 Introduction
10.2 A Brief History
10.3 Requirements
  10.3.1 General Requirements for XQuery
  10.3.2 Data Model Requirements
  10.3.3 XQuery Functionality Requirements
  10.3.4 XPath 2.0 Requirements
10.4 Use Cases
10.5 The XQuery 1.0 Suite of Specifications
  10.5.1 XQuery 1.0 Language Specification
  10.5.2 XPath 2.0 and XQuery 1.0 Formal Semantics
  10.5.3 XPath 2.0 and XQuery 1.0 Functions & Operators
  10.5.4 XQuery 1.0 Serialization
  10.5.5 XQueryX
10.6 The Data Model
  10.6.1 Data Model Instances
  10.6.2 What Is an XQuery Data Model Instance?
  10.6.3 The Seven Kinds of Nodes
  10.6.4 The Data Model as Tree - Representing a Well-Formed Document
  10.6.5 The Data Model as Sequence - Representing an Arbitrary Sequence
10.7 The XQuery Type System
  10.7.1 What Is a Type System Anyway?
  10.7.2 XML Schema Types
  10.7.3 From XML Schema to the XQuery Type System
  10.7.4 Types and Queries
10.8 XQuery 1.0 Formal Semantics and Static Typing
  10.8.1 Notations
  10.8.2 Static Typing
  10.8.3 Dynamic Semantics
10.9 Functions and Operators
  10.9.1 Functions
  10.9.2 Operators
10.10 XQuery 1.0 and XSLT 2.0 Serialization
  10.10.1 XML Output Method
  10.10.2 XHTML Output Method
  10.10.3 HTML Output Method
  10.10.4 Text Output Method
10.11 Chapter Summary

Chapter 11 XQuery 1.0 Definition
11.1 Introduction
11.2 Overview of XQuery
  11.2.1 Concepts
11.3 The XQuery Processing Model
  11.3.1 The Static Context
  11.3.2 The Dynamic Context
11.4 The XQuery Grammar
11.5 XQuery Expressions
  11.5.1 Literal Expressions
  11.5.2 Constructor Functions
  11.5.3 Sequence Constructors
  11.5.4 Variable References
  11.5.5 Parenthesized Expressions
  11.5.6 Context Item Expression
  11.5.7 Function Calls
  11.5.8 Filter Expressions
  11.5.9 Node Sequence-Combining Expressions
  11.5.10 Arithmetic Expressions
  11.5.11 Boolean Expressions: Comparisons and Logical Operators
  11.5.12 Constructors - Direct and Computed
  11.5.13 Ordered and Unordered Expressions
  11.5.14 Conditional Expression
  11.5.15 Quantified Expressions
  11.5.16 Expressions on XQuery Types
  11.5.17 Validation Expression
11.6 FLWOR Expressions
  11.6.1 The for Clause and the let Clause
  11.6.2 The where Clause
  11.6.3 The order by Clause
  11.6.4 The return Clause
11.7 Error Handling
11.8 Modules and Query Prologs
  11.8.1 Prologs
  11.8.2 Main Modules
  11.8.3 Library Modules
11.9 A Longer Example with Data
11.10 XQuery for SQL Programmers
11.11 Chapter Summary

Chapter 12 XQueryX
12.1 Introduction
12.2 How Far to Go?
  12.2.1 Trivial Embedding
  12.2.2 Fully-Parsed XQuery
  12.2.3 The XQueryX Approach
12.3 The XQueryX Specification
12.4 XQueryX By Example
  12.4.1 The Simplest XQueryX Example - 42
  12.4.2 Simple XQueryX Example
  12.4.3 Useful XQuery Example
12.5 Querying XQueryX
  12.5.1 Querying XQueryX for XQuery Tuning
  12.5.2 Querying XQueryX for Application Improvement
12.6 Chapter Summary

Chapter 13 What's Missing?
13.1 Introduction
13.2 Full-Text
  13.2.1 What Is a Full-Text Query?
  13.2.2 Full-Text and XML
  13.2.3 Defining XQuery Full-Text
  13.2.4 W3C XQuery Full-Text - Grammar Extension
  13.2.5 W3C XQuery Full-Text - Some Discussion Topics
  13.2.6 XQuery Full-Text - Some Implementations
13.3 Update
  13.3.1 Motivation: Where/Why We Need Update
  13.3.2 Requirements
  13.3.3 Alternatives: Syntax and Semantics
  13.3.4 How Products Handle Update Today
  13.3.5 What Lies Ahead?
13.4 Chapter Summary

Chapter 14 XQuery APIs
14.1 Introduction
14.2 Alphabet-Soup Review
  14.2.1 ODBC and JDBC
  14.2.2 DOM, SAX, StAX, JAXP, JAXB
  14.2.3 Alphabet-Soup Summary
14.3 XQJ - XQuery for Java
  14.3.1 Connecting to a Data Source
  14.3.2 Executing a Query
  14.3.3 Manipulating XML Data
  14.3.4 Static and Dynamic Context
  14.3.5 Metadata
  14.3.6 Summary
14.4 SQL/XML
14.5 Looking Ahead

Chapter 15 SQL/XML
15.1 Introduction
15.2 SQL/XML Publishing Functions
  15.2.1 Examples
  15.2.2 XMLAGG
  15.2.3 XMLFOREST
  15.2.4 XMLCONCAT
  15.2.5 Summary
15.3 XML Data Type
15.4 XQuery Functions
  15.4.1 XMLQUERY
  15.4.2 XMLTABLE
  15.4.3 XMLEXISTS
15.5 Managing XML in the Database
15.6 Talking the Same Language - Mappings
  15.6.1 Character Sets
  15.6.2 Names
  15.6.3 Types and Values
15.7 Chapter Summary

Part V Querying and The World Wide Web

Chapter 16 XML-Derived Markup Languages
16.1 Introduction
16.2 Markup Languages
  16.2.1 MathML
  16.2.2 SMIL
  16.2.3 SVG
16.3 Discovery on the World Wide Web
16.4 Customized Query Languages
16.5 Chapter Summary

Chapter 17 Internationalization: Putting the "W" in "WWW"
17.1 Introduction
17.2 What Is Internationalization?
17.3 Internationalization and the World Wide Web
  17.3.1 Unicode
  17.3.2 W3C Character Model for the World Wide Web
17.4 Internationalization Implications: XPath, XQuery, and SQL/XML
17.5 Chapter Summary

Chapter 18 Finding Stuff
18.1 Introduction
18.2 Finding Structured Data - Databases
18.3 Finding Stuff on the Web - Web Search
  18.3.1 The Google Phenomenon
  18.3.2 Metadata
  18.3.3 The Semantic Web - The Search for Meaning
  18.3.4 The Deep Web - Feel the Width
18.4 Finding Stuff at Work - Enterprise Search
18.5 Finding Other People's Stuff - Federated Search
18.6 Finding Services - WSDL, UDDI, WSIL, RDDL
18.7 Finding Stuff in a More Natural Way
18.8 Putting It All Together - The Semantic Web+

Appendix A The Example
A.1 Introduction
A.2 Example Data
  A.2.1 Movies We Own
A.3 Some Examples from the Book
  A.3.1 XQuery Examples
  A.3.2 SQL/XML Examples
A.4 A Simple Web Application
A.5 Summary

Appendix B Standards Processes
B.1 Introduction
B.2 World Wide Web Consortium (W3C)
  B.2.1 What Is the W3C?
  B.2.2 The W3C Process Document
  B.2.3 The W3C Stages of Progression
B.3 Java Community Process (JCP)
  B.3.1 What Is the JCP?
  B.3.2 JSRs and Expert Groups: Formation and Operation
  B.3.3 The JSR Stages of Progression
B.4 De Jure Standards: ANSI and ISO
  B.4.1 The De Jure Process and Organizations
  B.4.2 The SQL/XML Standardization Environment
  B.4.3 Stages of Progression
B.5 Summary

Appendix C Grammars
C.1 Introduction
C.2 XQuery Grammar
C.3 SQL/XML Grammar
C.4 Chapter Summary

Foreword

by Don Chamberlin

IBM Fellow

Almaden Research Center

Companies come and go in the database industry, but one thing remains constant: Jim Melton remains at the center of the database standards community. For more years than anyone cares to remember, Jim has served as editor of the international standard for the SQL database language. Perhaps more importantly, he has translated this standard into terminology that ordinary people can understand and has made it accessible to everyone in a series of successful books.

Now the database world is undergoing its most important transition since the advent of the relational data model in the 1970s. A new self-describing data format, XML, is emerging as the standard format for exchange of semi-structured data on the Web. XML is fundamentally different from relations because it carries descriptive metadata with each data instance rather than storing it in a separate catalog. This new format gives unprecedented flexibility for representing various types of data, but at the same time it requires a new approach to query.

A collection of query-related standards is emerging around the XML data format, and as usual Jim Melton is at the center of the action. Jim is co-chair of the W3C XML Query Working Group, which is creating an important new language called XQuery and (together with the XSLT Working Group) is revising the well-known XPath language. Jim is also co-Spec Lead for XQJ, the Java interface to XQuery that is being developed under the Java Community Process. In addition, as editor of the SQL Standard, Jim serves as editor of SQL/XML, the set of SQL extensions that enable relational databases to store and query XML data.

Stephen Buxton is also a long-time member of the W3C XML Query Working Group, and a specialist in full-text search and retrieval. Stephen's expertise in approximate queries on unstructured text complements Jim's long experience with exact queries on structured data.

In short, there is no more authoritative pair of authors on Querying XML than Jim Melton and Stephen Buxton. Best of all, as readers of Jim's other books know, his informal writing style will teach you what you need to know about this complex subject without giving you a headache. If you need a comprehensive and accessible overview of Querying XML, this is the book you have been waiting for.

Don Chamberlin

December 2005

Preface

Why the subject matter is important

In a remarkably short period, XML has arguably become the most important language for marking up documents for the World Wide Web and for industry in general. Equally important, XML is rapidly becoming the lingua franca for marking up traditional business data, for exchanging information between business partners and between application programs, and for expressing a host of concepts that improve the usability of computer systems.

While it may be tempting to view XML as a "silver bullet" - a solution to all of our problems - the truth is a bit more prosaic: XML is merely a tool (admittedly a very important one) that can help solve a significant range of problems. Like most tools, XML introduces tradeoffs and complications. Among the difficulties that XML users will increasingly encounter are the ones posed by locating and retrieving information stored in documents marked up using XML.

As you'll learn in this book, there are many approaches to querying XML documents and repositories of such documents. We cannot claim to have addressed every possible approach, or even every approach in use at the time we wrote this book. There are simply too many possibilities and alternatives, too many researchers and practitioners inventing new technologies. Instead, we have focused on the

is properly called an XML document. XML that cannot stand by itself is sometimes called an XML fragment. In general, throughout this book, we use the word "document" or "fragment" when a specific sort of XML is being referenced and we need to be clear about the nature of that XML. Otherwise, we mostly use the raw term "XML" and depend on the context to disambiguate our usage.
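The distinction between a document and a fragment can be illustrated with a short Python sketch. It rests on an assumption that goes beyond the text above: "standing by itself" is taken to mean that the text is well-formed XML with exactly one top-level element. The function names `stands_alone` and `parse_fragment`, the `wrapper` element, and the sample movie markup are all hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def stands_alone(xml_text):
    """Return True if xml_text parses as a complete XML document.

    Assumption of this sketch: a document is well-formed XML with a
    single top-level element; anything else is treated as a fragment.
    """
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def parse_fragment(xml_text):
    """Parse a fragment by wrapping it in a synthetic root element.

    The <wrapper> element is hypothetical and exists only so that the
    parser sees one top-level element; its children are returned.
    """
    root = ET.fromstring(f"<wrapper>{xml_text}</wrapper>")
    return list(root)


document = "<movie><title>Alien</title></movie>"
fragment = "<title>Alien</title><title>Aliens</title>"

print(stands_alone(document))         # one top-level element
print(stands_alone(fragment))         # two top-level elements
print(len(parse_fragment(fragment)))  # elements recovered from the fragment
```

Wrapping a fragment in a synthetic root is one common way to hand it to a parser that only accepts complete documents; it is not the only one, and it only works when the fragment's own markup is balanced.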

Why we wrote this book

"XML" is an enormous topic for any individual to understand. The term has come to imply much more than the markup language of the same name. Due in large part to the versatility of the markup language and the enormous utility of the Internet and the World Wide Web, there are countless computer scientists and software engineers developing specifications, tools, application programs, and even hardware that use or depend on some use of XML.

There are many fine books available that can teach you how to mark up your documents and your data with XML, how to use the eXtensible Stylesheet Language (XSL) to transform documents into other documents, how to use the many tools such as XML parsers and XSL transformation engines, and so forth. There are even several available books focused exclusively on XQuery, the almost-finalized W3C XML Query language.

But we have not seen any books that cover a broader subject that we think is vital: how to locate information in documents that are marked up using XML and how to find and extract that information in repositories of such documents. It is certainly important to mark up your documents and your data to capture the meaning inherent in them, but tremendous additional value is available when you can use powerful query facilities that not only find certain documents in a repository, but also find and extract the fine-grained information contained in those documents.

In this book, we identify and explore several approaches to querying XML documents, concentrating on those that we believe are most likely to be important in the near-to-medium future. We also give you a perspective on some of the other technologies that are closely related to the subject of querying XML. In doing so, we not only give you valuable insights about locating and retrieving information in XML documents, but also put the subject into the contexts in which it will be used.

Who should read this book

We wrote this book primarily to benefit software engineers who have to design and build applications that use XML and to access documents and data presented in an XML form. While the subject is necessarily technical in nature and presentation, we decline to focus exclusively on production of lines of code. Instead, we approach mastery of the subject by ensuring that readers understand the reason a particular topic is important, that they know the context in which the topic is relevant, that the principles of the topic are made clear, and that the details of writing code appropriate to the topic are illustrated and exemplified.

The book should be of interest to more than just software developers, though. Architects of software systems that use XML must know how search and retrieval issues are to be handled, while managers and team leaders need an understanding of the relationships between XML markup and storage and future retrieval of documents based on the semantics of the information they contain.

How the book is organized

This book is divided into several parts. Part I, "XML: Documents and Data", starts off with a survey of structured document technology and examines several languages used to produce and/or represent such documents. It continues with an exploration of the problems associated with querying data generally, as well as with searching XML documents, and includes a comparison of querying XML with the use of SQL to query traditional data.

Part II, "Metadata and XML", introduces the subject of metadata for XML - information that describes XML documents and markup languages. This part covers Document Type Definitions (DTDs) and XML Schemas (with some attention given to competing XML schema definition languages). We discuss the "meaning" of XML markup and survey its use in a number of different XML-related markup languages. This part finishes with a presentation of XML's Information Set (commonly known as the Infoset) and an introduction to several other data models used to describe XML documents in a formal manner.

Part III, "Managing XML for Querying", looks at the different sorts of databases (e.g., relational, object-relational, object-oriented, and so-called "native XML") in which XML documents are being stored. It also examines several other W3C specifications that play a role in XML documents that might be queried. This part of the book includes some information about a number of current products that are used to store, manage, query, and retrieve XML documents.

Part IV, "Querying XML", is the technical heart of the book, describing four ways to query XML. XPath (the XML Path Language) is already an established language for querying within an XML document, so this part begins with a significant discussion of XPath and its usage for XML querying. XQuery is a brand new language designed specifically for querying XML, so we will spend a lot of time and detail on it, including an analysis of the type system and data model used by that language, an examination of the formal semantics of the language, and a discussion (replete with examples) of the use of XQuery and its companion XQueryX. SQL is the leading query language for structured data today. We explore the ways that SQL can be used to query XML, especially if the XML is "shredded" and stored in an object-relational form. Finally, in this Part we discuss SQL/XML, a set of extensions to SQL that leverage XPath and XQuery to overcome some of SQL's limitations in managing semistructured data.

Part V, "Querying and the World Wide Web", provides a look at a number of specific XML-based markup languages and responds to the question of whether XPath, XQuery, SQL, and/or SQL/XML are suitable for querying documents that are marked up using such languages or whether other, more specific, query facilities are needed to deal with them. It also looks at the ways in which XML is, and is going to be, used on the Internet, both for casual uses like browsing and for industrial uses such as data interchange between business partners. The impacts of internationalization on XML and related specifications are addressed here as well.

We finish up the book with appendices that give you a glimpse into the way in which open standards like XML, XQuery, and SQL/XML are developed, that contain the complete grammar of XQuery, that list and describe all of the SQL/XML functions, and that provide a lengthy set of examples and a small sample of data against which they have been tested.

The example we're using

We are both avid fans of the cinema - which is illustrated by the fact that, between us, we subscribe to just about every possible movie channel offered by satellite television providers. Continuing the tradition started in earlier books written by Jim, we've chosen to use the subject of movies as the basis for our example. We've collected data on a broad range of films and organized it into a sort of "database" that is, in fact, a modestly large XML document. This document - data with XML markup - serves as the foundation for many of our examples. (Note that we do not pretend that our example document is marked up in any sort of optimal way, suitable for industrial use; we chose specific markup styles to illustrate the points we make at various parts of the book.) When the topic demands something a little less data-oriented, we use a smallish textual document that discusses several film-related topics.

Syntax Conventions

In several places in this book, we define the syntax of various language components relevant to XML, XML query languages, and so forth. While we are not particularly fond of the syntax conventions that the W3C has adopted (we find them somewhat less readable than several other conventions), we believe that readers of this book will be best served by consistency of style accompanied by explanations. Therefore, we have (with slight reluctance) adopted the same style used in the W3C specifications that we reference in the book. You may be familiar with those conventions, but we think that a quick summary will help some readers.

A variation of Backus-Naur Form (BNF) is used for syntax presentation. More specifically, a syntactic symbol (called a nonterminal symbol to distinguish it from language components that represent only themselves) is defined using a notation in which the symbol being defined appears to the left of a special operator (::=) and the definition of that symbol appears as an expression written following that operator. For example:


nonterminal-x ::= nonterminal-y ( ',' nonterminal-y )*

That line, called a BNF production, defines a nonterminal symbol (nonterminal-x) by saying that it is made up of a second nonterminal symbol (nonterminal-y), optionally followed by zero or more (that's the meaning of the asterisk, *) repetitions of a sequence made up of a literal comma (that's a terminal symbol) and another instance of that second nonterminal symbol (nonterminal-y). Therefore, if nonterminal-y happens to be defined to be an identifier (in XML, these are either QNames or NCNames), then an instance of nonterminal-x might be:

film, cinema, movie
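To make the production concrete, here is a minimal Python sketch of a hand-written recognizer for it. The function name parse_nonterminal_x and the simplified identifier pattern are our own assumptions for illustration; real QNames and NCNames follow more detailed rules than this pattern captures.

```python
import re

# Simplified stand-in for an identifier (assumption: real NCNames allow more).
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*")


def parse_nonterminal_x(text):
    """Parse: nonterminal-x ::= nonterminal-y ( ',' nonterminal-y )*"""
    pos = 0
    items = []

    def skip_whitespace():
        nonlocal pos
        while pos < len(text) and text[pos].isspace():
            pos += 1

    def parse_nonterminal_y():
        nonlocal pos
        skip_whitespace()
        match = IDENTIFIER.match(text, pos)
        if match is None:
            raise ValueError(f"identifier expected at offset {pos}")
        pos = match.end()
        return match.group()

    # One mandatory nonterminal-y ...
    items.append(parse_nonterminal_y())
    skip_whitespace()
    # ... followed by zero or more repetitions of: ',' nonterminal-y
    while pos < len(text) and text[pos] == ",":
        pos += 1
        items.append(parse_nonterminal_y())
        skip_whitespace()
    if pos != len(text):
        raise ValueError(f"unexpected input at offset {pos}")
    return items


print(parse_nonterminal_x("film, cinema, movie"))  # ['film', 'cinema', 'movie']
```

The while loop corresponds directly to the parenthesized group followed by the asterisk: it may run zero times, so a single identifier is also a valid instance.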

One important thing to note is that, in this style of BNF, all terminal symbols are enclosed in quotation marks, which might be single quotation marks (' ') or double quotation marks (" ").

Anything, including parentheses, not enclosed in quotation marks is either a nonterminal symbol or a character used in the BNF to specify its meaning.

Here is a complete list of the conventions used in this book by this style of BNF:

• "string" - the literal string given inside the double quotes

• 'string' - the literal string given inside the single quotes

• a b - a single occurrence of a followed by a single occurrence of b

• a | b - a single occurrence of a or a single occurrence of b

• /* */ - a comment in the BNF (this is unrelated to comments in languages being defined by the BNF, such as XQuery)


Additional resources

The data and queries in Appendix A, plus additional examples and explanations, are available for download from the web site for this book's examples, http://xqzone.marklogic.com/queryingxmlbook/. You may also visit http://www.mkp.com/QueryingXML for more information.

Type conventions

A quick note on the typographical conventions we use in this book seems in order:

• Type in this font is used for all ordinary text.

• Type in this font is used for terms that we define or for emphasis.

• Type in this font is used for all the examples, syntax presentations, keywords, identifiers, and XML text that appear in ordinary text.

Acknowledgements

Writing a book is an immense task and it consumes enormous quantities of resources such as energy, time for research and for writing, and often patience. A book like this one is quite difficult to produce, but difficult tasks often produce commensurately great rewards (financial rewards very rarely among them!). It's exceedingly rare to do it alone - the help, guidance, and support of others is always appreciated: for ideas, for trying out concepts and wording, for reviewing paragraphs and whole chapters, and just for offering encouragement.

We want to give credit to all of the wonderful, talented people who have helped us create this book, especially the following people (alphabetized by their last names) who gave us extensive reviews, which heavily influenced the content and accuracy of this book.

• James Bean, author of "XML for Data Architects: Designing for Reuse and Integration" and "Engineering Global E-


• Muralidhar Krishnaprasad, our friend and colleague at Oracle, who seems to be an expert at all things related to XQuery, especially its implementation

• Zhen Hua Liu, also our friend and colleague at Oracle, who is a driving force behind the implementation of SQL/XML and a constant source of valuable information and observations

Of course, all remaining errors (and we harbor no illusions that we found and eliminated all errors in a subject as complex as this one) are solely our responsibility.

We also offer our deepest gratitude to the wonderful people at Morgan Kaufmann Publishers for their invaluable help and participation in the production of the book. Diane Cerra, our talented and patient editor, who trusted Jim enough to publish his first book, got us started on this book and came back to help us finish it. Two other editors, Lothlórien Homet and Rick Adams, worked with us for several months during the time when we were writing the most difficult chapters.

At various times during the lengthy writing process, Asma Stephan, Corina Derman, Mona Buehler, and Belinda Breyer made themselves available to answer our questions about schedules and production, to track down information that we managed to misplace, to make sure that our chapters were quickly reviewed by the right people, and to give us frequent and friendly reminders of approaching deadlines. Our production manager, Simon Crump, worked closely and patiently with us during the production process, making sure that our drafts were thoroughly copyedited and properly typeset, that our reviews of the galleys were applied to the typeset draft, and that all production errors were promptly handled. Brent dela Cruz, our marketing manager, bears the burden of ensuring that this book is made available to you, our readers. To Diane, Asma, Simon, Brent, and all of the other fantastic people at Morgan Kaufmann, thanks!


Credit must also be given to the incredible group of people who make up the various W3C Working Groups responsible for the specifications discussed in this book. The languages and facilities related to querying XML documents include XML Query (co-chaired by Jim's long-time friend and colleague Andrew Eisenberg), XSL (chaired by the delightful Sharon Adler), and XML Schema (first chaired by one of the most generous and smartest people around, Michael Sperberg-McQueen, and now chaired by our good friend David Ezell, who is proving to be remarkably good at herding cats), among others.

We are particularly grateful to our friends who offered suggestions that certainly improved the content and focus of the book. They include Ashok Malhotra, Andrew Eisenberg, Murali Krishnaprasad, and Zhen Hua Liu.

Finally, we want to express our appreciation to Don Chamberlin for writing the Foreword to this book. Don wrote the Foreword for Jim's first SQL book and it feels like we've reached a sort of closure, coming full circle on SQL and starting a new circle for the next major query language.

Jim: I give special thanks to my wonderful partner, best friend, and spouse, Barbara Edelberg. She took up all the slack when I was stuck at the computer 'til all hours of the night, writing. Barbara had to deal with me on the road and unavailable so much of the time. It was Barbara's emotional support and encouragement, as I agonized over every sentence in the book, that got me through it. I also owe a debt of gratitude to my co-author, friend, and backpacking buddy, Stephen Buxton, for stepping in to write the book with me - he joined me just as I was falling into despair at the magnitude of the task and the difficulty of writing this book while doing my "day job."

Stephen: I'd like to say thank you to my family for their support and encouragement - my kids Maria and Samuel, and my other "kids" Jennie and Sarah, and most of all, my lovely wife Veronica ("I thought you said it was finished!"), who has stuck with me through many, many late nights and weekends. I'd also like to thank my co-author, erstwhile colleague, and very good friend Jim Melton for guiding me through my first authoring experience. Thanks Jim!

XML: Documents and Data

XML - the Extensible Markup Language - defines a set of rules for adding markup to data. Markup adds structure to data, and gives us a way of talking about the meaning of that data. The family of XML technologies provides a way to standardize the representation of data, so that we can process any data with standard programs, share data across applications, and transfer data from one person or application to another. In this first chapter, we introduce XML by looking at what markup is and what it's good for. Then we look at a number of different uses for XML - a number of different kinds of XML data. Finally, we give examples of other ways to represent data, and compare them with XML.

Adding Markup to Data

Let's take the movies example (Appendix A: The Example) used throughout this book. We have data describing many of our favorite movies. The data includes the title of the movie, the year it was first released, the names of some of the cast members, and other information.

Raw Data

We could represent our movie data in raw form, as in Example 1-1.

Example 1-1 movie, Raw Data

An American Werewolf in London1981LandisJohnFolseyGeorge,
Jr.GuberPeterPetersJon98NaughtonDavidmaleDavid KesslerAgutterJennyfemaleAlex Price

Example 1-1 is the raw data for one movie - a single record. In this format, the data doesn't tell you much about the movie. You can probably spot the title, and, if you are familiar with "An American Werewolf in London," you may be able to glean some information by means of educated guesswork. But if you wanted to write a program to read this data and do something with it - such as finding the name of the director - you would have to write code specifically for this piece of data (e.g., code that extracts the characters at positions 41 through 44 and 35 through 40 and adds a space in between them). What we need is some way to represent the data so that a program (or person) can process any movie record in the same way.
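The record-specific code described above might look like the following Python sketch. The function name is hypothetical, and we assume the record is held as one string with no line break inside it; the positions in the text are 1-based, so they become 0-based slices here.

```python
# Raw record from Example 1-1, assumed to be a single string without line breaks.
RECORD = (
    "An American Werewolf in London1981LandisJohnFolseyGeorge, "
    "Jr.GuberPeterPetersJon98NaughtonDavidmaleDavid Kessler"
    "AgutterJennyfemaleAlex Price"
)


def director_of_this_record(raw):
    """Works only for this one record: the offsets are hard-coded."""
    given_name = raw[40:44]   # characters 41 through 44
    family_name = raw[34:40]  # characters 35 through 40
    return given_name + " " + family_name


print(director_of_this_record(RECORD))  # John Landis
```

Any movie with a title of a different length shifts every offset, which is why this approach cannot process other records in the same way.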

Separating Fields

A simple way to add some rudimentary structure to this record is to add a comma between each of the data items, or fields.

Example 1-2 movie, Fields Separated by Commas

An American Werewolf in London,1981,Landis,John,Folsey,George\, Jr.,Guber,Peter,Peters,Jon,98,Naughton,David,male,David Kessler,Agutter,Jenny,female,Alex Price

Example 1-2 is the same movie data represented as a comma-separated list. Notice that, even with this simple mechanism, we had to introduce the "\" (backslash) character to "escape" a comma that was actually part of the data.
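A minimal sketch of how a program might split such a record while honoring the backslash escape is shown below. The function name split_fields is our own, and the record is assumed to be one line with no spaces around the separating commas.

```python
def split_fields(record, separator=",", escape="\\"):
    """Split a record on separators, treating an escaped character as data."""
    fields = []
    current = []
    escaped = False
    for ch in record:
        if escaped:
            current.append(ch)      # the escaped character is kept literally
            escaped = False
        elif ch == escape:
            escaped = True          # drop the backslash, keep what follows
        elif ch == separator:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


RECORD = (
    r"An American Werewolf in London,1981,Landis,John,Folsey,George\, Jr.,"
    r"Guber,Peter,Peters,Jon,98,Naughton,David,male,David Kessler,"
    r"Agutter,Jenny,female,Alex Price"
)

fields = split_fields(RECORD)
print(len(fields))  # 19
print(fields[5])    # George, Jr.
```

Without the escape handling, "George" and " Jr." would be returned as two separate fields and every later field would be off by one.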

There are other ways to distinguish between fields of a record. In the early days of computing, fixed-length fields were common -


as a continuation marker)

Let's continue our discussion with the comma-separated list in Example 1-2. You can spot the fields in this record, but there is no way of knowing which fields go together. For example, the fields "Agutter, Jenny," "female," and "Alex Price" each describe one aspect of a cast member, but it's not apparent from the comma-separated list that those fields have anything in common. We have a way of delineating fields; now we need some way of grouping fields together.

Grouping Fields Together

Example 1-3 groups fields together. It also introduces a hierarchy of fields and subfields. Fields are separated by one or more commas, and fields that belong together are bounded by "," at the start and "$," at the end.

Example 1-3 movie, Grouped Fields

,An American Werewolf in London$,,1981$,

,

,Landis$,,John$,

,Guber$,,Peter$,

,Peters$,


Example 1-3 is shown with some extra white space - each subfield starts on a new line, and is indented. This is purely for (human) readability.

Now we know that "Agutter, Jenny, female, Alex Price" all belongs together and is all related in some way to "An American Werewolf in London." And if you want to write a program to extract the director of each movie, given that each movie is formatted in the same way as in Example 1-3, you can write some general code that will parse the movie into first, second, and third fields, extract the contents of the third field, and parse that to get the first and last name of the director.
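The general, position-based code described above can be sketched in Python as follows. We assume, purely for illustration, that the grouped record has already been parsed into nested lists; the list layout and the function names are ours, not part of the original example.

```python
# Assumed parsed form of a grouped record: fields in order, groups as nested lists.
movie = [
    "An American Werewolf in London",  # first field: title
    "1981",                            # second field: year of release
    ["Landis", "John"],                # third field: director (family, given)
]


def flatten(field):
    """Return a field's text; a group is flattened by joining its subfields."""
    if isinstance(field, str):
        return field
    return "".join(flatten(sub) for sub in field)


def director(record):
    third = record[2]
    # Second subfield, a space, then the first subfield.
    return third[1] + " " + third[0]


def year_released(record):
    return flatten(record[1])


print(director(movie))       # John Landis
print(year_released(movie))  # 1981

# The same record with the year accidentally left out.
movie_without_year = [movie[0], movie[2]]
print(year_released(movie_without_year))  # LandisJohn
```

Because a field is identified only by its position, the last call returns the director's name as if it were a year, and nothing in the data signals that anything is missing.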

We are making progress! But Example 1-3 still has some shortcomings. There is no indication of what a field represents, other than its position within the record, which makes it difficult for humans to read. This has two implications - first, the data is vulnerable to error. If you (or the program generating the data) make a mistake and leave out the year of release, it's not obvious that anything is missing, and a program processing this data may well return "LandisJohn" when asked for the year of release. Second, it makes it difficult to talk about the data. Most of the time, when we want to "talk about" the data, we want to describe some manipulation to a program - i.e., it's difficult to write a program that says things like "print the second field of the third field of the movie record, then a space, then the first field of the third field of the movie record." Our next step is to name the fields and subfields.

Naming Fields

If you read Example 1-3, you can probably guess that "An American Werewolf in London" is the title of the movie, and you may even deduce that Jenny Agutter plays the female lead, a character named Alex Price. But who is Peter Guber? And what does "98" mean?


What we need is a way to name each field, to make it easier to talk about the fields - to write programs that manipulate them - and also to give some clue as to what the fields actually mean. We could devise a way to represent field names as part of our comma-separated list - perhaps each comma would be followed by a field name.

Example 1-4 is close to the XML representation of movie data that we will use for the rest of this book. The "," and "$," have been


replaced by "<tagname>" and "</tagname>." Each field in this record - in XML terms, each element in this document - has a name. We can now refer to elements by name and by their position with respect to other named elements. And when the name is something meaningful, such as "producer," it gives a hint to the human reader about what the data means. All we need now is a map of the data - actually two maps, one to tell us what the structure of a movie record (a valid movie document) looks like, the other to tell us what each element actually means.
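As a rough illustration of referring to elements by name, the Python sketch below parses a small movie fragment with the standard xml.etree.ElementTree module. The fragment is our own abbreviation, built from the element names used in this chapter, and is not the book's full example document.

```python
import xml.etree.ElementTree as ET

# Abbreviated, illustrative movie document (assumption: not the full example).
MOVIE_XML = """
<movie>
  <title>An American Werewolf in London</title>
  <yearReleased>1981</yearReleased>
  <director>
    <familyName>Landis</familyName>
    <givenName>John</givenName>
  </director>
  <runningTime>98</runningTime>
</movie>
"""

movie = ET.fromstring(MOVIE_XML)

# Fields are now addressed by name rather than by position.
director = movie.find("director")
director_name = director.findtext("givenName") + " " + director.findtext("familyName")
print(director_name)  # John Landis

# If the year is left out, the lookup reports that it is missing
# instead of silently returning a neighboring field.
movie.remove(movie.find("yearReleased"))
print(movie.findtext("yearReleased"))  # None
```

The same two lookups work for any movie document that uses these element names, regardless of how long the title is or how many other elements are present.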

A Structural Map of the Data

One useful kind of data map tells you something about the structure, or "shape," of the document - which fields are subfields of others and in what order they can appear in the document. Such a map is obviously useful for someone manipulating the data, since she needs to know that the director element contains a familyName and a givenName. It's also useful for error-checking and consistency - every movie has a director, so if the director element is missing, then the data is corrupted or at best incomplete. Let's take a look at a couple of structural data maps for XML - DTDs and XML Schemas.¹

DTD - Document Type Definition

An early attempt at providing a map for XML was the DTD, or Document Type Definition (actually the DTD was inherited from SGML - see Section 1.5.3). A DTD defines what elements and attributes are allowed, where, and in what order. A DTD may also enumerate the values allowed for each attribute (but not for elements), and it may identify some attributes as type ID (meaning they must have a value that is unique across the XML document) or IDREF (meaning they must match some attribute of type ID). Example 1-5 shows a possible DTD for the movie document.²

¹ See also Chapter 5, "Structural Metadata."

² Example 1-5 is one possible DTD that describes the movie document. When you create a DTD based on a sample document, you can't tell which of the elements in the sample are optional or which elements may occur more than once. Some elements may be optionally present in a document but not present in your sample document. If your document includes attributes, you can't tell which are IDs or IDREFs, and you can only guess at attributes' enumerated values.


<!ELEMENT movie (title, yearReleased, director, producer+, runningTime, cast+)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT yearReleased (#PCDATA)>
<!ELEMENT director (familyName, givenName, otherNames?)>
<!ELEMENT producer (familyName, givenName, otherNames?)>
<!ELEMENT runningTime (#PCDATA)>
<!ELEMENT cast (familyName, givenName, otherNames?, maleOrFemale, character)>
<!ELEMENT familyName (#PCDATA)>
<!ELEMENT givenName (#PCDATA)>
<!ELEMENT otherNames (#PCDATA)>
<!ELEMENT maleOrFemale (#PCDATA)>
<!ELEMENT character (#PCDATA)>

The first line of Example 1-5 says that a movie must contain a title, a yearReleased, a director, at least one producer, a runningTime, and at least one cast (member), in that order. The following lines describe the "shape" of each of these elements. Each simple (leaf) element, though, is described as "#PCDATA" - despite its name (Document Type Definition), the DTD does not give us any data type information.³ For example, it does not distinguish between runningTime (which is probably an integer) and title (which is probably a string).
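One way to see what the first line of the DTD checks, and what it does not, is to treat the content model as a regular expression over child element names. The Python sketch below is a hand-rolled illustration under that assumption, not a DTD validator; the names MOVIE_MODEL and matches_model are our own.

```python
import re
import xml.etree.ElementTree as ET

# Content model (title, yearReleased, director, producer+, runningTime, cast+)
# rewritten as a regular expression; each child name is followed by one space.
MOVIE_MODEL = re.compile(
    r"title yearReleased director (?:producer )+runningTime (?:cast )+"
)


def matches_model(element, model):
    """Check only the order and number of child elements, not their content."""
    names = "".join(child.tag + " " for child in element)
    return model.fullmatch(names) is not None


complete = ET.fromstring(
    "<movie><title>An American Werewolf in London</title>"
    "<yearReleased>1981</yearReleased><director/>"
    "<producer/><producer/>"
    "<runningTime>ninety-eight</runningTime><cast/><cast/></movie>"
)
no_director = ET.fromstring(
    "<movie><title>An American Werewolf in London</title>"
    "<yearReleased>1981</yearReleased>"
    "<producer/><runningTime>98</runningTime><cast/></movie>"
)

print(matches_model(complete, MOVIE_MODEL))     # True
print(matches_model(no_director, MOVIE_MODEL))  # False
```

The first document passes even though its runningTime is not a number, which mirrors the point above: the structure is checked, but every leaf is just character data.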

XML Schema

DTDs have a couple of drawbacks - they don't include any data type information about fields,⁴ and DTDs are not XML documents. XML Schema solves both these problems. Like a DTD, an XML Schema defines where elements may occur in a document, and in what order, in a formal, standard way. But an XML Schema may also describe the data type of the element (integer, string, etc.) and give rules about which values are allowed. And an XML Schema document is itself an XML document.

³ Though the DTD does not give us data type information, it does give us the type of the document, in the sense of Schema's Complex Types.

⁴ A DTD may include some data type information for attributes, such as ID/IDREF type and enumeration.


<xs:element name="familyName" type="xs:string"/>

<xs:element name="givenName" type="xs:string"/>


<xs:enumeration value="male"/>

<xs:enumeration value="female"/>

In Example 1-6, each element in the XML document is described by an element in the XML Schema called xs:element. A simple element such as title is modeled with the attributes name="title" type="xs:string". An element that has children (subfields), such as director, is described by an xs:complexType element - in this case, a sequence of elements. The elements familyName and givenName occur in several places in the XML document (the instance document), so they are defined once at the start of the XML Schema and are pointed to (via the ref attribute) whenever needed.
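Because an XML Schema is itself an XML document, an ordinary XML parser can read it. The Python sketch below follows the ref attributes in a small schema fragment back to their global declarations. The fragment is our own reduction, written only to illustrate the define-once, refer-by-ref pattern; it is not the full schema of Example 1-6.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Illustrative fragment (assumption): two global declarations and one
# complex element that points at them with ref.
SCHEMA = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="familyName" type="xs:string"/>
  <xs:element name="givenName" type="xs:string"/>
  <xs:element name="director">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="familyName"/>
        <xs:element ref="givenName"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""

schema = ET.fromstring(SCHEMA)

# Global declarations are the xs:element children of xs:schema.
global_elements = schema.findall(XS + "element")
declared_types = {e.get("name"): e.get("type") for e in global_elements}

director = next(e for e in global_elements if e.get("name") == "director")

# Resolve each ref inside director to the type of its global declaration.
children = [
    (e.get("ref"), declared_types[e.get("ref")])
    for e in director.iter(XS + "element")
    if e.get("ref") is not None
]
print(children)  # [('familyName', 'xs:string'), ('givenName', 'xs:string')]
```

Defining familyName and givenName once means a change to either declaration applies everywhere the element is referenced.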

Title	Querying XML: XQuery, XPath, and SQL/XML in Context
Authors	Jim Melton, Stephen Buxton
School	University of Information Technology and Communications
Field	Data Management Systems
Type	document
Publication year	2024
City	Hà Nội

Format
Pages	845
File size	33.34 MB