© Copyright IBM Corporation 2004. Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Welcome to:
3.1
What Is XML?
Unit Objectives
After completing this unit, you should be able to:
Describe the basic rules of XML
Describe what it means for an XML document to be well-formed
List the components that make up an XML document
Differentiate between XML and HTML
Describe the internationalization support in XML
Define some best practices for XML
What Is XML?
At its core, XML is text formatted to follow a well-defined set of rules
XML documents consist primarily of tags and text
If you've ever seen the source to an HTML document, then the
XML structure should look familiar
This text may be stored/represented in:
A normal file stored on disk
A message being sent over HTTP
A character string in a programming language
A CLOB (character large object) in a database
Any other way textual data can be used
XML documents do not need to exist as documents; they may be:
Byte streams sent between applications
Fields in a database record
Collections of XML Infoset information items
For simplicity they will be referred to as though they are
documents and files
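Because XML is just text, the same content can be parsed whether it arrives as a character string or as a byte stream. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the sample book document is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document held as an ordinary character string.
xml_text = "<book><title>Alphabet from A to Z</title></book>"

# Parse from a character string ...
root_from_str = ET.fromstring(xml_text)

# ... or from bytes, as they might arrive over HTTP or from a database field.
root_from_bytes = ET.fromstring(xml_text.encode("utf-8"))

print(root_from_str.tag)
print(root_from_bytes.find("title").text)
```

Either way the parser produces the same tree, which is why the storage medium can be ignored in the rest of the unit.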
Example Tree Representation of XML
XML documents should be thought of as a hierarchical tree structure.
[Figure: tree diagram of an XML document; only the node labels "Tom and "The Right Stuff" survive]
A Simple XML Document - Basic Structure
<?xml version="1.0"?>                 "Optional" first line; only required if the encoding is not UTF-8 or UTF-16
<book>
  <title>
    Alphabet from A to Z
  </title>                            First child element with data
  <isbn number="1112-23-4356" />      Empty element (no data)
  <author>                            Begin element tag
    <firstName> Boreng </firstName>
    <lastName> Riter </lastName>      Nested child elements
  </author>                           End element tag
  <chapter title="Letter A">
    The letter A is the first in
    the alphabet. It is also the
    first of five vowels.
  </chapter>                          Element containing an attribute and parsed character data (PCDATA) [TBD]
  <!-- The rest of the letter
       chapters are missing -->       Comment
  <chapter title="Letter Z">
    The letter Z is the last
    letter in the alphabet
  </chapter>                          Last element in document
</book>
A Simple XML Document - Basic Nomenclature
The XML instance on the previous page consists of:
One main element book
Subelements title, isbn, author, and chapter, plus a comment
author contains other subelements firstName and lastName
isbn and chapter contain attributes number and title, respectively
title, firstName, and lastName contain only strings:
Elements that contain numbers, strings, dates, and so forth (TBD) but no subelements (or attributes) are said to have simple types
isbn and chapter carry attributes; author has subelements:
Elements that contain subelements or carry attributes are said to have complex types
Attributes always have simple types (that is, they are numbers, strings, dates, and so forth).
TBD: A later chapter describes XML Schemas, which have access to a collection of built-in simple types
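The simple/complex distinction above can be illustrated with a short sketch. It assumes Python's standard xml.etree.ElementTree module and a trimmed copy of the sample book document; the helper name kind is hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample book document.
doc = """<book>
  <title>Alphabet from A to Z</title>
  <isbn number="1112-23-4356"/>
  <author>
    <firstName>Boreng</firstName>
    <lastName>Riter</lastName>
  </author>
  <chapter title="Letter A">The letter A is the first in the alphabet.</chapter>
</book>"""

def kind(element):
    """Return 'complex' if the element has subelements or attributes, else 'simple'."""
    has_children = len(element) > 0
    has_attributes = len(element.attrib) > 0
    return "complex" if has_children or has_attributes else "simple"

root = ET.fromstring(doc)
# root.iter() visits the root element and every descendant in document order.
kinds = {element.tag: kind(element) for element in root.iter()}
for tag, label in kinds.items():
    print(f"{tag}: {label}")
```

Note that chapter counts as complex even though it holds only text, because it carries the title attribute.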
Basics of Well-formed XML (1 of 2)
XML documents are considered to be well-formed when they adhere to a set of five rules that define basic XML syntax and structure, plus a sixth for worldwide conformity
1. There must be a single root element:
All other elements are nested inside the root element
2. Elements must be properly terminated:
For every opening tag "< >" there must be a matching closing tag "</ >"
The exception is an empty (no content or body) tag "< />"
3. Elements must be properly nested underneath a parent tag (except for the single root element):
A nested tag pair may not overlap another tag
There is no limit to the nesting level of child elements
Basics of Well-formed XML (2 of 2)
4. Tag names are case sensitive:
All tag and attribute names must comply with XML naming rules.
5. Attributes, extra information that can be provided for elements, must be properly quoted:
That is, all attribute values must be in quotes.
6. The first line should contain the XML declaration, which identifies the version of the XML specification to apply:
XML 1.0 is currently the most common.
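One way to see these rules in action is to hand small fragments to a parser and observe which ones it rejects. The sketch below assumes Python's standard xml.etree.ElementTree module; the helper is_well_formed and the sample fragments are hypothetical.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text, False on a syntax error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# One fragment that follows the rules and several that each break one rule.
samples = {
    "single root":        "<colors><color>red</color></colors>",
    "two roots":          "<color>red</color><color>green</color>",
    "missing end tag":    "<colors><color>red</colors>",
    "overlapping tags":   "<a><b></a></b>",
    "case mismatch":      "<Color>red</color>",
    "unquoted attribute": "<book isbn=34323></book>",
}

results = {name: is_well_formed(text) for name, text in samples.items()}
for name, ok in results.items():
    print(f"{name}: {'well-formed' if ok else 'rejected'}")
```

Only the first fragment is accepted; a conforming parser stops at the first well-formedness error rather than trying to repair the input.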
Element Rules - Rule 1 Single Root Element
All XML documents must have a single root element
Well-formed: colors is the root element for this XML.
<?xml version="1.0"?>
<colors>
<color> red </color>
<color> green </color>
</colors>
Not well-formed: color represents multiple root elements.
<?xml version="1.0"?>
<color> red </color>
<color> green </color>
Element Rules - Rule 2 Element Tag Rules
Elements consist of start and end tags
End tag is identified by the /
Example: <color> red </color>
Elements may contain attributes within the start tag
Example: <book isbn="34323"></book>
Note: The attribute is isbn
Empty elements contain no child elements or data
These elements can be represented with a special shorthand notation
Example:
<record key="123"></record>
Can be shortened to:
<record key="123" /> (preferred)
Or, if the element carries no attributes either, as: <record />
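The long and short forms of an empty element are equivalent to a parser. A minimal sketch, assuming Python's standard xml.etree.ElementTree module:

```python
import xml.etree.ElementTree as ET

long_form = ET.fromstring('<record key="123"></record>')
short_form = ET.fromstring('<record key="123" />')

# Both forms yield an element with the same name and attributes,
# no child elements, and no text.
same = (
    long_form.tag == short_form.tag
    and long_form.attrib == short_form.attrib
    and len(long_form) == 0
    and len(short_form) == 0
    and long_form.text is None
    and short_form.text is None
)
print(same)
```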
Element Rules - Rule 3 Element Nesting
Elements must be properly nested
The end tags of inner elements must occur before the end tags of outer elements
Any number of child elements or data may be nested within the start and end tags of an element
Element Nesting Example
All elements are properly nested:
<?xml version="1.0"?>
<shirt>
  <style> Polo </style>
  <color> red </color>
  <size> large </size>
</shirt>
The element tags are mixed up and not ordered:
<?xml version="1.0"?>
<shirt>
  <style> Polo
  <color> red
  <size> large
  </style>
  </size></color>
</shirt>
Best Practice: Use indentation to represent the document's hierarchy.
Important if your document will likely be read by humans
Computers and programs don't usually care
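Indentation can be added mechanically to a document that was written without it. The sketch below assumes Python 3.9 or later, where the standard xml.etree.ElementTree module provides an indent function; the compact shirt document is adapted from the example above.

```python
import xml.etree.ElementTree as ET

compact = "<shirt><style>Polo</style><color>red</color><size>large</size></shirt>"
root = ET.fromstring(compact)

# ET.indent (Python 3.9+) inserts whitespace so that nesting depth is visible.
ET.indent(root, space="  ")
pretty = ET.tostring(root, encoding="unicode")
print(pretty)
```

The element structure is unchanged; only whitespace between tags is added, which is why programs reading the document do not usually care.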
Element Rules - Rule 4 XML Naming Rules
XML name construction:
The first character must be A-Z, a-z, or _ (underscore)
Any number of subsequent letters, numbers, hyphens,
periods, colons, and underscore characters.
XML names are case sensitive.
Names cannot contain spaces.
Names must not have a prefix of xml in any case combination (such names are reserved).
Best Practice: Brevity in tag names is not necessary.
Use descriptive names for elements and attributes.
<Queue> or <que> is far better than <q>.
Best Practice: Maintain standard naming conventions and quoting.
Camelback, dot, and underscore notation are all common (for example, camelBackNotation, dot.notation, and underscore_notation).
Rule 4 Tag Naming - Samples
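The naming rules can be approximated with a regular expression. This is a sketch of the simplified rules as stated in this unit, not of the full Name production in the XML 1.0 specification (which also admits many non-ASCII letters); the helper is_legal_name and the sample names are hypothetical.

```python
import re

# First character: A-Z, a-z, or underscore. Then any number of letters,
# digits, hyphens, periods, colons, and underscores.
NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_.:\-]*")

def is_legal_name(name):
    """Check a tag or attribute name against the simplified rules of this unit."""
    if name.lower().startswith("xml"):
        return False  # names starting with "xml" in any case are reserved
    return NAME_PATTERN.fullmatch(name) is not None

samples = ["firstName", "first_name", "first.name", "_id",
           "2ndChapter", "first name", "XmlData", "-dash"]
checked = {name: is_legal_name(name) for name in samples}
for name, ok in checked.items():
    print(f"{name!r}: {'legal' if ok else 'illegal'}")
```

The first four sample names pass; the others fail because they start with a digit, contain a space, use the reserved xml prefix, or start with a hyphen.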
Rule 4 Element Content (1 of 2): General
An XML instance is composed of elements expressed in tag pairs (except for empty tags), plus optional attributes that always have quoted values, and optional data that appears between the element start tag and the element end tag
Mixed content - element content that contains data (PCDATA is shown) and other elements
Example (snippet):
<title><ref> XML </ref> Example </title>
<chapter>
Chapter information
<para> What is XML </para>
<para> What is HTML </para>
More chapter information
</chapter>
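How a parser exposes mixed content depends on its API. As one illustration, Python's standard xml.etree.ElementTree module stores text that precedes the first child on the parent's text attribute and text that follows a child's end tag on that child's tail attribute. The sketch below uses the chapter snippet with whitespace removed.

```python
import xml.etree.ElementTree as ET

chapter = ET.fromstring(
    "<chapter>Chapter information"
    "<para>What is XML</para>"
    "<para>What is HTML</para>"
    "More chapter information</chapter>"
)

# Text before the first child element is stored on the parent as .text ...
leading = chapter.text
# ... and text after a child's end tag is stored on that child as .tail.
trailing = chapter[-1].tail
# itertext() walks all the character data in document order.
all_text = "".join(chapter.itertext())

print(leading)
print(trailing)
print(all_text)
```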
Rule 4 Element Content (2 of 2): Data
Element data content is handled in one of two ways:
1. Parsed Character Data (PCDATA): is examined by the XML parser to discover XML content embedded within it
2. Character Data (CDATA): is delimited by the special syntax <![CDATA[ ... ]]> and is not processed by the parser
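Because parsed character data is examined for markup, characters such as < and & must be written as entity references (for example &lt; and &amp;) when they are meant as content. The sketch below assumes Python's standard xml.sax.saxutils and xml.etree.ElementTree modules; the range sample text is illustrative.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

raw = "> 6 & < 20"

# escape() replaces &, < and > with their predefined entity references.
escaped = escape(raw)
print(escaped)

# Quotes are only escaped on request, via the optional entities mapping.
quote_escaped = escape("' \" '", {"'": "&apos;", '"': "&quot;"})
print(quote_escaped)

# Parsing the escaped text yields the original characters again.
parsed = ET.fromstring(f"<range>{escaped}</range>").text
print(parsed)
```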
Rule 4 PCDATA - Parsed Character Data
Predefined entities exist to address ambiguous syntax situations, that is, situations where the literal would be interpreted as part of the XML document syntax rather than its content
Examples:
<range> &gt; 6 &amp; &lt; 20 </range>
<quotes characters="&apos; &quot; &apos;"/>
Entity    Description       Character
&lt;      "less than"       <
&gt;      "greater than"    >
&amp;     "ampersand"       &
&apos;    "apostrophe"      '
&quot;    "quote"           "
Rule 4 CDATA - Character Data
Syntax: <![CDATA[ text ]]>
Note: The text may be anything except the literal string "]]>"; to represent "]]>" in parsed content outside a CDATA section, use "]]&gt;"
CDATA is not parsed and is treated as-is
Useful for embedding other languages within the XML
HTML documents
XML documents
JavaScript source
Or any other text with a lot of special characters
Generally speaking the escaping rules inside a CDATA section are those of the embedded language
For example, an ampersand in JavaScript inside a CDATA section is written as a plain &, not as &amp;
Rule 4 CDATA Examples
These script elements contain JavaScript:
{ return 1 } else
{ return 0 } }
]]></script>
This nameXML element stores actual XML to be treated as text:
<nameXML>
<![CDATA[
<name common="freddy" breed="springer-spaniel">
Sir Frederick of Ledyard's End
</name>
]]>
</nameXML>
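The effect of a CDATA section can be observed by parsing the nameXML example. A minimal sketch, assuming Python's standard xml.etree.ElementTree module:

```python
import xml.etree.ElementTree as ET

doc = """<nameXML><![CDATA[
<name common="freddy" breed="springer-spaniel">
Sir Frederick of Ledyard's End
</name>
]]></nameXML>"""

root = ET.fromstring(doc)

# The markup inside the CDATA section arrives as plain text,
# not as a child element of nameXML.
child_count = len(root)
text = root.text

print(child_count)
print(text.strip())
```

The nameXML element has no children; the embedded name markup is delivered as a character string that the application could parse separately if it wished.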
Element Rules - Rule 5 Element Attributes
Attributes are used to attach information to elements
Attributes consist of a name="value" pair, where the name is a legal XML name
This is often referred to as a "key-value" pair
Attributes are placed in the start tag of the element to which they apply
An element may have several attributes, each uniquely named
Examples:
<title type="section" number="1" >XML overview</title>
<title type="boat" state="FL" >Yacht</title>
Notice the different usage of the attribute "type" in the two elements; semantically they are not the same
Attributes must have a value
Values must be quoted with either double or single quotes
Convention is to stick with one or the other
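The attribute rules above can be checked with a parser. A minimal sketch, assuming Python's standard xml.etree.ElementTree module, which exposes attributes as a dictionary of strings:

```python
import xml.etree.ElementTree as ET

double_quoted = ET.fromstring('<title type="section" number="1">XML overview</title>')
single_quoted = ET.fromstring("<title type='section' number='1'>XML overview</title>")

# Either quote style yields the same name/value pairs; values are always strings.
attrs = double_quoted.attrib
quotes_equivalent = attrs == single_quoted.attrib

# Repeating an attribute name within one start tag is a well-formedness error.
try:
    ET.fromstring('<title type="section" type="boat"/>')
    duplicate_accepted = True
except ET.ParseError:
    duplicate_accepted = False

print(attrs)
print(quotes_equivalent, duplicate_accepted)
```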
Element Rules - Rule 6 XML Declaration (1 of 2)
The XML Declaration is an optional first line in all XML documents:
<?xml version="1.0"?>
<?xml version="1.0" encoding="UTF-8"?>
<?xml version="1.0" standalone="yes"?>
If this declaration is used, the version attribute is mandatory.
The encoding attribute indicates the character encoding used in the
document; if UTF-8 or UTF-16 is used it may be omitted.
ASCII is a subset of UTF-8 and need not be declared.
Comments are not allowed before this statement.
The XML Declaration follows the syntax of a Processing Instruction or PI,
which is described on a subsequent chart, but it is considered to be unique and is treated separately in the 1.0 XML specification.
GENERAL NOTE OF CAUTION: You cannot always rely on a browser or tool to completely or correctly enforce the specifications. Nor are the specifications always written in language that, to a particular reader, is unambiguous. Still, the best advice is: when in doubt, refer to the specification, which for XML is www.w3.org/XML.
Element Rules - Rule 6 XML Declaration (2 of 2)
The standalone attribute is included here for completeness: it indicates whether this XML document depends on information declared externally to this document (in a DTD or XSL file (TBD), for example); its value may be yes or no.
A value of "yes" indicates there are no external markup declarations; if there are no external markup declarations, the declaration has no meaning.
A value of "no" indicates there are or may be such external markup declarations; if there are such declarations but there is no standalone declaration, "no" is assumed, so it is typically not used.
In any event, the inclusion in the XML instance of references to external entities, such as those in an embedded DTD, does not change its standalone status.
A bigger issue associated with the standalone attribute is that of defining or setting values in any entity that may be external to the XML instance.
Arguably, the principal reason for using XML is that it explicitly defines the elements it includes. If attribute values are overridden, then the XML instance before us is no longer declarative.
<!-- --> Defines a comment
A space after the beginning and before the trailing hyphens is recommended but not required
<?xml version="1.0"?>
<!-- This is a comment. They can go anywhere
inside an XML document except within an element tag.
-->
<book>
<!-- Here is another comment -->
</book>
Improper usage:
<chapter <!-- comment --> > Some text </chapter>
or before the XML Declaration statement.
Internationalization and Encoding (1 of 2)
Support for different character encodings is provided through the encoding attribute of the XML Declaration
<?xml version="1.0" encoding="charset"?>
The encoding attribute indicates the set of characters that are permitted in the document
In the absence of an encoding declaration, Unicode UTF-8 or
UTF-16 characters may be used
Documents exchanged via network may be presented to the
processor in an encoding format other than the specified encoding
as long as the transport protocol (for example, HTTP) indicates the encoding used
Internationalization and Encoding (2 of 2)
It is very important that the editor and operating system used to write and save an XML document support the encoding specified in the XML Declaration
Sample encoding declarations:
ASCII (subset of UTF-8)
<?xml version="1.0" encoding="ISO-8859-1"?>
16 bit UNICODE
<?xml version="1.0" encoding="UTF-16"?>
<?xml version="1.0" encoding="ISO-10646-UCS-2"?>
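The link between the encoding declaration and the bytes actually stored can be seen by serializing one document in two encodings. The sketch assumes Python 3.8 or later and its standard xml.etree.ElementTree module; the greeting element and its text are made up for illustration.

```python
import xml.etree.ElementTree as ET

root = ET.Element("greeting")
root.text = "caf\u00e9"  # the word "cafe" with an acute accent on the e

# Serialize with an explicit declaration in two different encodings.
latin1_bytes = ET.tostring(root, encoding="ISO-8859-1", xml_declaration=True)
utf8_bytes = ET.tostring(root, encoding="UTF-8", xml_declaration=True)

# The accented letter is one byte in ISO-8859-1 and two bytes in UTF-8.
print(latin1_bytes)
print(utf8_bytes)

# The parser reads the declaration and decodes each byte stream correctly.
text_from_latin1 = ET.fromstring(latin1_bytes).text
text_from_utf8 = ET.fromstring(utf8_bytes).text
```

If the declared encoding did not match the bytes the editor actually saved, the parser would either report an error or silently produce the wrong characters, which is why editor and declaration must agree.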
Processing Instruction
Syntax <? target arg*?>
Processing Instruction is often abbreviated as PI in
documentation
A feature inherited from SGML
Used to embed application-specific instructions in documents
The target name immediately follows "<?" and is used to associate the PI with an application
May include zero or more arguments
May be preceded by comments
For example, <?xml-stylesheet href="common.css" type="text/css"?>, which is a generally available stylesheet for simple formatting
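An application can read the target and arguments of a processing instruction through a DOM-style API. The sketch below assumes Python's standard xml.dom.minidom module; the book root element is a placeholder.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<?xml version="1.0"?>'
    '<?xml-stylesheet href="common.css" type="text/css"?>'
    '<book/>'
)

# Processing instructions that precede the root element are children of the document node.
pis = [node for node in doc.childNodes
       if node.nodeType == node.PROCESSING_INSTRUCTION_NODE]

target = pis[0].target  # the name that immediately follows "<?"
data = pis[0].data      # everything after the target, passed through as text

print(target)
print(data)
```

The XML Declaration does not appear in the list, which reflects its separate treatment from ordinary processing instructions.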
Well-formed versus Valid
A well-formed XML document:
Consists of XML elements that are nested within one another
Has a unique root element
Follows the XML naming conventions
Follows the XML rules for quoting attributes
Has tags that are properly terminated
All XML parsers check for well-formedness
A valid XML document has an associated vocabulary and obeys the
structural rules specified by that vocabulary
Associated vocabulary is typically defined by either a DTD or an XML Schema
XML parsers may be validating or non-validating depending upon whether or not they can apply an associated grammar
Studio is an example of a tool whose XML capabilities include validation