How XML can be used
XML can keep data separated from your HTML
HTML pages are used to display data. Data is often stored inside HTML pages. With XML, this data can now be stored in a separate XML file. This way you can concentrate on using HTML for formatting and display, and be sure that changes in the underlying data will not force changes to any of your HTML code.
XML can be used to store data inside HTML documents
XML data can also be stored inside HTML pages as Data Islands. You can still concentrate on using HTML for formatting and displaying the data.
XML can be used as a format to exchange information
In the real world, computer systems and databases contain data in incompatible formats. One of the most time-consuming challenges for developers has been to exchange data between such systems over the Internet. Converting the data to XML can greatly reduce this complexity and create data that can be read by different types of applications.
XML can be used to store data in files or in databases
Applications can be written to store and retrieve information from the store, and generic applications can be used to display the data.
XML Example
<?xml version="1.0"?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!!!</body>
</note>
Line-by-line code Explanation
<?xml version="1.0"?>
The XML declaration should always be included. It defines the XML version of the document. In this case, the document conforms to the 1.0 specification of XML.
<note>
Defines the first element (the root element) of the document
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!!!</body>
Define the four child elements of the root (to, from, heading and body)
</note>
The last line defines the end of the root element
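To see the same structure from a program's point of view, here is a minimal sketch that assumes Python and its standard xml.etree.ElementTree module (the tutorial itself does not prescribe a language). The helper name read_note is my own, chosen for illustration only. It parses the note and lists the root element and its four children:

```python
import xml.etree.ElementTree as ET

# The note document from the example above.
NOTE_XML = """<?xml version="1.0"?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!!!</body>
</note>"""


def read_note(xml_text):
    """Return the root tag and a {child tag: child text} mapping.

    Hypothetical helper: it assumes a flat document in which every
    child of the root holds only text.
    """
    root = ET.fromstring(xml_text)
    fields = {child.tag: child.text for child in root}
    return root.tag, fields


root_tag, fields = read_note(NOTE_XML)
print("root element:", root_tag)
for tag, text in fields.items():
    print(f"{tag}: {text}")
```

The root element comes back as note, and the children appear in document order.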
What an XML doc looks like
Let's save the piece of code above as note.xml (by the way, an XML document should have xml as its extension) and open it in IE. Below is what you actually see in the browser.
XML Syntax - General Idea
1. All XML elements must have a closing tag
In HTML some elements do not have to have a closing tag. The following code is legal in HTML:
<p>This is a paragraph
<p>This is another paragraph
In XML all elements must have a closing tag, like this:
<p>This is a paragraph</p>
<p>This is another paragraph</p>
2. XML tags are case sensitive
XML tags are case sensitive. Opening and closing tags must therefore be written with the same case:
<Message>This is incorrect</message>
<message>This is correct</message>
Important: Tag names should begin with a letter, an underscore (_) or a colon (:), followed by some combination of letters, numbers, periods (.), colons, underscores or hyphens (-), but no white space. In addition, no tag name should begin with any form of "xml", since such names are reserved. It is also a good idea not to use a colon as the first character of a tag name, even though it is legal, because it could be confusing. Here are some examples of legal and illegal tags:
Legal tags        Illegal tags
<first-name>      <first - name>
<last.name>       <last name>
<namexml>         <xmlname>
3. All XML elements must be properly nested
In HTML some elements can be improperly nested within each other like this:
<b><i>This text is bold and italic</b></i>
In XML, all elements must be properly nested within each other like this:
<b><i>This text is bold and italic</i></b>
4. All XML documents must have a root tag
All XML documents must contain a single tag pair to define the root element. All other elements must be nested within the root element. All elements can have sub (children) elements. Sub elements must be in pairs and correctly nested within their parent element, e.g.
<root>
<child>
<subchild>
</subchild>
</child>
</root>
5. Attribute values must always be quoted
XML elements can have attributes in name/value pairs, just like in HTML. In XML the attribute value must always be quoted, e.g.
Correct                    Incorrect
<?xml version="1.0"?>      <?xml version="1.0"?>
<note date="25/06/00">     <note date=25/06/00>
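If you would like to try the five rules out rather than take my word for it, the sketch below feeds one offending snippet per rule to a parser. It assumes Python's standard xml.etree.ElementTree module; the is_well_formed helper and the sample labels are my own inventions for illustration, and the "missing closing tag" and "two root elements" samples are not taken from the text above.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False if it rejects it."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# One sample per syntax rule, plus a document that respects all of them.
SAMPLES = {
    "missing closing tag": "<root><p>This is a paragraph</root>",
    "mismatched case": "<Message>This is incorrect</message>",
    "improper nesting": "<b><i>This text is bold and italic</b></i>",
    "two root elements": "<a></a><b></b>",
    "unquoted attribute": "<note date=25/06/00></note>",
    "all rules respected": '<note date="25/06/00"><b><i>ok</i></b></note>',
}

results = {label: is_well_formed(text) for label, text in SAMPLES.items()}
for label, ok in results.items():
    print(f"{label}: {'accepted' if ok else 'rejected'}")
```

Only the last sample is accepted. A conforming XML parser refuses the others outright instead of guessing what you meant, which is exactly the difference from HTML described above.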
Avoid using attributes?
Attributes are handy in HTML, but in XML you should try to avoid them (you can easily substitute elements for attributes, as I will show you later so you get the idea). Why? Here are some of the problems with using attributes:
Attributes cannot contain multiple values
Attributes are not expandable
Attributes are more difficult to manipulate with program code
Attribute values are not easy to test against a DTD
Let me clear up your doubt by looking at the following example:
An XML example
<?xml version="1.0"?>
<note>
<date>12/11/00</date>
<to>Tove</to>
<from>Jani</from>
<subject>Reminder</subject>
<body>Don't forget me this weekend</body>
</note>
If you look at the <date> element above, how do you interpret it? Is this the 12th of November or the 11th of December?
Now, let's see how you can expand the <date> element:
<?xml version="1.0"?>
<note>
<date>
<day>12</day>
<month>11</month>
<year>00</year>
</date>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend</body>
</note>
Got the idea???
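The payoff of splitting the date into elements shows up as soon as a program has to read it. Here is a small sketch, assuming Python's standard xml.etree.ElementTree module. The read_date_parts helper is hypothetical, and the <day> element name is my assumption for the day-of-month part.

```python
import xml.etree.ElementTree as ET

# A trimmed version of the expanded note; <day> is an assumed element name.
NOTE_XML = """<note>
<date>
<day>12</day>
<month>11</month>
<year>00</year>
</date>
<to>Tove</to>
</note>"""


def read_date_parts(xml_text):
    """Return the parts of <date> as a {tag: text} mapping."""
    date = ET.fromstring(xml_text).find("date")
    return {part.tag: part.text for part in date}


parts = read_date_parts(NOTE_XML)
print(parts)
```

Each part is now addressed by name, so there is no guessing about which number is the day and which is the month.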
XML Well-formed
If you have read the XML Syntax - General Idea section above, by now you should have a fair idea of XML in general. So I am going to move on to a more interesting topic: well-formed XML. An XML document is considered well formed if it satisfies three simple rules:
1. The document must contain one or more elements
2. It must contain a uniquely named element, no part of which appears in the content of any other element, known as the root element
3. All other elements within the root element must be correctly nested
So, according to these rules, the following are examples of well-formed documents:
example1.xml
<empty_tag></empty_tag>
example2.xml
<?xml version="1.0"?>
<class>Mammalia</class>
example3.xml
<root>
<class>Mammalia</class>
</root>
example4.xml
<empty_tag/>
Note: example1.xml and example4.xml are the same
The following is an example of a document that is not well formed:
bad_example.xml
<bad_parent>
<naughty_child>Some text info
</bad_parent>
</naughty_child>
Explanation: If you look carefully, you can see that the <naughty_child> element overshoots the end of the <bad_parent> element, which should enclose the <naughty_child> element completely (according to rule 3 above).
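The example files above can be checked mechanically. The following is a sketch only, assuming Python's standard xml.etree.ElementTree module; the describe helper and the constant names are mine, and the documents are inlined as strings instead of being read from the .xml files.

```python
import xml.etree.ElementTree as ET


def describe(xml_text):
    """Summarise the root element, or report the line of the first error."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        # err.position is a (line, column) tuple set by ElementTree.
        return {"well_formed": False, "line": err.position[0]}
    return {
        "well_formed": True,
        "tag": root.tag,
        "text": root.text,
        "children": len(root),
    }


EXAMPLE_1 = "<empty_tag></empty_tag>"
EXAMPLE_2 = '<?xml version="1.0"?>\n<class>Mammalia</class>'
EXAMPLE_3 = "<root>\n<class>Mammalia</class>\n</root>"
EXAMPLE_4 = "<empty_tag/>"
BAD_EXAMPLE = (
    "<bad_parent>\n"
    "<naughty_child>Some text info\n"
    "</bad_parent>\n"
    "</naughty_child>"
)

for name, text in [("example1", EXAMPLE_1), ("example2", EXAMPLE_2),
                   ("example3", EXAMPLE_3), ("example4", EXAMPLE_4),
                   ("bad_example", BAD_EXAMPLE)]:
    print(name, describe(text))
```

The summaries for example1 and example4 come out identical, which backs up the note that the two forms are the same. The bad example is rejected on its third line, where </bad_parent> arrives while <naughty_child> is still open.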
XML doc structure
Physically, documents are composed of a set of entities (we will talk about this topic in a bit) that are identified by unique names. All documents begin with a root or document entity. All other entities are optional.
As opposed to physical structure, XML documents have a logical structure as well. Logically, documents are composed of declarations, elements, comments, character references and processing instructions, all of which are indicated in the document by explicit markup.
Data vs Markup
All XML documents may be understood in terms of the data they contain and the markup that describes that data.
Data is typically "character data" (i.e. anything within the boundaries of valid Unicode, such as letters, numbers, punctuation and so on) but can also be binary data.
Markup includes tags, comments, processing instructions, DTDs, references and so forth.
For example: <name>John Smith</name>
Explanation: The <name> and </name> tags comprise the markup and "John Smith" comprises the character data.
XML Declaration
To begin an XML document, it is a good idea to include the XML declaration on the very first line of the document. Although the XML declaration is optional, the W3C specification (World Wide Web Consortium, the group that developed XML) suggests that we include it to indicate the version of XML used to construct the document, so that an appropriate parser or parsing process can be matched to the document.
Essentially, the XML declaration is a processing instruction that notifies the processing agent (browser) that the following document has been marked up as an XML document. It will look something like the following:
<?xml version="1.0"?>
or, with white space in between, as shown below:
<?xml version = "1.0" ?>
We will talk more about the gory details of processing instructions later; for now we concentrate on explaining how the XML declaration works.
All processing instructions, including the XML declaration, should have the following syntax:
<?name ?>
It must begin with <? and end with ?>. Following the initial <?, you will find the name of the processing instruction, which in this case is "xml".
The XML processing instruction requires that you specify a version attribute and allows you to specify optional encoding and standalone attributes.
In its full regalia, the XML declaration might look like the following:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
The Version Attribute
As we have mentioned before, if you do decide to use the optional XML declaration, you must define the version attribute. As of this writing, the current version of XML is 1.0.
If you include the optional attributes, version must be specified first, followed by encoding and then standalone.
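The ordering rule is easy to check with a parser. The sketch below assumes Python's standard xml.etree.ElementTree module, which sits on top of the Expat parser; the parses helper and the variable names are hypothetical. The documents are passed as bytes so that the parser reads the declaration itself.

```python
import xml.etree.ElementTree as ET


def parses(xml_bytes):
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(xml_bytes)
        return True
    except ET.ParseError:
        return False


# Declarations that follow the rules described above.
version_only = b'<?xml version="1.0"?><note/>'
spaced = b'<?xml version = "1.0" ?><note/>'
full = b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?><note/>'

# Declarations that break them.
version_not_first = b'<?xml encoding="UTF-8" version="1.0"?><note/>'
no_version = b'<?xml standalone="yes"?><note/>'

for label, doc in [("version only", version_only), ("spaced", spaced),
                   ("full", full), ("version not first", version_not_first),
                   ("no version", no_version)]:
    print(f"{label}: {'accepted' if parses(doc) else 'rejected'}")
```

The first three are accepted. The last two are rejected because version is either out of place or missing.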
The STANDALONE Attribute
The standalone attribute specifies whether the document has any markup declarations that are defined in a separate document. Thus, if standalone is set to "yes", the document is effectively self-contained and there are no extra markup declarations in external DTDs. However, setting standalone to "no" leaves the issue open: the document may or may not access external DTDs.
For example:
standalone_yes.xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<book>
<title>Professional XML Design and Implementation</title>
<author>Paul Spencer</author>
<publisher>Wrox</publisher>
<price>$83.95</price>
</book>
standalone_no.xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE book SYSTEM "book.dtd">
<book>
<title>Professional XML Design and Implementation</title>
<author>Paul Spencer</author>
<publisher>Wrox</publisher>
<price>$83.95</price>
</book>
Note: As you can see, standalone="no" means the XML document may use an external DTD. In this case, the book.dtd file is used to validate the document.
The ENCODING Attribute
All XML parsers must support the UTF-8 and UTF-16 Unicode encodings (built on 8-bit and 16-bit units respectively); UTF-8 is compatible with ASCII. However, XML parsers may support a larger set of encodings.
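As a quick illustration, the sketch below encodes the same small document in both required encodings and parses each one back. It assumes Python's standard xml.etree.ElementTree module; the roundtrip helper, the <dish> element and the sample text are all made up for the example.

```python
import xml.etree.ElementTree as ET

# Sample text with characters outside ASCII (an assumption for the demo).
TEXT = "Smörgåsbord"


def roundtrip(encoding):
    """Encode a tiny document with the given encoding and parse it back."""
    doc = f'<?xml version="1.0" encoding="{encoding}"?><dish>{TEXT}</dish>'
    return ET.fromstring(doc.encode(encoding)).text


from_utf8 = roundtrip("UTF-8")
from_utf16 = roundtrip("UTF-16")
print(from_utf8, from_utf16)
```

Both calls give back the original text, even though the bytes handed to the parser differ between the two encodings.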
Character Data
XML defines the text between the start and end tags to be character data and the text within the tags to be markup.
Since "<" and ">" are the reserved characters for the start and end of a tag respectively, character data may be any legal (Unicode) character except a literal "<" (the same goes for "&", which starts an entity reference); ">" is best escaped as well. The following example is incorrect:
<comparison>12 < 13</comparison>
Alternative solution:
<comparison>12 &lt; 13</comparison>
Here is the question you might ask yourself: how am I supposed to know which characters are legal or illegal to use? Well, not to worry. XML provides a handful of useful entity references that you can use:
Character    Entity Reference    Meaning
>            &gt;                Greater than
<            &lt;                Less than
&            &amp;               Ampersand
"            &quot;              Double quote
'            &apos;              Apostrophe (single quote)
Obviously, the &lt; entity reference is useful for character data. The other entity references can be used within markup in cases in which there could be confusion, such as:
<statement value="She said, "Don't go there!"">
Which should be written as:
<statement value="She said, &quot;Don't go there!&quot;">
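You rarely have to type these entity references by hand when a program generates the XML. As a sketch, assuming Python and its standard xml.sax.saxutils helpers escape and quoteattr, the two examples above can be produced like this (the variable names are my own):

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# escape() replaces &, < and > in character data.
comparison = f"<comparison>{escape('12 < 13')}</comparison>"

# quoteattr() adds the surrounding quotes and escapes any quote
# character that would otherwise end the attribute value early.
quote = 'She said, "Don\'t go there!"'
statement = f"<statement value={quoteattr(quote)}/>"

print(comparison)
print(statement)

# Parsing the result gives back the original, unescaped text.
print(ET.fromstring(comparison).text)
print(ET.fromstring(statement).get("value"))
```

The escaped forms are what gets written to the file. The parser turns them back into the plain characters when the document is read.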
By and large, tags make up the majority of XML markup. A tag is pretty much anything between a < sign and a > sign that is not inside a comment or a CDATA section (read on to the next section, please!).
CDATA
CDATA also means character data. CDATA is text that will NOT be parsed by a parser. Tags inside the text will NOT be treated as markup and entities will not be expanded.
As we have said already, it is a pretty good rule of thumb to consider anything outside of tags to be character data and anything inside of tags to be markup. But alas, how am I going to show ">" or any other reserved character in the browser? And worse still, if I decide to have lots of reserved characters show up in the browser, do I have to key in all the funny entity reference symbols?
Of course not. XML provides a wonderful feature for this: the CDATA block, a convenience measure for when you want to include large blocks of special characters as character data.
By including a CDATA block, you tell the XML processor (browser) to treat everything inside the CDATA section just like ordinary character data (that is, all tags and entity references inside it are ignored by the XML processor). Let's say you want to display an XML document in the browser. Without CDATA, you would construct your XML document as follows:
<example>
&lt;document&gt;
&lt;name&gt;Trina Thach&lt;/name&gt;
&lt;email&gt;trina@technomusic.org&lt;/email&gt;
&lt;/document&gt;
</example>
As you can see, you would be forced to use entity references for all the tags. What a mess!
Thus, to avoid the inconvenience of translating all special characters, you can use a CDATA block to specify that everything inside it should be considered character data, whether or not it looks like a tag or entity reference.
Now, allow me to show you how easy it is to apply a CDATA block within an XML document:
<example>
<![CDATA[
<document>
<name>Trina Thach</name>
<email>trina@technomusic.org</email>
</document>
]]>
</example>
See how readable and legible it is?
As you might have guessed, the character string ]]> is not allowed within a CDATA block, as it would signal the end of the CDATA block.
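To convince yourself that a CDATA block and the entity-reference version carry the same character data, you can parse both and compare. This is a minimal sketch assuming Python's standard xml.etree.ElementTree module; the documents are shortened to one line each and the variable names are mine.

```python
import xml.etree.ElementTree as ET

# The same content written three ways.
with_cdata = "<example><![CDATA[<name>Trina Thach</name>]]></example>"
with_entities = "<example>&lt;name&gt;Trina Thach&lt;/name&gt;</example>"
as_markup = "<example><name>Trina Thach</name></example>"

cdata_root = ET.fromstring(with_cdata)
entity_root = ET.fromstring(with_entities)
markup_root = ET.fromstring(as_markup)

# CDATA and entity references both yield plain text and no child elements.
print(repr(cdata_root.text), len(cdata_root))
print(repr(entity_root.text), len(entity_root))

# Without either, <name> is parsed as a real child element.
print(repr(markup_root.text), len(markup_root))
```

In the first two cases the tags arrive as text. In the third, the parser treats <name> as markup and builds a child element from it.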
PCDATA
PCDATA means parsed character data. Think of character data as the text found between the start tag and the end tag of an XML element. PCDATA is the text that will be parsed by a parser. In other words, the tags inside the text will be treated as markup and entities will be expanded.
Comments
Not only will you sometimes want to include tags in your XML document that you want the XML processor to ignore (display as character data), but sometimes you will want to put character data in your document that you want the XML processor to ignore (not display at all). This type of text is called comment text.
In HTML, you specified comments using the <!-- and --> syntax. Well, I have some good news: in XML, comments are done in just the same way! So the following would be a valid XML comment:
<!-- Begin the Names -->
<name>Jim Nelson</name>
<name>Sam Sanger</name>
<name>Les Moore</name>
<!-- End the names -->
When using comments in your XML documents, however, you should keep in mind a couple of rules.
You should not have "--" within the text of your comment, as it might be confusing to the XML processor.
Never ever place a comment within a tag. Thus, the following code would be poorly-formed XML:
<name <!-- The name --> >Peter Williams </name>
Likewise, never place a comment inside of an entity declaration, and never place a comment before the XML declaration, which must always be the first line in any XML document.
Comments can be used to comment out tag sets. Thus, in the following case, all the names will be ignored except for Barbara Tropp.
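Here is a sketch of what commenting out a tag set looks like to a program, assuming Python's standard xml.etree.ElementTree module. Apart from Barbara Tropp, the names and the <names> wrapper element are placeholders I made up for the example.

```python
import xml.etree.ElementTree as ET

# Two <name> elements are wrapped in a comment; one is left active.
# The commented-out names are hypothetical placeholders.
NAMES_XML = """<names>
<!--
<name>Hypothetical Person One</name>
<name>Hypothetical Person Two</name>
-->
<name>Barbara Tropp</name>
</names>"""

root = ET.fromstring(NAMES_XML)
visible = [name.text for name in root.findall("name")]
print(visible)
```

Only the name outside the comment is seen by the application, since the default ElementTree parser discards comments along with everything inside them.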
Processing Instructions
We have already seen a processing instruction: the XML declaration. And if you recall, when we introduced the XML declaration we promised to return to the concept of processing instructions to explain them as a category.
So here we are.
A processing instruction is a bit of information meant for the application using the XML document. That is, processing instructions are not really of interest to the XML parser. Instead, the instructions are passed intact straight to the application using the parser. The application can then pass this on to another application or interpret it itself. All processing instructions follow the generic format of:
<?name_of_application_instruction_is_for instructions?>
As you might imagine, you cannot use any combination of "xml" as the name_of_application_instruction_is_for, since "xml" is reserved. However, you might have something like:
<?JAVA_OBJECT JAR_FILE= "/java/myjar.jar"?>
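To see how such an instruction reaches the application, here is a minimal sketch assuming Python's standard xml.dom.minidom module, which keeps processing instructions as nodes in the document. The processing_instructions helper and the empty <root/> element are my additions for illustration.

```python
from xml.dom import Node, minidom

# The processing instruction from above, followed by a placeholder root.
DOC = '<?JAVA_OBJECT JAR_FILE= "/java/myjar.jar"?><root/>'


def processing_instructions(xml_text):
    """Return (target, data) pairs for top-level processing instructions."""
    doc = minidom.parseString(xml_text)
    return [
        (node.target, node.data)
        for node in doc.childNodes
        if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
    ]


found = processing_instructions(DOC)
for target, data in found:
    print("target:", target)
    print("data:", data)
```

The parser hands over the target name and the rest of the instruction as one untouched string. Making sense of JAR_FILE= "/java/myjar.jar" is left entirely to the application.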
XML Syntax - Entities
Actually, I should have left this topic until we talk about writing valid documents rather than well-formed documents. Nevertheless, some issues make sense within this section, because entities must be well-formed as well as valid. So, what I am going to do is to introduce entities in terms of their