Section 5.1 Introduction
Section 5.2 Basic Tips for Using SAX
Section 5.3 DOM versus SAX
Section 5.4 Summary
Unlike DOM, the SAX specification is not authorized by W3C. SAX was developed through the xml-dev mailing list, the largest community of XML-related developers. The development of SAX was finished in May 1998. SAX 2.0, which introduced namespace support and the feature/property mechanism, was completed in May 2000.
As described in Chapter 2, SAX is an event-based parsing API. Its methods and data structures are much simpler than those of DOM. This simplicity means that application programs based on SAX have to do more work than those based on DOM. On the other hand, SAX-based programs can often achieve high performance.
In this chapter, we describe some tips for using SAX. Then we compare DOM and SAX, and introduce sample programs using DOM and SAX.
In Chapter 2, Sections 2.4 (see Figure 2.2) and 2.4.2 describe the basic concepts of SAX and the programming model for SAX. The concept of SAX is simple. A SAX parser reads an XML document from the beginning, and the parser tells an application what it finds by using the callback methods of ContentHandler or other interfaces.
However, there are some things you should know. We discuss them in this section.
5.2.1 ContentHandler
In this section, we discuss a major trap for beginning users of SAX and the parser feature mechanism, an important feature introduced in SAX2.
Trap of the characters() Events
The characters() method of ContentHandler confuses SAX beginners. Consider the following document:
startDocument()
characters(): "\n Hello,\n XML & Java!\n"
endElement() for the root element
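A SAX parser is free to deliver one run of text in several characters() calls, for example splitting it at an entity reference such as &amp;. The following is a minimal sketch of the usual defence, which is to append every chunk to a buffer and read the buffer only when the text is complete. The class name TextCollector and its helper methods are hypothetical and are not part of the book's listings.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects all character data of a document in one buffer.
class TextCollector extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private int chunks = 0;

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may call this several times for one run of text,
        // so we only append here and never treat a single call as complete.
        buffer.append(ch, start, length);
        chunks++;
    }

    String text() { return buffer.toString(); }

    int chunks() { return chunks; }

    // Parses the given XML string and returns the handler that saw its events.
    static TextCollector collect(String xml) throws Exception {
        TextCollector handler = new TextCollector();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        TextCollector c = collect("<root>\n Hello,\n XML &amp; Java!\n</root>");
        System.out.println(c.chunks() + " characters() call(s)");
        System.out.println(c.text());
    }
}
```

The number of characters() calls depends on the parser, which is exactly why a handler should not rely on it.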
Stack context;
public TextMatch(String pattern) {
this.buffer = new StringBuffer();
}
public void processingInstruction(String target, String data) throws SAXException {
// Nothing to do because PI does not affect the meaning
// of a document.
}
public void startElement(String uri, String local,
try {
XMLReader xreader = XMLReaderFactory.createXMLReader( "org.apache.xerces.parsers.SAXParser");
TextMatch finds "XML & Java" in the book element, the
NAMESPACE-NS URI/LOCAL NAME QUALIFIED NAME CALLS *PrefixMapping()
-true true x x
Basically, you need not disable the namespace feature. Turn it off only when the slight overhead of this feature is unacceptable. Turn on the namespace-prefix feature if you need qualified names or namespace declarations as attributes.
According to the JAXP specification, a SAX parser created by SAXParserFactory is not namespace-aware by default. In the JAXP implementation of Xerces, SAXParserFactory.setNamespaceAware() affects the setting of the namespace feature. As for Crimson in the JAXP 1.1 reference implementation, SAXParserFactory.setNamespaceAware() seems to affect neither the namespace feature nor the namespace-prefix feature. We recommend that you always get an XMLReader instance by using SAXParser.getXMLReader() and that you set these features explicitly.
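The recommendation above can be sketched as follows. This is only an illustration, assuming a JAXP parser that recognizes the two standard SAX2 feature URIs; the class name ExplicitFeatures and the method newReader() are hypothetical.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;

// Hypothetical helper: obtains an XMLReader through JAXP and sets the
// namespace-related features explicitly instead of relying on defaults.
class ExplicitFeatures {
    // Standard SAX2 feature identifiers.
    static final String NAMESPACES =
        "http://xml.org/sax/features/namespaces";
    static final String NAMESPACE_PREFIXES =
        "http://xml.org/sax/features/namespace-prefixes";

    static XMLReader newReader() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        // Set both features explicitly so the behavior does not depend on
        // how a particular JAXP implementation maps setNamespaceAware().
        reader.setFeature(NAMESPACES, true);
        reader.setFeature(NAMESPACE_PREFIXES, false);
        return reader;
    }

    public static void main(String[] args) throws Exception {
        XMLReader reader = newReader();
        System.out.println("namespaces: " + reader.getFeature(NAMESPACES));
        System.out.println("namespace-prefixes: "
            + reader.getFeature(NAMESPACE_PREFIXES));
    }
}
```

Reading the features back with getFeature() is a cheap way to confirm what a given parser actually does.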
5.2.2 Using and Writing SAX Filters
A SAX filter receives SAX events from a SAX parser, modifies these events, and forwards them to a handler, as shown in Figure 5.1. As far as the SAX parser is concerned, the SAX filter can be seen as a handler. On the other hand, as far as the handler is concerned, the SAX filter can be seen as a SAX parser.
Figure 5.1 SAX filter
interface for SAX parsers
Typical uses of SAX filters are the following.
Modifying XML documents
When you write a program for modifying XML documents, you might want to reuse XMLSerializer for serializing SAX events to an XML document. Then you only have to write a SAX filter that modifies SAX events, and insert the filter between a SAX parser and XMLSerializer.
implementing a SAX filter that concatenates consecutive
characters() events
Suppose that you want to use two handlers for a single XML document at the same time. Unfortunately, you cannot register two or more handlers of the same type with one XMLReader instance. So you implement a handler as a SAX filter (see Figure 5.2), or you make a filter that accepts the registration of two handlers and duplicates the input events (see Figure 5.3).
Figure 5.2 A handler performs as a filter.
Figure 5.3 A filter duplicates events.
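The arrangement of Figure 5.3 can be sketched with the standard XMLFilterImpl helper class. This is a minimal illustration under our own assumptions, not the book's implementation: the names TeeFilter and ElementCounter are hypothetical, and only a few ContentHandler events are duplicated.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter: passes each event to the regularly registered
// ContentHandler (through XMLFilterImpl) and to one extra handler.
class TeeFilter extends XMLFilterImpl {
    private final ContentHandler second;

    TeeFilter(XMLReader parent, ContentHandler second) {
        super(parent);
        this.second = second;
    }

    @Override
    public void startElement(String uri, String local, String qname,
                             Attributes atts) throws SAXException {
        super.startElement(uri, local, qname, atts);
        second.startElement(uri, local, qname, atts);
    }

    @Override
    public void endElement(String uri, String local, String qname)
            throws SAXException {
        super.endElement(uri, local, qname);
        second.endElement(uri, local, qname);
    }

    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        super.characters(ch, start, length);
        second.characters(ch, start, length);
    }

    // Parses xml once and reports how many elements each handler saw.
    static int[] countTwice(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        ElementCounter first = new ElementCounter();
        ElementCounter extra = new ElementCounter();
        TeeFilter tee = new TeeFilter(factory.newSAXParser().getXMLReader(), extra);
        tee.setContentHandler(first);
        tee.parse(new InputSource(new StringReader(xml)));
        return new int[] { first.count, extra.count };
    }
}

// Hypothetical handler used only to observe the duplicated events.
class ElementCounter extends DefaultHandler {
    int count = 0;

    @Override
    public void startElement(String uri, String local, String qname,
                             Attributes atts) {
        count++;
    }
}
```

A complete filter would duplicate the remaining ContentHandler methods in the same way.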
A typical code fragment for using a SAX parser follows.
XMLReader parser = XMLReaderFactory.createXMLReader(); // or parser = new SAXParser() if you use Xerces.
interface by adding getParent() and setParent(). The
Listing 5.3 is an example of a SAX filter. It replaces elements like <email>foo@example.com</email> with
*/
public void startElement(String uri, String local, String qname, Attributes atts)
return null. So you have to check whether the next handler is
</addresses>
5.2.3 New Features of SAX2
In this section, we summarize the new features of SAX2 for developers who have experience with SAX1.
Namespace support
SAX1 was finalized before the "Namespaces in XML" specification became a W3C Recommendation, so SAX1 has no namespace support. With SAX2, applications can receive namespace information as described in Section 5.2.1.
SAX filters
SAX1 has no interface for filters, though we can write filters without such an interface. SAX2 introduced a standard XMLFilter interface. It makes writing and using filters easier.
More information about an XML document
With SAX1, applications can learn nothing about comments, CDATA sections, and many types of declarations in DTDs. SAX2 supports them with new interfaces.
Feature/property mechanism
SAX2 provides a generic mechanism to enable or disable the features of SAX parsers and to set or get extra information.
Name changes to classes and interfaces
Some interfaces of SAX1 were made obsolete by SAX2. We recommend using the SAX2 interfaces even if you don't need the new features of SAX2. Table 5.2 summarizes the name changes.
Table 5.2 Interface Changes between SAX1 and SAX2
SAX1               SAX2               Purpose
Parser             XMLReader          Support of new interfaces
ParserFactory      XMLReaderFactory   Support of new interfaces
DocumentHandler    ContentHandler     Support of namespace
HandlerBase        DefaultHandler     Support of new interfaces
AttributeList      Attributes         Support of namespace
AttributeListImpl  AttributesImpl     Support of new interfaces
N/A                DeclHandler        Receive declarations in DTDs
N/A                LexicalHandler     Receive lexical information such as comments and CDATA sections
N/A                XMLFilter          New filter interface
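As an illustration of the property mechanism and the LexicalHandler interface listed above, the following sketch registers a handler through the standard lexical-handler property and records the comments and CDATA sections it is told about. It assumes that the parser supports this property. The class name LexicalEvents and the method scan() are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical handler: DefaultHandler2 supplies empty implementations of
// ContentHandler and LexicalHandler, so we override only what we need.
class LexicalEvents extends DefaultHandler2 {
    // Standard SAX2 property identifier for registering a LexicalHandler.
    static final String LEXICAL_HANDLER =
        "http://xml.org/sax/properties/lexical-handler";

    final List<String> comments = new ArrayList<>();
    int cdataSections = 0;

    @Override
    public void comment(char[] ch, int start, int length) {
        comments.add(new String(ch, start, length));
    }

    @Override
    public void startCDATA() {
        cdataSections++;
    }

    static LexicalEvents scan(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        LexicalEvents handler = new LexicalEvents();
        reader.setContentHandler(handler);
        // A LexicalHandler is registered as a property, not through a
        // dedicated setter; a parser without support throws an exception.
        reader.setProperty(LEXICAL_HANDLER, handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler;
    }
}
```

If setProperty() throws SAXNotRecognizedException, the parser does not offer lexical events and the application has to do without them.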
We discussed the basic concepts of DOM and tips for using DOM in Chapter 4 and discussed those of SAX in the previous section. In Section 2.4.3, we discussed points for deciding whether to use DOM or SAX. In this section, we compare the performance of DOM and SAX and study conversion between DOM and SAX.
5.3.1 Performance: Memory and Speed
In this section, we compare the performance of DOM and SAX based on memory usage and on parsing speed.
Memory Usage
First, we compare the memory usage of DOM and SAX. We can guess that SAX uses less memory than DOM.
We use the XML document shown in Listing 5.4. Its size is 348 bytes.
public static void main(String[] argv) throws Exception {
String xml = argv[0];
"Deferred DOM," we call DocumentImpl without deferred DOM "Non-deferred DOM," and we call CoreDocumentImpl "Core
R:\samples> java chap05.MemoryUsageDOM org.apache.xerces.dom.CoreDocumentImpl false file:./chap05/memtest10.xml
104776,155584,278472,280792,324416,327032,329664,291320,334944,337560,340192,301848
document. The second invokes Non-deferred DOM and uses about 2.62KB for one document. The third invokes Core DOM and uses about 2.60KB for one document.
Figure 5.5 shows the memory usage of SAX, Deferred DOM, Non-deferred DOM, and Core DOM.
Figure 5.5 Memory usage for SAX and DOM implementations
For Non-deferred DOM or Core DOM, the amount of memory used increases in proportion to the number of nodes in a document. For Deferred DOM, the amount of memory used is not proportional. It does not use 220KB for a document twice as large. Table 5.3 shows the memory usage for documents containing 10, 100, 200, 300, 400, or 500 child nodes.
This result indicates that Deferred DOM wastes a lot of memory. In fact, Deferred DOM defers creating DOM nodes in order to improve not memory performance but parsing speed. In general, object creation in Java costs much time, and reducing object creation (new operators) is very effective for improving
public class SpeedTest {
"http://apache.org/xml/properties/dom/document-class-name";
static final String FEATURE_DEFER =
domp.setProperty(PROP_DOC,
"org.apache.xerces.dom.CoreDocumentImpl");
for (int i = -1; i < n; i++) {
Non-deferred DOM: 12748ms
Core DOM: 12648ms
R:\samples> java chap05.SpeedTest 500 true file:./chap05/memtest500.xml
SAX: 11036ms
Because a serializer accesses all nodes in the DOM tree, all nodes are eventually created even when Deferred DOM does not create them during parsing. In fact, Deferred DOM is the slowest in parsing combined with serialization.
5.3.2 Conversion from DOM to SAX and Vice Versa
As described earlier, the runtime performance of SAX is always better than that of DOM. However, developing an application with SAX alone is hard work. Converters from DOM to SAX and vice versa would be useful.
In this section, we introduce DOMReader, which generates SAX events from a DOM tree, and DOMConstructor, which creates a DOM tree from SAX events.
DOMReader traverses an input DOM tree and generates corresponding SAX events. It is derived from XMLReader, which is the SAX parser interface, because it generates SAX events. However, the input to DOMReader is a DOM node, though the input to XMLReader is InputSource or a URI. Thus, DOMReader ignores the parameters of the parse() method and receives the input DOM via the setProperty() method.
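The traversal idea can be sketched without the XMLReader wrapper. The code below is not the book's DOMReader. It is a simplified, hypothetical walker (DomWalker) that calls a ContentHandler directly, ignores comments and entity references, passes namespace declarations through as ordinary attributes, and makes no startPrefixMapping() calls.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical walker: generates SAX events from a DOM node, depth first.
class DomWalker {
    // SAX uses "" where DOM uses null for a missing URI or local name.
    private static String nz(String s) { return s == null ? "" : s; }

    static void walk(Node node, ContentHandler h) throws SAXException {
        switch (node.getNodeType()) {
            case Node.DOCUMENT_NODE:
                h.startDocument();
                walkChildren(node, h);
                h.endDocument();
                break;
            case Node.ELEMENT_NODE: {
                AttributesImpl atts = new AttributesImpl();
                NamedNodeMap map = node.getAttributes();
                for (int i = 0; i < map.getLength(); i++) {
                    Node a = map.item(i);
                    atts.addAttribute(nz(a.getNamespaceURI()), nz(a.getLocalName()),
                                      a.getNodeName(), "CDATA", a.getNodeValue());
                }
                String uri = nz(node.getNamespaceURI());
                String local = nz(node.getLocalName());
                h.startElement(uri, local, node.getNodeName(), atts);
                walkChildren(node, h);
                h.endElement(uri, local, node.getNodeName());
                break;
            }
            case Node.TEXT_NODE:
            case Node.CDATA_SECTION_NODE: {
                char[] chars = node.getNodeValue().toCharArray();
                h.characters(chars, 0, chars.length);
                break;
            }
            case Node.PROCESSING_INSTRUCTION_NODE:
                h.processingInstruction(node.getNodeName(), node.getNodeValue());
                break;
            default:
                break; // other node types are skipped in this sketch
        }
    }

    private static void walkChildren(Node parent, ContentHandler h)
            throws SAXException {
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, h);
        }
    }

    // Builds a DOM tree from xml, walks it, and returns a trace of the events.
    static String trace(String xml) throws Exception {
        DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
        f.setNamespaceAware(true);
        Node doc = f.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        List<String> events = new ArrayList<>();
        walk(doc, new DefaultHandler() {
            @Override
            public void startElement(String u, String l, String q, Attributes a) {
                events.add("start(" + q + ")");
            }
            @Override
            public void endElement(String u, String l, String q) {
                events.add("end(" + q + ")");
            }
            @Override
            public void characters(char[] ch, int s, int len) {
                events.add("text(" + new String(ch, s, len) + ")");
            }
        });
        return String.join(" ", events);
    }
}
```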
The core of DOMReader is the processNode() method, which generates corresponding SAX events from various types of DOM nodes. It is not difficult to understand this method if you are familiar with DOM and SAX. Because there are no ways to
// text is ignorable or not.
chars = node.getNodeValue().toCharArray();
this.chandler.characters(chars, 0, chars.length);
break;
xreader.setContentHandler(mon);
characters: length=11 '\n aaa\n '
programming models of DOM and SAX
Because both normal character data and CDATA sections are represented by characters() events, we cannot distinguish CDATA sections from other character data by examining characters() events only. To distinguish CDATA sections, we have to know whether or not a parser is processing a CDATA section by checking startCDATA() and endCDATA(). As for entity references, we also have to check startEntity() and endEntity() to know whether or not a parser is processing an entity reference.
Methods such as startCDATA(), startEntity(), and comment() are methods of the LexicalHandler interface. So they are not called if a SAX parser does not support LexicalHandler or an application does not register a DOMConstructor instance as a LexicalHandler with the SAX parser. In this case,
No Comment nodes are generated.
No EntityReference nodes are created, and the contents of entity references are appended directly.
Text nodes are generated instead of CDATASection nodes.
These differences do not change the meaning of an XML document, though they change its lexical representation.
The type of an output node of DOMConstructor depends on the input SAX events. We get a Document node if the input SAX events start with startDocument() and end with endDocument(). Meanwhile, we get an Element node if the input SAX events start with startElement() and end with endElement(). To convert part of an XML document to a DOM tree, you can create a SAX filter to discard unnecessary events. See Listing 5.10.
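Independently of the book's DOMConstructor, the buffering and CDATA tracking described above can be sketched in a much smaller handler. MiniDomBuilder is a hypothetical name. The sketch always builds a Document, ignores entity boundaries and processing instructions, and assumes that the parser supports the lexical-handler property.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical builder: turns SAX events into DOM nodes under a Document.
class MiniDomBuilder extends DefaultHandler2 {
    private final Document factory;   // creates nodes and is the tree root
    private Node current;             // node that receives new children
    private final StringBuilder buffer = new StringBuilder();
    private boolean inCdata = false;

    MiniDomBuilder(Document factory) {
        this.factory = factory;
        this.current = factory;
    }

    // Turns the buffered characters into one Text or CDATASection node.
    private void flushText() {
        if (buffer.length() == 0) return;
        String text = buffer.toString();
        buffer.setLength(0);
        current.appendChild(inCdata ? factory.createCDATASection(text)
                                    : factory.createTextNode(text));
    }

    @Override
    public void startElement(String uri, String local, String qname, Attributes atts) {
        flushText();
        // Fall back to the local name if the parser reports no qualified name.
        String name = qname.isEmpty() ? local : qname;
        Element elem = factory.createElementNS(uri.isEmpty() ? null : uri, name);
        for (int i = 0; i < atts.getLength(); i++) {
            String aname = atts.getQName(i).isEmpty() ? atts.getLocalName(i)
                                                      : atts.getQName(i);
            String auri = atts.getURI(i);
            elem.setAttributeNS(auri.isEmpty() ? null : auri, aname, atts.getValue(i));
        }
        current.appendChild(elem);
        current = elem;
    }

    @Override
    public void endElement(String uri, String local, String qname) {
        flushText();
        current = current.getParentNode();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length);
    }

    // LexicalHandler events: called only if the handler is registered as one.
    @Override
    public void startCDATA() { flushText(); inCdata = true; }

    @Override
    public void endCDATA() { flushText(); inCdata = false; }

    @Override
    public void comment(char[] ch, int start, int length) {
        flushText();
        current.appendChild(factory.createComment(new String(ch, start, length)));
    }

    static Document build(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder().newDocument();
        MiniDomBuilder builder = new MiniDomBuilder(doc);
        SAXParserFactory f = SAXParserFactory.newInstance();
        f.setNamespaceAware(true);
        XMLReader reader = f.newSAXParser().getXMLReader();
        reader.setContentHandler(builder);
        reader.setProperty("http://xml.org/sax/properties/lexical-handler", builder);
        reader.parse(new InputSource(new StringReader(xml)));
        return doc;
    }
}
```

Without the setProperty() call the same input yields only Text nodes and no Comment node, which matches the three effects listed above.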
Listing 5.10 Convert SAX events to a DOM tree, chap05/DOMConstructor.java
this.factory = factory;
protected void flushText() {
if (this.buffer == null || this.buffer.length() == 0) return;
String text = new String(this.buffer);
if (this.inCdata) {
}
public void processingInstruction(String target, String data) throws SAXException {
this.flushText();
ProcessingInstruction pi;
pi = this.factory.createProcessingInstruction(target, data);
this.output(pi);
Element elem = this.factory.createElementNS(uri, qname);
for (int i = 0; i < atts.getLength(); i++) {
}
public void endEntity(String name) throws SAXException {
The program SAX2DOM (see Listing 5.11) is an example of converting SAX events to a DOM tree with DOMConstructor. It
The result of running SAX2DOM follows. We can see that two identical tree structures are created.
R:\samples>java chap05.SAX2DOM file:./chap05/nstest.xml
- DOM -
#text
In this chapter, we discussed some tips for using SAX and SAX filters. Then we compared the performance of DOM and SAX and described converting from DOM to SAX and from SAX to DOM.
In Chapter 6, we provide general tips on using XML processors and Xerces and discuss the new Xerces2 architecture.
Section 2.5 Summary
In the previous sections, we showed how to read and parse an XML document. Next, we explain how to process an XML document by accessing its internal structure through APIs.
The XML 1.0 Recommendation defines the precise behavior of an XML processor when reading and parsing a document, but it says nothing about which API to use. In this section, we discuss two widely used APIs.
The Document Object Model (DOM), a tree structure–based API by W3C. The specification consists of Level 1 (Recommendation in October 1998), Level 2 (Recommendation in November 2000), and Level 3 (currently a Working Draft) documents. Xerces 1.4.3 supports most of DOM Level 2.
The Simple API for XML (SAX), an event-driven API developed by David Megginson and a number of people on the xml-dev mailing list. Although not sanctioned by any standards body, SAX is supported by most of the available XML processors. Xerces 1.4.3 supports SAX and SAX2, which supports namespaces. In this book, the word "SAX" refers to SAX (version 1.0) and SAX2.
Figure 2.2 depicts the difference between the DOM and SAX APIs. When an application uses a DOM-based parser, the parser parses an XML document and passes a Document instance to the application. The application has to wait until the parser has parsed the whole XML document. When an application uses a SAX-based parser, the parser starts parsing an XML document and passes an event stream to the application in the course of parsing. The next sections discuss in detail the pros and cons of using these APIs.
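The difference can be made concrete with a small sketch that counts elements of a given name in both styles, using only the JAXP classes bundled with Java. The class name DomVersusSax and its two methods are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical comparison of the two programming models.
class DomVersusSax {
    // DOM: the whole tree is built first, then the application queries it.
    static int countWithDom(String xml, String name) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        return doc.getElementsByTagName(name).getLength();
    }

    // SAX: the application is called back while parsing is still going on.
    static int countWithSax(String xml, String name) throws Exception {
        int[] count = { 0 };
        SAXParserFactory.newInstance().newSAXParser().parse(
            new InputSource(new StringReader(xml)),
            new DefaultHandler() {
                @Override
                public void startElement(String uri, String local,
                                         String qname, Attributes atts) {
                    if (qname.equals(name)) count[0]++;
                }
            });
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        String xml = "<list><item/><item/><other/></list>";
        System.out.println("DOM: " + countWithDom(xml, "item"));
        System.out.println("SAX: " + countWithSax(xml, "item"));
    }
}
```

Both methods return the same count. The SAX version never holds the document in memory, while the DOM version can be queried again after parsing.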
Figure 2.2 DOM versus SAX
In SAX2, some interfaces have been changed and renamed to support namespaces. Xerces supports both the SAX and SAX2 APIs, but the old SAX interfaces are now deprecated.
2.4.1 DOM: Tree-Based API
document.forms(1).username.value refers to the value of the input field with the name username in the first form element in an HTML document. This expression is used to access the HTML DOM on HTML browsers like Microsoft Internet Explorer (IE) and Netscape Navigator.
However, current HTML object models and APIs to access them are browser-dependent (though the problem is being resolved). Thus you generally should prepare different pages suited for each type of browser that might execute your scripts. One goal of the DOM specification is to define a common, interoperable document object model for HTML as well as XML. The first edition of this book is based on the DOM Level 1 Recommendation. The DOM Level 2 Recommendation was published on November 13, 2000. Handling of namespaces, events, traversal and range, and views was introduced in DOM Level 2. Standardization of DOM Level 3 is in progress. It will support load and save functions and other new functions. The details of using the DOM API are discussed in Chapter 4.
In DOM, an XML document is represented as a tree whose nodes are elements, text, and so on. An XML processor generates the tree and hands it to an application. A DOM-based XML processor (for example, DOMParser or DocumentBuilder) creates the entire structure of an XML document in memory (though Xerces defers the creation of DOM nodes until they are accessed).
XML is a language for describing tree-structured data. In XML, an element is represented by a start tag and a matching end tag (or an empty-element tag). An element may contain one or