Copyright 2001, ActiveState
Python and XML
About me
• Paul Prescod (paul@activestate.com)
• ActiveState Senior Developer
• Co-Author, XML Handbook
What is Python?
• Python is an easy-to-learn, powerful programming language
– Efficient high-level data structures
– Simple approach to object-oriented programming
– Elegant syntax and dynamic typing
Brief History of Python
• CWI, early 90s
• Dynamic Object Oriented High Level Language
• More than a text processing language
• More than a scripting language
• Scalable and object oriented from the beginning
• Dynamically type checked
Python's business case
• Python can displace many other languages in the organization
• The Python interpreter is free
• Python is legally unencumbered
• Professional programmers find Python more flexible than most languages
• Amateur programmers are (often) more comfortable with Python than with Perl or Java
Usability features
• Exceptionally clear syntax
• Provides an obvious way to do most things
• Small set of features combine in powerful ways
• Only innovative where innovation is really necessary
More Usability features
• Huge amount of free code and libraries
• Interactive
• Designed to talk to the world
• Runs on Unix, Mac and Windows
• Integrates with the JVM (Jython) and the .NET Framework (Python.NET)
• Talks MS COM, XPCOM, CORBA, SOAP, XML-RPC, …
Scalability features
• Simple but powerful module system
• Simple but powerful class system
• Structured, standardized exceptions
Extendable
• New data types in Python or C
• Modules in Python or C
• Functions in Python or C
Python isn't picky!
Compared to Java
• Java is more difficult for amateur programmers
• Static type checking can be inconvenient in text processing
• Puritanical OO can be inconvenient
• Bottom line: Java can make simple projects harder
Why not Java: political
• "100% pure Java" gets in the way
• The Java environment punishes interoperability (e.g. getenv is
Jython (née JPython)
• Compiles Python classes to Java classes
• Embedded interpreter allows interactive coding
• Access to all Java classes
• For better or worse: maintains Java's security/platform-independence bubble
Jython can use Java tools
• Raw text searching is not as fast as Perl.
• Dynamic type checking requires more care in testing.
Python "Hello world"
print "Hello, World"
Python interpreter
• Just type:
C:\> python
Python 1.5.2 (#0, Apr 13 1999, 10:51:12) [MSC 32 bit (Intel)] on win32
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> print "Hello, World"
• .py files get a .pyc in the same directory
• When the .py is updated, the .pyc is updated
Interpreters
• DOS/Win32 (last slide)
• Unix (use ^D to exit)
• Graphical: “IDLE”, “PythonWin”
Numeric types
• int: 32 bit, e.g. "x=5"
• long: arbitrary sized, e.g. "x=2L**128"
• float: accuracy depends on platform, e.g. "x=3.14"
• complex: real+imag., e.g. "x=5.3+3.2j"
Sequence types:
• Strings: "abcd"
• Tuples: (1,2,"b")
• Lists: [1,"a",3]
Sequence types: string
myStr = "abc" # assignment
myStr = myStr + "def" # = "abcdef"
otherstr = myStr[ 1 : 4 ] # = "bcd"
Sequence types: lists
myList = ["a", 5 , 3.25 , 2L , 4 + 3j ]
anotherList = ["a",myList, ["3","2"]]
anotherList2 = myList + myList
# = ["a",5,...,"a",5,...]
yetAnotherList = myList[ 1 : 3 ]
# = [5,3.25]
Iterating over sequences
strlist = ["abc", "def", "ghi"]
for item in strlist:
    for char in item:
        print char
Getting the length
• The len() function gets a sequence's length
>>> len( "abc" )
3
>>> len( ["abc","def"] )
2
Traceback (innermost last):
File "<stdin>", line 1, in ?
TypeError: object doesn't support item assignment
Dictionaries
• Serve as a lookup table
• Maps "keys" to "values"
• Keys can be of any immutable type
• Assignment adds or changes members
• keys() method returns keys
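The points above can be sketched in a few lines. This is an illustration only: the name `greek` and its contents are hypothetical, and the code is written for a current Python 3 interpreter rather than the Python 1.5/2 syntax used on the slides.

```python
# Hypothetical lookup table; keys and values are illustrative only.
greek = {"a": "alpha", "b": "beta"}

greek["g"] = "gamma"   # assignment adds a new member
greek["b"] = "BETA"    # assignment to an existing key changes its value

# Keys can be of any immutable type, for example a tuple.
greek[(1, 2)] = "a tuple key"

# keys() returns the keys; here only the string keys are listed, sorted.
string_keys = sorted(k for k in greek.keys() if isinstance(k, str))
print(string_keys)
print(greek["a"])
```

A list could not be used as a key here, because lists are mutable.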
'a' : 'alpha' }
>>> dict.clear()
>>> print dict
{}
File Objects
• Represent opened files:
myFile = open( "catalog.txt", "r" )
data = myFile.read()
myFile.close()   # release the input file
myFile = open( "catalog2.txt", "w" )
data = data + "more data"
myFile.write( data )
myFile.close()   # make sure the output reaches the disk
Function definitions
• Encapsulate bits of code
• Can take a fixed or variable number of arguments
• Arguments can have default values
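A minimal sketch of these three points follows. The function names `greet` and `total` are hypothetical, and the code targets Python 3 rather than the Python 1.5/2 of the slides.

```python
def greet(name, greeting="Hello"):
    """Fixed number of arguments; 'greeting' has a default value."""
    return "%s, %s" % (greeting, name)


def total(*numbers):
    """Variable number of arguments, collected into the tuple 'numbers'."""
    result = 0
    for n in numbers:
        result = result + n
    return result


print(greet("World"))              # default greeting is used
print(greet("World", "Goodbye"))   # default is overridden
print(total(1, 2, 3))
```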
Functions are objects
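As a rough illustration of what this means in practice, a function can be bound to another name, stored in a data structure, or passed to another function. All names below are hypothetical, and the sketch is written for Python 3.

```python
def shout(text):
    return text.upper() + "!"


def apply_twice(func, value):
    # 'func' is an ordinary argument that happens to be a function object
    return func(func(value))


speak = shout                 # bind the same function object to a second name
handlers = {"loud": shout}    # store it in a dictionary

print(speak("hi"))
print(apply_twice(shout, "hi"))
print(handlers["loud"]("hello"))
```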
Flow Control Statements
• if/elif/else
• while
• for
• try
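The four statements can be sketched as below. The helper names are hypothetical, and the code assumes Python 3.

```python
def classify(numbers):
    labels = []
    for n in numbers:              # for: iterate over a sequence
        if n < 0:                  # if / elif / else
            labels.append("negative")
        elif n == 0:
            labels.append("zero")
        else:
            labels.append("positive")
    return labels


def countdown(start):
    steps = []
    while start > 0:               # while: loop until the test fails
        steps.append(start)
        start = start - 1
    return steps


def safe_int(text):
    try:                           # try: recover from an expected error
        return int(text)
    except ValueError:
        return None


print(classify([-1, 0, 5]))
print(countdown(3))
print(safe_int("42"), safe_int("forty-two"))
```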
Exception handling
• Python exception handling is like Java/C++
• Errors are reported in tracebacks
• Exceptions propagate up
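A small sketch of propagation, assuming Python 3. The exception class `CatalogError` and the three functions are hypothetical names chosen for the example.

```python
class CatalogError(Exception):
    """Hypothetical application-specific exception."""


def lookup(catalog, key):
    if key not in catalog:
        raise CatalogError("no entry for %r" % key)
    return catalog[key]


def describe(catalog, key):
    # No try/except here, so a CatalogError propagates up to the caller.
    return "Item: " + lookup(catalog, key)


def report(catalog, key):
    try:
        return describe(catalog, key)
    except CatalogError as exc:
        return "Error: %s" % exc


print(report({"a": "alpha"}, "a"))
print(report({"a": "alpha"}, "z"))
```

If `report` did not catch the exception either, it would reach the top level and the interpreter would print a traceback.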
Exception traceback
Traceback (innermost last):
File "test.py", line 10, in ?
Classes
• Classes combine code and data.
• They represent real world objects.
• We create "instance objects" from classes.
• Closest languages in terms of object model are Smalltalk or Ruby.
• Much more flexible than Java or C++
• More central to the language than Perl/Tcl/PHP.
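A minimal sketch of a class and an instance object, assuming Python 3. The class `Bookmark` and its attributes are hypothetical. The last assignment shows one aspect of the flexibility mentioned above: an instance can gain attributes at run time.

```python
class Bookmark:
    """Combines data (title, url) with code (describe)."""

    def __init__(self, title, url):
        self.title = title
        self.url = url

    def describe(self):
        return "%s <%s>" % (self.title, self.url)


mark = Bookmark("Python", "http://www.python.org/")   # an instance object
mark.visits = 3        # attribute added to this one instance at run time

print(mark.describe())
print(mark.visits)
```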
Inheritance
• Classes can specify a base class.
• The new class "inherits" methods and data.
• The new class can
– "override" methods.
– add data and methods.
• Multiple Inheritance is okay
• All methods are virtual.
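These rules can be illustrated with a short sketch, assuming Python 3. The classes `Node`, `Timestamped` and `Element` are hypothetical and are not taken from any XML library.

```python
class Node:
    def __init__(self, name):
        self.name = name

    def describe(self):
        return "node " + self.name

    def report(self):
        # All methods are virtual: this calls the describe() of the
        # instance's actual class, not necessarily Node.describe().
        return "[" + self.describe() + "]"


class Timestamped:
    def stamp(self):
        return "stamped " + self.name


class Element(Node, Timestamped):    # multiple inheritance
    def describe(self):               # overrides Node.describe
        return "element " + self.name

    def tag(self):                    # adds a new method
        return "<" + self.name + ">"


title = Element("title")
print(title.report())   # inherited method, overridden behaviour
print(title.stamp())    # inherited from the second base class
print(title.tag())
```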
Modules and Packages
• A module is a set of code in a single file
• A package is a collection of related modules
XML and Python
• Accessing XML with Python
• Parsing XML with Python
– Non-validating parsers
– Validating parsers
Parsers for Jython
Manipulating XML
• Flat file processing with RE's (briefly!)
• PySAX - Simple API for XML
• PyDOM - W3C Document Object Model
• …
Flat File Processing
• XML documents are text
• Ordinary textual tools continue to work
• E.g. search for emph elements:
import re
for i in re.findall(
        r"<emph>(.*?)</emph>" , input ):
    print i
Flat File Recipe
• Unless your needs are very simple, let me help you!
• I’ve already converted the ultimate XML parsing regular expression to Python:
http://aspn.activestate.com/ASPN/Python/Cookbook/Recipe/65125
Events
• Think of an XML document as a series of events
• "Start tag", "End tag", "Characters", etc.
• We can handle hierarchy by tracking start/end tags
• We can deal with the document a little at a time
PySAX
• "Simple API for XML"
• Common API for parsers
• Based on Java API
• Parser implements certain interfaces
• Application implements callback interfaces
SAX Model
• The application hands the parser an event handler object
• The parser sends events to the handler
• The handler can
– store them somehow,
– build something,
– re-route them to other parts of the app
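The model above can be sketched with the `xml.sax` package from the standard library, assuming Python 3. The handler class `TagCounter` and the sample document are hypothetical. The handler stores a count per element name and tracks hierarchy by watching start and end tags.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TagCounter(ContentHandler):
    """Hypothetical handler: counts element names and nesting depth."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.tags = {}
        self.depth = 0
        self.max_depth = 0

    def startElement(self, name, attrs):
        # "Start tag" event: one level deeper in the hierarchy
        self.tags[name] = self.tags.get(name, 0) + 1
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)

    def endElement(self, name):
        # "End tag" event: back up one level
        self.depth -= 1


document = (b"<folder><title>XML bookmarks</title>"
            b"<bookmark><title>SIG</title></bookmark></folder>")

handler = TagCounter()
xml.sax.parseString(document, handler)   # the parser sends events to handler
print(handler.tags)
print(handler.max_depth)
```

The parser never builds the whole document in memory; the handler keeps only what it chooses to store.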
print handler.tags
Error Handling
• In addition to a content handler, we should assign an error handler.
class MyErrorHandler:
    def warning(self, exception):
        print "Whoa, nelly!"
        print exception
    def error(self, exception):
        print "Whoa, nelly!"
        raise exception
    def fatalError(self, exception):
        print "Whoa, nelly!"
        raise exception
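One way to attach such a handler, sketched for Python 3 with the standard `xml.sax` package, is to pass it as the third argument of `xml.sax.parseString`. The class `CollectingErrorHandler` and the deliberately malformed input are hypothetical. Unlike the handler on the slide, this one records each problem instead of printing.

```python
import xml.sax
from xml.sax.handler import ContentHandler, ErrorHandler


class CollectingErrorHandler(ErrorHandler):
    """Hypothetical handler that records problems before re-raising."""

    def __init__(self):
        self.problems = []

    def warning(self, exception):
        self.problems.append(("warning", str(exception)))

    def error(self, exception):
        self.problems.append(("error", str(exception)))
        raise exception

    def fatalError(self, exception):
        self.problems.append(("fatal", str(exception)))
        raise exception


errors = CollectingErrorHandler()
try:
    # Malformed on purpose: <title> is never closed.
    xml.sax.parseString(b"<folder><title>oops</folder>",
                        ContentHandler(), errors)
    parsed_ok = True
except xml.sax.SAXParseException:
    parsed_ok = False

print(parsed_ok)
print(errors.problems)
```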
Character handling
# print out characters in document
from xml.sax.handler import ContentHandler
import xml.sax, sys
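The rest of this listing is missing from the slide. As a separate illustration (not a reconstruction of the author's code), a handler can receive character data through the `characters` callback. The class `TextCollector` and the sample input are hypothetical, and the sketch assumes Python 3.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TextCollector(ContentHandler):
    """Hypothetical handler that gathers all character data."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.chunks = []

    def characters(self, content):
        # The parser may deliver one run of text in several calls,
        # so the pieces are collected and joined afterwards.
        self.chunks.append(content)


collector = TextCollector()
xml.sax.parseString(b"<p>Hello, <emph>XML</emph> world</p>", collector)
text = "".join(collector.chunks)
print(text)
```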
Document Object Model
• Document Object Model
• The DOM is a W3C standard
• Extended version of "Dynamic HTML"
• Defined in CORBA IDL
• Implemented in various languages
• Implemented in IE5.0 and eventually Netscape
The DOM
• The DOM is a tree-based API
• This implies a certain amount of overhead
• But also a lot of convenience and flexibility
• XPath implementation essentially requires tree-based APIs
DOM Nodes
• Elements, attributes, comments, etc. are called "nodes"
• Classes represent node types
• All node types subclass the "node" base class
DOM node types
More DOM node types
Navigation properties
• parentNode - Parent of this node
• firstChild - First child of this node
• lastChild - Last child of this node
• previousSibling - Node immediately preceding this node
• nextSibling - Node immediately following this node
• childNodes - List containing all the children of this node
<title> SIG for XML Processing in
Python </title>
</bookmark>
</folder>
First "title" node
Properties:
• parentNode: folder element
• firstChild: Text node 'XML bookmarks'
• lastChild: Text node 'XML bookmarks'
• previousSibling: None
• nextSibling: bookmark element
• childNodes: A 1-element list: [ Text node 'XML bookmarks' ]
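These properties can be checked with `xml.dom.minidom` from the standard library, assuming Python 3. The document below is a guess at the example's structure (a folder holding a title and one bookmark) and is written without whitespace between tags, since whitespace would show up as extra Text nodes.

```python
from xml.dom import minidom

# Hypothetical document; no whitespace between tags, so there are
# no whitespace-only Text nodes among the siblings.
doc = minidom.parseString(
    "<folder><title>XML bookmarks</title>"
    "<bookmark><title>SIG for XML Processing in Python</title></bookmark>"
    "</folder>")

title = doc.getElementsByTagName("title")[0]   # first "title" in document order

print(title.parentNode.tagName)
print(title.firstChild.data)
print(title.firstChild is title.lastChild)
print(title.previousSibling)
print(title.nextSibling.tagName)
print(len(title.childNodes))
```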
Modifying a DOM
appendChild(newChild)
insertBefore(newChild, refChild)
replaceChild(newChild, oldChild)
removeChild(oldChild)
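A short sketch of the four methods using `xml.dom.minidom`, assuming Python 3. The document and the helper `make` are hypothetical.

```python
from xml.dom import minidom

doc = minidom.parseString("<folder><title>XML bookmarks</title></folder>")
folder = doc.documentElement


def make(tag, text):
    """Hypothetical helper: build <tag>text</tag> owned by 'doc'."""
    element = doc.createElement(tag)
    element.appendChild(doc.createTextNode(text))
    return element


bookmark = make("bookmark", "first")
folder.appendChild(bookmark)                  # children: title, bookmark

note = make("note", "inserted")
folder.insertBefore(note, bookmark)           # title, note, bookmark

replacement = make("bookmark", "second")
folder.replaceChild(replacement, bookmark)    # title, note, bookmark (second)

folder.removeChild(note)                      # title, bookmark (second)

result = folder.toxml()
print(result)
```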
The Document Node
• One Document node per document
• The base of the entire tree
• documentElement attribute contains a single Element node
• childNodes may have additional children, such as ProcessingInstruction nodes
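A brief sketch with `xml.dom.minidom`, assuming Python 3. The stylesheet processing instruction and the `folder` element are hypothetical input chosen to show a Document node with more than one child.

```python
from xml.dom import minidom, Node

doc = minidom.parseString(
    '<?xml-stylesheet href="style.css" type="text/css"?><folder/>')

# The Document node is the base of the tree.
print(doc.nodeType == Node.DOCUMENT_NODE)

# documentElement is the single top-level Element node.
print(doc.documentElement.tagName)

# childNodes also holds the processing instruction that precedes it.
kinds = [child.nodeType for child in doc.childNodes]
print(kinds)
```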
PyDOM
• A richer, more robust DOM than minidom
• More classes, support for DOM 2+
• Integration with XPath and XSLT
PyXML Parsers
• xml.parsers.xmlproc
• qp_xml
• xml.sax.drivers
– …
Python SOAP Example
• SOAP.py:
import SOAP
server = SOAP.SOAPProxy( "http://localhost:8000/" )
print server.echo( "Hello world" )
XML and Zope
• Zope is an Open Source application server that publishes objects on the Internet.
• ParsedXML: Breaks up an XML document into bits.
• XML-RPC: You can plumb the depths of Zope with XML-RPC.
• ZCatalog: Index based on element-type names, attribute names, etc.
ParsedXML
• A free Zope “product” (extension)
• Every element is a first-class Zope object
• You can add “behavior” to XML documents
• RSS Channel Product
Redfoot
• Redfoot is a framework for distributed RDF-based applications, written in Python.
– an RDF database
– a query API for RDF
– an RDF parser and serializer
– a simple HTTP server providing a web interface for viewing and editing RDF
– a fully customizable UI
– the beginnings of a peer-to-peer architecture for communication between different RDF databases