API for XML Processing (recap)
-
APIs offer simple standardized XML access.
-
APIs connect an application to the parser and applications to one another.
-
APIs allow XML processing without knowledge of physical document structure (entities).
-
APIs optimize XML processing.
XML APIs Fundamental Types
- Tree-based API
-
a tree representation is constructed and processed
- Event-based API
-
events are produced and handled
- Pull API
-
events are pulled off the document
Tree-based API
-
They map an XML document to a memory-based tree structure.
-
They allow the application to traverse the entire tree.
-
The best-known is the Document Object Model (DOM) from the W3C (http://www.w3.org/DOM).
Programming Language Specific Models
-
Java JDOM - http://jdom.org
-
Java dom4j - http://dom4j.org
-
Java XOM - http://www.xom.nu
-
Python 4Suite - http://4suite.org
-
PHP SimpleXML - http://www.php.net/simplexml
Document Object Model (DOM)
-
Basic interface to process and access the tree representation of XML data
-
Three versions of DOM: DOM Level 1, 2, 3
-
DOM does not depend on how the XML is parsed.
-
Described using IDL + API descriptions for particular programming languages (C++, Java, etc.)
HTML Documents Specific DOM
-
The HTML Core DOM is more or less consolidated with the XML DOM
-
Designed for use with CSS
-
Used for dynamic HTML programming (scripting using VBScript, JavaScript, etc.)
-
Contains the browser environment (windows, history, etc) besides the document model itself.
DOM references
-
JAXP Tutorial, Part III: XML and the Document Object Model (DOM) (http://java.sun.com/xml/jaxp/dist/1.1/docs/tutorial/dom/index.html)
-
Portal dedicated to the DOM http://www.oasis-open.org/cover/dom.html
-
DOM 1 Interface visual overview http://www.xml.com/pub/a/1999/07/dom/index.html
-
Tutorial "Understanding DOM (Level 2)" available at https://www.ibm.com/developerworks/xml/
Using DOM in Java
-
Recent Java versions (JDK and JRE) support DOM natively; no additional library is needed.
-
Applications need to import the required symbols (interfaces, classes, etc.), mostly from the package org.w3c.dom.
What will we need often?
The most frequently used interfaces are:
-
Element: corresponds to an element in the logical document structure. It gives access to the element name, the attribute names, and the child nodes (including text nodes). Useful methods:
- Node getParentNode(): returns the parent node
- String getTextContent(): returns the textual content of the element
- NodeList getElementsByTagName(String name): returns the list of descendants (child nodes and their descendants) with the given name
-
Node: the super-interface of Element. It corresponds to a general node in the logical document structure, which may be an element, a text node, a comment, etc.
-
NodeList: a list of nodes (for example, the result of calling getElementsByTagName). It offers the following methods:
- int getLength(): returns the number of nodes in the list
- Node item(int index): returns the node at position index
-
Document: corresponds to the document node (it is the parent of the root element)
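The interfaces listed above can be tried out on a small in-memory document. The following is only a minimal sketch: the class name DomInterfacesSketch, its helper methods, and the sample XML are hypothetical and chosen for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; names and sample data are illustrative only.
class DomInterfacesSketch {

    // Parses an XML string into a W3C DOM Document.
    static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    // Collects the text content of every descendant element with the given name.
    static List<String> textsOf(Document doc, String tagName) {
        NodeList nodes = doc.getElementsByTagName(tagName);
        List<String> texts = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Node node = nodes.item(i);
            texts.add(node.getTextContent());
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<staff><person><name>Ann</name></person>"
                + "<person><name>Bob</name></person></staff>");
        Element root = doc.getDocumentElement();
        System.out.println(root.getTagName());           // name of the root element
        System.out.println(root.getParentNode() == doc); // is the Document its parent?
        System.out.println(textsOf(doc, "name"));        // texts of all name elements
    }
}
```

Note that getElementsByTagName searches the whole subtree, not just the direct children, which is why a single call on the Document finds both name elements.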
Example 1 - creating DOM tree from file
import java.io.IOException;
import java.net.URL;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

public class Task1 {
    public Task1(URL url) throws SAXException, ParserConfigurationException, IOException {
        // Create a new instance of the factory class.
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Obtain a new DocumentBuilder from the factory.
        DocumentBuilder builder = factory.newDocumentBuilder();
        // Use the DocumentBuilder to parse the XML document
        // and obtain its model as a W3C DOM Document.
        Document doc = builder.parse(url.toString());
    }
}
Example 2 - DOM tree modification
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class Task1 {
    private Document doc;

    // Adjusts salaries: if a person's salary is less than the minimum,
    // it is raised to the minimum. Other salaries are left unchanged.
    public void adjustSalary(double minimum) {
        // Get the list of salary elements.
        NodeList salaries = doc.getElementsByTagName("salary");
        for (int i = 0; i < salaries.getLength(); i++) {
            // Get the salary element.
            Element salaryElement = (Element) salaries.item(i);
            // Read the current salary from the element's text content.
            double salary = Double.parseDouble(salaryElement.getTextContent());
            if (salary < minimum) {
                // Replace the text content of the element.
                salaryElement.setTextContent(String.valueOf(minimum));
            }
        }
    }
}
Example 3 - storing a DOM tree into an XML file
An example of a method that stores a DOM tree into a file (see Homework 1). The procedure uses a transformation that we have not covered yet. Let us use it as a black box.
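import java.io.File;
import java.io.IOException;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

public class Task1 {
    private Document doc;

    // transform() throws TransformerException (the superclass of
    // TransformerConfigurationException), so that is what is declared here.
    public void serializetoXML(File output) throws IOException, TransformerException {
        // Create a new instance of the factory class.
        TransformerFactory factory = TransformerFactory.newInstance();
        Transformer transformer = factory.newTransformer();
        // The input is the document held in memory.
        DOMSource source = new DOMSource(doc);
        // The output of the transformation is the output file.
        StreamResult result = new StreamResult(output);
        // Run the transformation.
        transformer.transform(source, result);
    }
}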
Event-based API
-
Generates a sequence of events while parsing the document.
-
Technical implementation: using callback methods
[The Hollywood Principle: Do not call us, we will call you!]
-
Application implements handlers (which process generated events).
-
Works at a lower level than tree-based APIs.
-
The application must do any further processing itself.
-
It saves memory: the parser itself does not create any persistent objects.
Event Examples
-
start document, end document
-
start element (also carries the attributes), end element
-
processing instruction
-
comment
-
entity reference
-
Best-known event-based API: SAX http://www.saxproject.org
SAX - Document Analysis Example
<?xml version="1.0"?>
<doc>
  <para>Hello, world!</para>
  <!-- that’s all folks -->
  <hr/>
</doc>
Parsing this document generates the following events:
start document
start element: doc
list of attributes: empty
start element: para
list of attributes: empty
characters: Hello, world!
end element: para
comment: that’s all folks
start element: hr
end element: hr
end element: doc
end document
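A handler that records such a trace could look like the sketch below. It is only an illustration under a few assumptions: the class name SaxTraceSketch is hypothetical, whitespace-only text between elements is ignored, attributes are not listed, and comments are reported only because the handler is also registered as a lexical handler (plain content handlers do not receive comment events).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical handler that records SAX events as strings.
class SaxTraceSketch extends DefaultHandler2 {
    final List<String> events = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();

    // The parser may deliver text in several chunks, so buffer it and
    // report it once the next non-text event arrives.
    private void flushText() {
        String s = text.toString().trim();
        if (!s.isEmpty()) {
            events.add("characters: " + s);
        }
        text.setLength(0);
    }

    @Override public void startDocument() { events.add("start document"); }

    @Override public void endDocument() { flushText(); events.add("end document"); }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        flushText();
        events.add("start element: " + qName); // attributes arrive in atts
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        flushText();
        events.add("end element: " + qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void comment(char[] ch, int start, int length) {
        flushText();
        events.add("comment: " + new String(ch, start, length).trim());
    }

    // Parses the XML string and returns the recorded events.
    static List<String> trace(String xml) throws Exception {
        SaxTraceSketch handler = new SaxTraceSketch();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        // Comments are delivered through the LexicalHandler extension.
        parser.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        trace("<?xml version=\"1.0\"?><doc><para>Hello, world!</para>"
                + "<!-- that's all folks --><hr/></doc>").forEach(System.out::println);
    }
}
```

The handler keeps only what it chooses to record; the parser itself builds no document model, which is where the memory savings of the event-based approach come from.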
When to use event-based API?
-
Easier for the parser programmer, harder for the application programmer.
-
The complete document is not available to the application programmer.
-
She must keep track of the parsing state herself.
-
Suitable for tasks that can be solved without the entire document.
-
Usually the fastest way of processing.
-
Difficulties in writing applications can be eased by extensions such as Streaming Transformations for XML (STX), http://stx.sourceforge.net
Optional SAX Parser Features
-
The behavior of a SAX parser can be controlled using so-called features and properties.
-
For the optional SAX parser features, see http://www.saxproject.org/?selected=get-set
-
For more details on properties and features see Use properties and features in SAX parsers (IBM DeveloperWorks/XML).
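As an illustration of features, the sketch below switches on namespace processing through the standard SAX feature URIs. The class name SaxFeaturesSketch and its method are hypothetical; which features a given parser supports, and which of them it lets you change, depends on the implementation.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;

// Hypothetical helper showing how SAX features are read and set.
class SaxFeaturesSketch {
    static final String NAMESPACES =
            "http://xml.org/sax/features/namespaces";
    static final String NAMESPACE_PREFIXES =
            "http://xml.org/sax/features/namespace-prefixes";

    // Returns a reader configured for namespace-aware parsing.
    static XMLReader namespaceAwareReader() throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        // Features are boolean switches identified by a URI.
        reader.setFeature(NAMESPACES, true);
        reader.setFeature(NAMESPACE_PREFIXES, false);
        return reader;
    }

    public static void main(String[] args) throws Exception {
        XMLReader reader = namespaceAwareReader();
        System.out.println(NAMESPACES + " = " + reader.getFeature(NAMESPACES));
    }
}
```

Properties work the same way through setProperty and getProperty, except that their values are arbitrary objects rather than booleans.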
SAX filters
-
SAX filters (implementations of the org.xml.sax.XMLFilter interface) can be programmed using the SAX API.
-
An instance of such a class accepts input events, processes them, and sends them to the output.
-
For more information on event filtering see Change the events output by a SAX stream http://www.ibm.com/developerworks/xml/library/x-tipsaxfilter/ (IBM DeveloperWorks/XML) for example.
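A filter is usually written by extending the helper class org.xml.sax.helpers.XMLFilterImpl, which passes every event through unchanged unless a method is overridden. The following sketch is a hypothetical example (the class SkipElementFilter and the sample element names are assumptions made for illustration): it suppresses one kind of element together with its content.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter: drops every element with the given name, including its content.
class SkipElementFilter extends XMLFilterImpl {
    private final String skippedName;
    private int depth = 0; // greater than 0 while inside a skipped element

    SkipElementFilter(XMLReader parent, String skippedName) {
        super(parent);
        this.skippedName = skippedName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (depth > 0 || qName.equals(skippedName)) {
            depth++; // swallow the event
            return;
        }
        super.startElement(uri, localName, qName, atts); // pass it downstream
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (depth > 0) {
            depth--;
            return;
        }
        super.endElement(uri, localName, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (depth == 0) {
            super.characters(ch, start, length);
        }
    }

    // Runs the filter over an XML string and returns the element names that got through.
    static List<String> survivingElements(String xml, String skippedName) throws Exception {
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        SkipElementFilter filter = new SkipElementFilter(parser, skippedName);
        List<String> names = new ArrayList<>();
        filter.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes atts) {
                names.add(qName);
            }
        });
        filter.parse(new InputSource(new StringReader(xml)));
        return names;
    }
}
```

Because a filter is itself an XMLReader, filters can be chained: the parent of one filter may be another filter instead of the parser.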
Additional SAX References
-
Primary source: http://www.saxproject.org
-
SAX Tutorial on JAXP: http://java.sun.com/webservices/reference/tutorials/jaxp/html/sax.html
Pull-based APIs
-
The application does not react to incoming events; instead, it pulls data from the file being processed.
-
Can be used when the programmer knows the structure of the input data and can pull the data from the file accordingly.
-
The opposite of the (push) event-based approach.
-
Very comfortable for the application programmer, but implementations are usually slower than push (event-based) APIs.
-
Java offers the XML-PULL parser API (see Common API for XML Pull Parsing, http://www.xmlpull.org/) and also
-
the Streaming API for XML (StAX), http://www.jcp.org/en/jsr/detail?id=173, developed as a product of the JCP (Java Community Process).
Streaming API for XML (StAX)
-
StAX has since become part of the Java API for XML Processing (JAXP); it ships with Java SE 6 and later.
-
It offers two styles of pull-based processing:
-
pulling events using an iterator: more comfortable
-
low-level access using a so-called cursor: faster
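The two styles can be compared side by side. The sketch below is illustrative only: the class StaxStylesSketch and its methods are hypothetical, and both methods simply collect the text of all Title elements, as found in a book catalogue.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.events.XMLEvent;

// Hypothetical comparison of the two StAX styles.
class StaxStylesSketch {

    // Iterator style: each call to nextEvent() returns an XMLEvent object.
    static List<String> titlesWithIterator(String xml) throws Exception {
        XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        List<String> titles = new ArrayList<>();
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()
                    && event.asStartElement().getName().getLocalPart().equals("Title")) {
                titles.add(reader.getElementText());
            }
        }
        reader.close();
        return titles;
    }

    // Cursor style: next() moves the cursor, and the reader itself is
    // queried about the current position; no event objects are created.
    static List<String> titlesWithCursor(String xml) throws Exception {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        List<String> titles = new ArrayList<>();
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && reader.getLocalName().equals("Title")) {
                titles.add(reader.getElementText());
            }
        }
        reader.close();
        return titles;
    }
}
```

In both cases the application decides when to ask for the next event, which is what distinguishes pull parsing from the callback-driven SAX model.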
StAX - a Cursor Example
(from Oracle Java Tutorials http://docs.oracle.com/javase/tutorial/jaxp/stax/example.html)
StAX - source XML document
<?xml version="1.0" encoding="UTF-8"?>
<BookCatalogue xmlns="http://www.publishing.org">
  <Book>
    <Title>Yogasana Vijnana: the Science of Yoga</Title>
    <Author>Dhirendra Brahmachari</Author>
    <Date>1966</Date>
    <ISBN>81-40-34319-4</ISBN>
    <Publisher>Dhirendra Yoga Publications</Publisher>
    <Cost currency="INR">11.50</Cost>
  </Book>
  <Book>
    <Title>The First and Last Freedom</Title>
    <Author>J. Krishnamurti</Author>
    <Date>1954</Date>
    <ISBN>0-06-064831-7</ISBN>
    <Publisher>Harper &amp; Row</Publisher>
    <Cost currency="USD">2.95</Cost>
  </Book>
</BookCatalogue>
In this example, the client application pulls the next event in the XML stream by calling the next method on the parser; for example:
// Fragment: xmlif (an XMLInputFactory), filename, count and the print*
// helpers are defined elsewhere in the tutorial.
try {
    for (int i = 0; i < count; i++) {
        // Pass the file name: all relative entity references
        // will be resolved against it as the base URI.
        XMLStreamReader xmlr =
            xmlif.createXMLStreamReader(filename, new FileInputStream(filename));
        // A newly created XMLStreamReader is positioned
        // at the START_DOCUMENT event.
        int eventType = xmlr.getEventType();
        printEventType(eventType);
        printStartDocument(xmlr);
        // Check whether there are more events in the input stream.
        while (xmlr.hasNext()) {
            eventType = xmlr.next();
            printEventType(eventType);
            // These helpers print information about the particular event.
            printStartElement(xmlr);
            printEndElement(xmlr);
            printText(xmlr);
            printPIData(xmlr);
            printComment(xmlr);
        }
        xmlr.close();
    }
} catch (XMLStreamException | IOException ex) {
    // The original fragment omits error handling; report the failure.
    ex.printStackTrace();
}
Tree and event-based access combinations
-
Events → tree
-
Tree → events
Events → tree
-
Lets us skip or filter out the "uninteresting" parts of the document by monitoring events, and then
-
build an in-memory tree from only the "interesting" part of the document and process just that part.
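One way to sketch this combination with the standard Java APIs is to scan the input with a StAX cursor and, once the interesting element is reached, hand the reader to an identity transformation that builds a DOM tree from that element only. The class EventsToTreeSketch and its method are hypothetical, and the sketch assumes the JDK's built-in transformer, which accepts a StAXSource positioned at a start element.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stax.StAXSource;
import org.w3c.dom.Document;

// Hypothetical helper: events are used to skip ahead, a tree is built for one part only.
class EventsToTreeSketch {

    // Returns a DOM Document holding the first element with the given name,
    // or null if the input contains no such element.
    static Document firstSubtree(String xml, String elementName) throws Exception {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        while (reader.hasNext()) {
            // Everything before the interesting element is read and discarded.
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && reader.getLocalName().equals(elementName)) {
                DOMResult result = new DOMResult();
                // Copies the current element and its content into a new DOM tree.
                identity.transform(new StAXSource(reader), result);
                return (Document) result.getNode();
            }
        }
        return null;
    }
}
```

Only the selected subtree is ever held in memory, so this approach can cope with documents that are too large to load as a whole.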
Tree → events
-
We create the entire document tree (and process it) and
-
we then walk the tree and generate events as if we were reading the XML file.
-
This makes it easy to integrate both processing styles in a single application.
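With the standard Java APIs, a DOM tree can be replayed as SAX events by running an identity transformation from a DOMSource to a SAXResult. The sketch below is a minimal illustration; the class TreeToEventsSketch, its methods, and the event strings are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.sax.SAXResult;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: walks an existing DOM tree and reports it as SAX events.
class TreeToEventsSketch {

    // Builds the complete tree first (the tree-based step).
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    // Replays the tree as events and records the element events (the event-based step).
    static List<String> replay(Document doc) throws Exception {
        List<String> events = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes atts) {
                events.add("start " + qName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                events.add("end " + qName);
            }
        };
        // The identity transformation traverses the tree and calls the handler,
        // just as a SAX parser would while reading a file.
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(doc), new SAXResult(handler));
        return events;
    }
}
```

Any existing SAX handler can be reused this way on a tree that has already been modified in memory, without serializing and re-parsing it.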
Virtual object models
-
The DOM model of the document is not held entirely in memory; it is created on demand as particular nodes are accessed.
-
Combines the advantages of event-based and tree-based processing (speed and comfort).
-
There is an implementation: the Sablotron processor, http://www.xml.com/pub/a/2002/03/13/sablotron.html
Alternative tree-based models
XML Object Model (XOM)
-
XOM (XML Object Model) was created as a one-man project (author: Elliotte Rusty Harold).
-
It is an interface that strictly respects the logical data model of XML.
-
For motivation and specification see the XOM home page (http://www.xom.nu).
-
The open-source XOM implementation and the API documentation are available there, too.
DOM4J - practically usable tree-based model
-
comfortable, fast and memory efficient tree-oriented interface
-
designed and optimized for Java
-
available as open-source at http://dom4j.org
-
an excellent "cookbook" (http://dom4j.org/cookbook/cookbook.html) is available
-
dom4j performs well; see the tree-based models efficiency comparison (http://www.ibm.com/developerworks/xml/library/x-injava/)