PB138 - XML Processing in general programming languages (XML APIs)
APIs for XML Processing (recap)
• APIs offer simple standardized XML access.
• APIs connect an application to the parser, and applications to each other.
• APIs allow XML processing without knowledge of physical document structure (entities).
• APIs optimize XML processing.
Fundamental Types of XML APIs
Tree-based API
a tree representation is constructed and processed
Event-based API
events are produced and handled
Pull API
events are pulled from the document on request
Tree-based API
• Tree-based APIs map an XML document to an in-memory tree structure.
• They allow the application to traverse the entire tree.
• The best known is the Document Object Model (DOM) from the W3C (http://www.w3.org/DOM).
Programming Language Specific Models
• Java JDOM - http://jdom.org
• Java dom4j - http://dom4j.github.io
• Java XOM - http://www.xom.nu
• Python 4Suite - http://4suite.org
• PHP SimpleXML - http://www.php.net/simplexml
Document Object Model (DOM)
• Basic interface to process and access the tree representation of XML data
• Three versions of DOM: DOM Level 1, 2, 3
• DOM does not depend on how the XML is parsed.
• Described using IDL + API descriptions for particular programming languages (C++, Java, etc.)
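Because DOM does not depend on parsing, a tree can also be built directly in memory. The following is a minimal sketch using the Java binding shipped with the JDK; the class name DomWithoutParser and the element names staff, employee and salary are made up for illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class DomWithoutParser {
    // Builds <staff><employee id="1"><salary>1200</salary></employee></staff>
    // purely through DOM calls; no XML text is parsed.
    static Document build() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element staff = doc.createElement("staff");
            doc.appendChild(staff);
            Element employee = doc.createElement("employee");
            employee.setAttribute("id", "1");
            staff.appendChild(employee);
            Element salary = doc.createElement("salary");
            salary.setTextContent("1200");
            employee.appendChild(salary);
            return doc;
        } catch (ParserConfigurationException e) {
            // Sketch only: report a missing DOM implementation as unchecked.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = build();
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```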
DOM Levels
• DOM Level 1: provides a low-level set of fundamental interfaces that can represent any structured
document, as well as extended interfaces for XML documents (Document, Element, DocumentFragment, etc.;
see the DOM Level 1 Specification).
• DOM Level 2: defines a platform- and language-neutral interface that allows programs to dynamically
access and update the content and structure of documents (see the DOM Level 2 Specification).
• DOM Level 3: enhances DOM Level 2 by completing the mapping between DOM and the XML
Information Set, adding support for XML Base, allowing user information to be attached to DOM
nodes, etc. (see the DOM Level 3 Specification).
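Attaching user information to a node, one of the DOM Level 3 additions, might look as follows in Java. This is only a sketch: the class name UserDataDemo, the method roundTrip and the key used in main are hypothetical, while setUserData and getUserData are the standard org.w3c.dom.Node methods.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class UserDataDemo {
    // Attaches an application object to a node under the given key
    // and reads it back. The data lives only in memory; it is not
    // part of the XML content and is not serialized.
    static Object roundTrip(String key, Object value) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element salary = doc.createElement("salary");
            doc.appendChild(salary);
            // The third argument is an optional UserDataHandler; none is used here.
            salary.setUserData(key, value, null);
            return salary.getUserData(key);
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("checkedBy", "auditor"));
    }
}
```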
HTML-Specific DOM
• The HTML Core DOM is more or less consolidated with the XML DOM.
• Designed to work with CSS.
• Used for dynamic HTML programming (scripting with VBScript, JavaScript, etc.).
• Besides the document model itself, it also covers the browser environment (windows, history, etc.).
DOM references
• JAXP Tutorial, Part III: XML and the Document Object Model (DOM)
(http://java.sun.com/xml/jaxp/dist/1.1/docs/tutorial/dom/index.html)
• Portal dedicated to the DOM http://www.oasis-open.org/cover/dom.html
• DOM 1 Interface visual overview http://www.xml.com/pub/a/1999/07/dom/index.html
• Tutorial ”Understanding DOM (Level 2)” available at https://www.ibm.com/developerworks/xml/
Using DOM in Java
• DOM is supported natively in current Java versions (JDK and JRE); no additional library is needed.
• Applications must import the required symbols (interfaces, classes, etc.), mostly from the package
org.w3c.dom.
What we frequently need
The most frequently used interfaces are:
• Element corresponds to an element in the logical document structure. It gives access to the name
of the element, the names of its attributes, and its child nodes (including text nodes). Useful methods:
• Node getParentNode() - returns the parent node.
• String getTextContent() - returns the textual content of the element.
• NodeList getElementsByTagName(String name) - returns the list of descendants (child elements and their
descendants) with the given name.
What we frequently need (2)
• Node is the superinterface of Element. It corresponds to a general node in the logical document
structure, which may be an element, a text node, a comment, etc.
• NodeList is a list of nodes (for example, the result of calling getElementsByTagName). It offers the
following methods for processing:
• int getLength() - returns the number of nodes in the list
• Node item(int index) - returns the node at position index
• Document corresponds to the document node (it is the parent of the root element).
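The interfaces and methods listed above can be combined as in the following sketch. It parses XML from a string, looks up elements by tag name and walks the resulting NodeList. The class name NodeListDemo, the method describe and the sample data in main are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class NodeListDemo {
    // For every element with the given tag name returns
    // "parentName/elementName=text content".
    static List<String> describe(String xml, String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName(tagName);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // getElementsByTagName returns elements only, so the cast is safe.
                Element element = (Element) nodes.item(i);
                // The parent is a Node: another element, or the Document for the root.
                String parentName = element.getParentNode().getNodeName();
                result.add(parentName + "/" + element.getTagName()
                        + "=" + element.getTextContent());
            }
            return result;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<staff><employee><salary>900</salary></employee></staff>";
        System.out.println(describe(xml, "salary"));
    }
}
```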
Example 1 - creating DOM tree from file
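import java.io.IOException;
import java.net.URL;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

public class Task1 {
    // Keep the parsed document so that other methods can work with it.
    private final Document doc;

    public Task1(URL url) throws SAXException,
            ParserConfigurationException, IOException {
        // Create a new instance of the factory class.
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Obtain a DocumentBuilder from the factory.
        DocumentBuilder builder = factory.newDocumentBuilder();
        // Parse the XML document; the result is a W3C DOM document model.
        doc = builder.parse(url.toString());
    }
}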
Example 2 - DOM tree modification
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class Task1 {
    private Document doc;

    // Raises every salary below the given minimum to that minimum.
    public void adjustSalary(double minimum) {
        // All <salary> elements anywhere in the document.
        NodeList salaries = doc.getElementsByTagName("salary");
        for (int i = 0; i < salaries.getLength(); i++) {
            Element salaryElement = (Element) salaries.item(i);
            double salary = Double.parseDouble(
                    salaryElement.getTextContent());
            if (salary < minimum) {
                salaryElement.setTextContent(String.valueOf(minimum));
            }
        }
    }
}
Example 3 - storing a DOM tree into an XML file
An example of a method that stores a DOM tree into a file (see Homework 1). The procedure uses a
transformation, which we have not covered yet. Let us use it as a black box.
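import java.io.File;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

public class Task1 {
    private Document doc;

    // transform() throws TransformerException, so it must be declared.
    public void serializetoXML(File output) throws TransformerException {
        TransformerFactory factory = TransformerFactory.newInstance();
        // A transformer created without a stylesheet copies its input unchanged.
        Transformer transformer = factory.newTransformer();
        DOMSource source = new DOMSource(doc);
        StreamResult result = new StreamResult(output);
        transformer.transform(source, result);
    }
}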
Event-based API
• Generates a sequence of events while parsing the document.
• Technical implementation: callback methods [1].
• The application implements handlers, which process the generated events.
• Works at a lower level than a tree-based API.
• Any further processing is left to the application.
• It saves memory: the parser itself does not create any persistent objects.
Event Examples
• start document, end document
• start element (carries the attributes as well), end element
• processing instruction
• comment
• entity reference
• Best-known event-based API: SAX http://www.saxproject.org
SAX - Document Analysis Example
<?xml version="1.0"?>
<doc>
<para>Hello, world!</para>
<!-- that’s all folks -->
<hr/>
</doc>
SAX - Document Analysis Example
It generates the following events:
start document
start element: doc
list of attributes: empty
start element: para
list of attributes: empty
characters: Hello, world!
SAX - Document Analysis Example (2)
end element: para
comment: that’s all folks
start element: hr
end element: hr
end element: doc
end document
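The event listing can be reproduced with a small handler. The following is a sketch that assumes the SAX parser built into the JDK; the class name SaxEventLog and the method log are hypothetical. Two details differ from the simplified listing: attributes arrive as an argument of the start element callback rather than as a separate event, and comments are not reported through the basic ContentHandler, so the handler extends DefaultHandler2 and is registered through the standard lexical-handler property. Whitespace-only text between elements is skipped to keep the log short.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.ext.DefaultHandler2;

class SaxEventLog extends DefaultHandler2 {
    final List<String> events = new ArrayList<>();

    @Override public void startDocument() { events.add("start document"); }
    @Override public void endDocument() { events.add("end document"); }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        events.add("start element: " + qName + " (" + atts.getLength() + " attributes)");
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("end element: " + qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) events.add("characters: " + text);
    }

    // Called only because the handler is also registered as a LexicalHandler.
    @Override
    public void comment(char[] ch, int start, int length) {
        events.add("comment: " + new String(ch, start, length).trim());
    }

    static List<String> log(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            SaxEventLog handler = new SaxEventLog();
            parser.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return handler.events;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        log("<doc><para>Hello, world!</para><hr/></doc>").forEach(System.out::println);
    }
}
```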
When to use event-based API?
• Easier for the parser programmer, harder for the application programmer.
• The application programmer never has the complete document available.
• Programmers must keep the state of the analysis themselves.
• Suitable for tasks that can be solved without access to the entire document.
• Usually the fastest way of processing.
• The difficulty of writing applications can be eased by extensions such as Streaming
Transformations for XML (STX), http://stx.sourceforge.net
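Keeping the state of the analysis oneself typically means remembering where in the document the parser currently is. The sketch below collects the text of all elements with a given name. Its only state is a buffer that exists while the parser is inside such an element. The class name TextCollector and the element name Title used in main are assumptions for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class TextCollector extends DefaultHandler {
    private final String wanted;
    private final List<String> texts = new ArrayList<>();
    // State of the analysis: non-null only while inside a wanted element.
    private StringBuilder current;

    TextCollector(String wanted) { this.wanted = wanted; }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals(wanted)) current = new StringBuilder();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may deliver one text node in several chunks, so append.
        if (current != null) current.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals(wanted) && current != null) {
            texts.add(current.toString());
            current = null;
        }
    }

    static List<String> collect(String xml, String elementName) {
        try {
            TextCollector handler = new TextCollector(elementName);
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.texts;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Books><Book><Title>A</Title></Book><Book><Title>B</Title></Book></Books>";
        System.out.println(collect(xml, "Title"));
    }
}
```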
Optional SAX Parser Features
• The behavior of a SAX parser can be controlled using so-called features and properties.
• For the optional features of SAX parsers see http://www.saxproject.org/?selected=get-set
• For more details on properties and features see Use properties and features in SAX parsers (IBM
DeveloperWorks/XML).
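As an illustration, the standard namespaces feature can be switched on through XMLReader.setFeature. With it enabled, the start element callback receives the namespace URI and the local name of each element. This is a sketch assuming the JDK's built-in parser; the class name SaxFeatureDemo and the method rootName are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class SaxFeatureDemo {
    // Feature identifiers are URIs; this one is defined by SAX itself.
    static final String NAMESPACES = "http://xml.org/sax/features/namespaces";

    // Returns "{namespace URI}localName" of the root element.
    static String rootName(String xml) {
        try {
            XMLReader reader = SAXParserFactory.newInstance()
                    .newSAXParser().getXMLReader();
            reader.setFeature(NAMESPACES, true);
            final String[] root = new String[1];
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    if (root[0] == null) root[0] = "{" + uri + "}" + localName;
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
            return root[0];
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rootName("<c:cat xmlns:c=\"urn:example\"/>"));
    }
}
```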
SAX filters
• SAX filters (implementations of the org.xml.sax.XMLFilter interface) can be programmed using
the SAX API.
• An instance of such a class accepts input events, processes them and sends them on to the output.
• For more information on event filtering see, for example, Change the events output by a SAX stream
http://www.ibm.com/developerworks/xml/library/x-tipsaxfilter/ (IBM DeveloperWorks/XML).
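A filter is most easily written by extending the helper class org.xml.sax.helpers.XMLFilterImpl, which passes every event through unchanged unless a method is overridden. The sketch below drops a chosen element together with its whole subtree. The class name SkipElementFilter and the helper elementNames are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

class SkipElementFilter extends XMLFilterImpl {
    private final String skipped;
    private int depth = 0; // greater than 0 while inside a skipped element

    SkipElementFilter(XMLReader parent, String skipped) {
        super(parent);
        this.skipped = skipped;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        if (depth > 0 || qName.equals(skipped)) { depth++; return; }
        super.startElement(uri, localName, qName, atts); // pass the event on
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        if (depth > 0) { depth--; return; }
        super.endElement(uri, localName, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (depth == 0) super.characters(ch, start, length);
    }

    // Chain: parser -> filter -> handler that records element names.
    static List<String> elementNames(String xml, String skipped) {
        try {
            XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            SkipElementFilter filter = new SkipElementFilter(parser, skipped);
            List<String> names = new ArrayList<>();
            filter.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String u, String l, String qName, Attributes a) {
                    names.add(qName);
                }
            });
            filter.parse(new InputSource(new StringReader(xml)));
            return names;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(elementNames("<doc><para>Hello</para><hr/></doc>", "hr"));
    }
}
```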
Additional SAX References
• Primary source: http://www.saxproject.org
• SAX Tutorial on JAXP: http://java.sun.com/webservices/reference/tutorials/jaxp/html/sax.html
Pull-based APIs
• The application does not react to incoming events; instead it pulls data from the processed file.
• Can be used when the programmer knows the structure of the input data and can pull the data from
the file as needed.
• The opposite of the (push) event-based approach.
• Very comfortable for the application programmer, but implementations are usually slower than
push event-based APIs.
Java Pull-based APIs
• For Java there is the XML-PULL parser API, see Common API for XML Pull Parsing
http://www.xmlpull.org/, and also
• the newer Streaming API for XML (StAX) http://www.jcp.org/en/jsr/detail?id=173,
developed as a product of the JCP (Java Community Process).
Streaming API for XML (StAX)
• StAX has been part of the Java API for XML Processing (JAXP), and thus of the standard Java library, since Java SE 6.
• It offers two styles of pull-based processing:
• pulling events using an iterator, which is more comfortable;
• low-level access using a so-called cursor, which is faster.
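The iterator style is provided by XMLEventReader, which returns each event as an object, while the cursor style is provided by XMLStreamReader. The following is a minimal sketch of the iterator style; the class name StaxIteratorDemo and the method elementNames are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

class StaxIteratorDemo {
    // Pulls events one by one and records the local names of all elements.
    static List<String> elementNames(String xml) {
        try {
            XMLEventReader reader = XMLInputFactory.newInstance()
                    .createXMLEventReader(new StringReader(xml));
            List<String> names = new ArrayList<>();
            while (reader.hasNext()) {
                // The application asks for the next event; nothing is pushed to it.
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()) {
                    names.add(event.asStartElement().getName().getLocalPart());
                }
            }
            reader.close();
            return names;
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(elementNames("<doc><para>Hello</para><hr/></doc>"));
    }
}
```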
StAX - a Cursor Example
• Taken from the Oracle Java Tutorials http://docs.oracle.com/javase/tutorial/jaxp/stax/example.html
• In this example, the client application pulls the next event in the XML stream by calling the next
method on the parser.
StAX - source XML document
<?xml version="1.0" encoding="UTF-8"?>
<BookCatalogue xmlns="http://www.publishing.org">
  <Book>
  <Title>Yogasana Vijnana: the Science of Yoga</Title>
  <Author>Dhirendra Brahmachari</Author>
  <Date>1966</Date>
  <ISBN>81-40-34319-4</ISBN>
  <Publisher>Dhirendra Yoga Publications</Publisher>
  <Cost currency="INR">11.50</Cost>
  </Book>
  <Book>
  <Title>The First and Last Freedom</Title>
  <Author>J. Krishnamurti</Author>
  <Date>1954</Date>
  <ISBN>0-06-064831-7</ISBN>
  <Publisher>Harper &amp; Row</Publisher>
  <Cost currency="USD">2.95</Cost>
  </Book>
</BookCatalogue>
StAX - Java code
// xmlif (an XMLInputFactory), count, filename and the print* helpers
// are not shown on the slide; they are assumed to be defined elsewhere.
try {
    for (int i = 0; i < count; i++) {
        // Pass the file name: all relative entity references
        // will be resolved against it as the base URI.
        XMLStreamReader xmlr = xmlif.createXMLStreamReader(filename,
                new FileInputStream(filename));
        // A newly created XMLStreamReader is positioned
        // at the START_DOCUMENT event.
        int eventType = xmlr.getEventType();
        printEventType(eventType);
        printStartDocument(xmlr);
        // Check whether there are more events in the input stream.
        while (xmlr.hasNext()) {
            eventType = xmlr.next();
            printEventType(eventType);
            // These functions print the information about
            // the particular event by calling the relevant function.
            printStartElement(xmlr);
            printEndElement(xmlr);
            printText(xmlr);
            printPIData(xmlr);
            printComment(xmlr);
        }
    }
} catch (XMLStreamException | IOException ex) {
    // The slide omits error handling; printing the stack trace is a placeholder.
    ex.printStackTrace();
}
Tree and event-based access combinations
• Events → tree
• Tree → events
Events → tree
• Allows us to skip or filter out the "uninteresting" parts of a document by monitoring the
events, and then
• build an in-memory tree only from the "interesting" part of the document and process just that part.
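One way to sketch this combination with the standard Java library is to skip events with a StAX cursor until the interesting element starts, and then let an identity transformation copy just that element into a DOM tree. The class name PartialTree and the method firstSubtree are hypothetical, and the sketch assumes the transformer bundled with the JDK, which accepts a StAXSource positioned at a start element.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stax.StAXSource;
import org.w3c.dom.Document;

class PartialTree {
    // Skips everything up to the first element with the given name and
    // returns a DOM document containing only that element and its content.
    // Returns null if no such element exists.
    static Document firstSubtree(String xml, String elementName) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals(elementName)) {
                    // Events -> tree: the transformer reads events up to the
                    // matching end tag and builds DOM nodes from them.
                    DOMResult result = new DOMResult();
                    TransformerFactory.newInstance().newTransformer()
                            .transform(new StAXSource(reader), result);
                    return (Document) result.getNode();
                }
            }
            return null;
        } catch (XMLStreamException | TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<catalogue><note>skip</note><Book><Title>T1</Title></Book></catalogue>";
        Document doc = firstSubtree(xml, "Book");
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```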
Tree → events
• We build the entire document tree (and process it), and
• then we traverse the tree and generate events as if we were reading the XML file.
• This makes it easy to integrate both processing styles in a single application.
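In Java, one possible sketch of this direction feeds a DOM tree through an identity transformation into a SAX handler, so the handler receives the same kind of events a parser would produce. The class name TreeToEvents and the method replay are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.sax.SAXResult;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class TreeToEvents {
    // Builds a DOM tree first, then replays it as SAX events.
    static List<String> replay(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // The tree could be processed or modified here.
            List<String> events = new ArrayList<>();
            DefaultHandler handler = new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    events.add("start " + qName);
                }
                @Override
                public void endElement(String uri, String localName, String qName) {
                    events.add("end " + qName);
                }
            };
            // Tree -> events: the transformer walks the tree and calls the handler.
            TransformerFactory.newInstance().newTransformer()
                    .transform(new DOMSource(doc), new SAXResult(handler));
            return events;
        } catch (ParserConfigurationException | SAXException
                | IOException | TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        replay("<doc><para>Hello</para><hr/></doc>").forEach(System.out::println);
    }
}
```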
Virtual object models
• The DOM model of the document is not held in memory; it is created on demand as
particular nodes are accessed.
• Combines the advantages of event-based and tree-based processing (speed and comfort).
• One implementation is the Sablotron processor, http://www.xml.com/pub/a/2002/03/13/sablotron.html
Alternative tree-based models
• XML Object Model (XOM)
• DOM4J
XML Object Model (XOM)
• XOM (XML Object Model) was created as a one-man project (author: Elliotte Rusty Harold).
• It is an interface that strictly respects the logical data model of XML.
• For the motivation and specification see the XOM home page (http://www.xom.nu).
• The open-source XOM implementation and the API documentation are available there, too.
DOM4J - practically usable tree-based model
• a comfortable, fast and memory-efficient tree-oriented interface
• designed and optimized for Java
• available as open source at http://dom4j.github.io
• an excellent "cookbook" is available
[1] The Hollywood Principle: Do not call us, we will call you!