XML Namespaces, XML API, XPath

XML Namespaces

XML Namespaces are a W3C Recommendation; the current version is Namespaces in XML 1.0 (Third Edition), W3C Recommendation 8 December 2009: http://www.w3.org/TR/REC-xml-names
For XML 1.1 there is Namespaces in XML 1.1 (Second Edition), W3C Recommendation 16 August 2006. Editors: Andrew Layman, Richard Tobin, Tim Bray, Dave Hollander.
They define logical spaces for the names of elements and attributes in an XML document.
They give elements and attributes a "third dimension".
Each namespace (NS) is identified by exactly one globally unique identifier, given as a URI (URIs are a superset of URLs).
A namespace URI is not related in any way to content that may be available at that URL ("nothing is downloaded when processing namespaces").

Prefixes and Equivalence of NSs (1)

Instead of writing the URI itself in the document, one uses a prefix that is mapped to the URI by a declaration xmlns:prefix="URI".
An element or attribute name written as prefix:local-name (with a colon) is called a Qualified Name (QName).
Two namespaces are equal iff their URIs are identical character by character (in Unicode).
Namespaces do not apply to text nodes.
An element or attribute need not be in any namespace.
A prefix declaration or a default namespace declaration applies to the declaring element and recursively to all its descendants (child elements, their children, etc.), unless another declaration "remaps" the given prefix.
One namespace can be the so-called implicit (default) namespace, declared by the attribute xmlns="URI".
Default namespaces are NOT applied to attributes; attributes without an explicit prefix therefore do not belong to any namespace.
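
The rules above can be checked with a namespace-aware DOM parser. The following is a minimal sketch, assuming the standard JAXP parser shipped with the JDK; the class name NamespaceDemo and the helper parseRoot are hypothetical names chosen for this illustration. It parses one document that uses a default namespace and one that uses a prefix, and compares what the parser reports.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class NamespaceDemo {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Parses the XML text with a namespace-aware parser and returns the root element.
    static Element parseRoot(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // JAXP factories are not namespace-aware by default
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("cannot parse sample", e);
        }
    }

    public static void main(String[] args) {
        Element a = parseRoot("<html xmlns='" + XHTML + "' lang='en'><body/></html>");
        Element b = parseRoot("<x:html xmlns:x='" + XHTML + "' lang='en'><x:body/></x:html>");

        // Same namespace URI and same local name, although the prefixes differ.
        System.out.println(a.getNamespaceURI().equals(b.getNamespaceURI()));
        System.out.println(a.getLocalName().equals(b.getLocalName()));

        // The unprefixed attribute is in no namespace, even under a default namespace.
        System.out.println(a.getAttributeNode("lang").getNamespaceURI() == null);
    }
}
```

All three lines printed by main should be true: the prefix is only a local shorthand, and the default namespace does not reach the lang attribute.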

Example 1. Default NS

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
   <body>
      <h1>Huraaaa</h1>
   </body>
</html>

Example 2. Prefixed NS

<xhtml:html xmlns:xhtml="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
   <xhtml:body>
      <xhtml:h1>Huraaaa</xhtml:h1>
   </xhtml:body>
</xhtml:html>

Issues related to NS

Namespaces are not compatible with DTDs: a DTD is not namespace-aware.
A DTD strictly distinguishes between, e.g., the names xi:include and include, even if both belong to the same namespace and should thus have the same meaning for applications.

API for XML Processing (to repeat)

APIs offer simple standardized XML access.
APIs connect application to the parser and applications together.
APIs allow XML processing without knowledge of physical document structure (entities).
APIs optimize XML processing.

XML APIs Fundamental Types

Tree-based API
Event-based API
API based on pulling events/elements off the document (Pull API).
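
To make the difference between the event-based and the pull style concrete, here is a minimal sketch that counts start tags twice: once with SAX (the parser calls the application) and once with StAX (the application calls the parser). Both APIs ship with the JDK; the class name EventVsPull, the method names and the sample document are assumptions made for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class EventVsPull {
    // Event-based (push) style: the parser calls the handler for every start tag.
    static int countWithSax(String xml) {
        int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                count[0]++;
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("SAX parsing failed", e);
        }
        return count[0];
    }

    // Pull style: the application asks for the next event when it wants one.
    static int countWithStax(String xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            int count = 0;
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    count++;
                }
            }
            reader.close();
            return count;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("StAX parsing failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<a><b/><b><c/></b><b><c/></b></a>";
        System.out.println(countWithSax(xml));  // 6 start tags: a, b, b, c, b, c
        System.out.println(countWithStax(xml)); // 6
    }
}
```

Neither variant builds a tree, so memory use does not grow with document size; the pull loop is usually easier to stop early or to hand over to other code.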

Tree-based API

They map an XML document to a tree structure held in memory.
They allow the application to traverse the entire tree.
The best-known one is the Document Object Model (DOM from W3C, see http://www.w3.org/DOM).

Programming Language Specific Models

Java: JDOM - http://jdom.org
Java: dom4j - http://dom4j.org
Java: XOM - http://www.xom.nu
Python: 4Suite - http://4suite.org
PHP: SimpleXML - http://www.php.net/simplexml

Document Object Model (DOM)

The basic interface for accessing and processing the tree representation of XML data.
Three versions of DOM: DOM Level 1, 2, 3
DOM does not depend on how the XML was parsed.
It is described using IDL plus API bindings for particular programming languages (C++, Java, etc.)

HTML Document-Specific DOM

The HTML Core DOM is more or less consolidated with the XML DOM.
Designed to work with CSS.
Used for dynamic HTML programming (scripting using VBScript, JavaScript, etc.).
Contains the browser environment (windows, history, etc.) besides the document model itself.

DOM references

JAXP Tutorial, part dedicated to the DOM Part III: XML and the Document Object Model (DOM) (http://java.sun.com/xml/jaxp/dist/1.1/docs/tutorial/dom/index.html)
Portal dedicated to the DOM http://www.oasis-open.org/cover/dom.html
DOM 1 Interface visual overview http://www.xml.com/pub/a/1999/07/dom/index.html
Tutorial "Understanding DOM (Level 2)" available at https://www.ibm.com/developerworks/xml/

Using DOM in Java

DOM is supported natively in current Java versions (JDK and JRE); no additional library is needed.
Applications import the symbols they need (interfaces, classes, etc.), mostly from the package org.w3c.dom.

What will we need often?

Most often used interfaces are:

Element corresponds to an element in the logical document structure. It gives access to the name of the element, its attributes, and its child nodes (including text nodes). Useful methods:
Node getParentNode() - returns the parent node
String getTextContent() - returns the textual content of the element
NodeList getElementsByTagName(String name) - returns the list of descendants (child elements and their descendants) with the given name
Node is the super-interface of Element; it corresponds to a general node in the logical document structure, which may be an element, a text node, a comment, etc.
NodeList is a list of nodes (for example, the result of calling getElementsByTagName). It offers the following methods:
int getLength() - returns the number of nodes in the list
Node item(int index) - returns the node at position index
Document corresponds to the document node (it is the parent of the root element).
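
The interfaces listed above can be seen together in a minimal sketch. The class name DomInterfacesDemo, the helper parse and the staff/person/name sample data are hypothetical and not part of the course materials; only methods of org.w3c.dom named above are used.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DomInterfacesDemo {
    // Parses XML text into a W3C DOM Document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("cannot parse sample", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample data.
        Document doc = parse("<staff><person><name>Ann</name></person>"
                + "<person><name>Bob</name></person></staff>");
        Element root = doc.getDocumentElement();

        // getElementsByTagName searches all descendants, not only direct children.
        NodeList names = root.getElementsByTagName("name");
        for (int i = 0; i < names.getLength(); i++) {
            Element name = (Element) names.item(i);
            System.out.println(name.getParentNode().getNodeName()
                    + ": " + name.getTextContent());
        }

        // The parent of the root element is the Document node itself.
        System.out.println(root.getParentNode().isSameNode(doc));
    }
}
```

The name elements are grandchildren of staff, yet getElementsByTagName called on the root finds both of them.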

Example 1 - creating DOM tree from file

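import java.io.IOException;
import java.net.URL;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

public class Task1 {
  private Document doc;

  private Task1(URL url) throws SAXException,
    ParserConfigurationException, IOException {
    // Create a new instance of the factory class.
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    // Get a new DocumentBuilder from the factory.
    DocumentBuilder builder = factory.newDocumentBuilder();
    // Use the DocumentBuilder to parse the XML document;
    // the result is the document model in the form of a W3C DOM tree.
    doc = builder.parse(url.toString());
  }
}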

Example 2 - DOM tree modification

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class Task1 {
  private Document doc;
  // Method for salary adjustment.
  // If a person's salary is less than the minimum,
  // it is increased to the minimum.
  // The other persons are left unchanged.
  public void adjustSalary(double minimum) {
    // get the list of salary elements
    NodeList salaries = doc.getElementsByTagName("salary");
    for (int i = 0; i < salaries.getLength(); i++) {
      // get the salary element
      Element salaryElement = (Element) salaries.item(i);
      // read the current salary from the element's text content
      double salary = Double.parseDouble(
         salaryElement.getTextContent());
      if (salary < minimum) {
        // modify the text content of the element
        salaryElement.setTextContent(String.valueOf(minimum));
      }
    }
  }
}

Example 3 - storing a DOM tree into an XML file

Example of a method storing a DOM tree into a file (see Homework 1). The procedure uses a transformation that we have not covered yet; let us use it as a black box.

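import java.io.File;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

public class Task1 {
   private Document doc;
   // transform() throws TransformerException, so it must be declared here.
   public void serializetoXML(File output) throws TransformerException {
      // Create a new instance of the factory class.
      TransformerFactory factory
        = TransformerFactory.newInstance();
      // Without a stylesheet we get the identity transformation.
      Transformer transformer
        = factory.newTransformer();
      // The input is the document held in memory.
      DOMSource source = new DOMSource(doc);
      // The transformation output is the output file.
      StreamResult result = new StreamResult(output);
      // Run the transformation.
      transformer.transform(source, result);
   }
}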

Alternative tree-based models

XML Object Model (XOM)

XOM (XML Object Model) was created as a one-man project (author Elliotte Rusty Harold).
It is an interface that strictly respects the logical model of XML data.
For motivation and specification see the XOM home page (http://www.xom.nu).
The open-source XOM implementation and the API documentation are available there, too.

DOM4J - practically usable tree-based model

comfortable, fast and memory-efficient tree-oriented interface
designed and optimized for Java
available as open source at http://dom4j.org
a very good "cookbook" (http://dom4j.org/cookbook/cookbook.html) is available
dom4j is powerful, see the efficiency comparison of tree-based models (http://www.ibm.com/developerworks/xml/library/x-injava/)

Tree and event-based access combinations

Events → tree
Tree → events

Events → tree

Event monitoring lets us skip or filter out the "uninteresting" parts of a document, and then
build an in-memory tree only from the "interesting" part and process just that part.
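
One possible way to realize this combination is a SAX handler that creates DOM nodes only while it is inside the interesting element. The sketch below is an illustration under stated assumptions, not a reference implementation: the class name EventsToTree, the method extract and the library/book sample are hypothetical, and only the first matching element is kept.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class EventsToTree {
    // Returns a DOM document holding only the first element named `wanted`
    // (with its subtree); all other events are ignored.
    static Document extract(String xml, String wanted) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            DefaultHandler handler = new DefaultHandler() {
                private Node current;   // non-null only inside the interesting part
                private boolean done;   // true once the first match is complete

                @Override
                public void startElement(String uri, String local, String qName,
                                         Attributes atts) {
                    if (current == null && (done || !qName.equals(wanted))) {
                        return; // uninteresting part: no node is created
                    }
                    Element element = doc.createElement(qName);
                    for (int i = 0; i < atts.getLength(); i++) {
                        element.setAttribute(atts.getQName(i), atts.getValue(i));
                    }
                    Node parent = (current == null) ? doc : current;
                    parent.appendChild(element);
                    current = element;
                }

                @Override
                public void characters(char[] ch, int start, int length) {
                    if (current != null) {
                        current.appendChild(
                                doc.createTextNode(new String(ch, start, length)));
                    }
                }

                @Override
                public void endElement(String uri, String local, String qName) {
                    if (current == null) {
                        return;
                    }
                    Node parent = current.getParentNode();
                    done = (parent == doc);        // the matched element has just closed
                    current = done ? null : parent;
                }
            };
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("cannot process sample", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<library><meta><note>skip me</note></meta>"
                + "<book id='1'><title>XML</title></book><book id='2'/></library>";
        Element book = extract(xml, "book").getDocumentElement();
        System.out.println(book.getAttribute("id") + " " + book.getTextContent());
    }
}
```

The resulting tree contains only the first book element; the meta part and the second book never exist as DOM nodes.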

Tree → events

We build the entire document tree (and process it), and
then we walk through the tree and generate events as if we were reading the XML file.
This allows easy integration of both processing types in a single application.
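
In Java, one way to replay a tree as events is the identity transformation from a DOMSource to a SAXResult. This is a minimal sketch assuming the JDK's built-in transformer; the class name TreeToEvents and the method replay are hypothetical names for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.sax.SAXResult;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class TreeToEvents {
    // Builds a DOM tree, then walks it and reports it as SAX events.
    // Returns the names from the startElement events in the order received.
    static List<String> replay(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // transformers expect namespace-aware trees
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            List<String> seen = new ArrayList<>();
            DefaultHandler handler = new DefaultHandler() {
                @Override
                public void startElement(String uri, String local, String qName,
                                         Attributes atts) {
                    seen.add(qName);
                }
            };
            // A transformer created without a stylesheet copies source to result;
            // with a SAXResult the "copy" is a stream of callbacks to the handler.
            TransformerFactory.newInstance().newTransformer()
                    .transform(new DOMSource(doc), new SAXResult(handler));
            return seen;
        } catch (Exception e) {
            throw new IllegalStateException("cannot replay sample", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(replay("<a><b/><b><c/></b></a>"));
    }
}
```

The handler here is an ordinary SAX handler, so code written for event-based processing can be reused unchanged on a tree that was built or modified in memory.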

Virtual object models

The document's DOM model is not held in memory as a whole; nodes are created on demand when they are accessed.
This combines the advantages of event-based and tree-based processing (speed and comfort).
One implementation is the Sablotron processor, http://www.xml.com/pub/a/2002/03/13/sablotron.html

XPath - basic principles

XPath is a syntax for addressing parts of XML documents (nodes, sets of nodes, sequences of nodes; it cannot address parts of text nodes).
XPath uses a syntax similar to file system paths.
XPath offers a standard function library (as well as user-defined functions in some XPath 2.0 or even XPath 1.x processors).
XPath has been the basis of XSLT since XPath 1.0 and of XQuery since XPath 2.0.
XPath does not use XML syntax (it would be too verbose).
XPath 1.0 and 2.0 are W3C Recommendations - http://www.w3.org/TR/xpath

XPath - Application Domains

Advanced XML Data navigation

 <?xml version="1.0"?>
 <a>
  <b/>
  <b>
    <c/>
  </b>
  <b>
    <c/>
  </b>
 </a>

Select the 3rd node b:

 //b[3]

Select every node b that has a child node c:

 //b[./c]

Select every empty node b (one with no child elements):

 //b[count(./*)=0]
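
The three expressions above can be evaluated with the XPath 1.0 engine in the JDK (package javax.xml.xpath). This is a minimal sketch; the class name XPathNavigation and the helper count are hypothetical, and SAMPLE is the sample document from above written on one line.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathNavigation {
    static final String SAMPLE = "<a><b/><b><c/></b><b><c/></b></a>";

    // Evaluates the expression against the XML text and
    // returns the number of selected nodes.
    static int count(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(expression,
                    new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("cannot evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(count(SAMPLE, "//b[3]"));            // 1: the third b
        System.out.println(count(SAMPLE, "//b[./c]"));          // 2: both b with a child c
        System.out.println(count(SAMPLE, "//b[count(./*)=0]")); // 1: the empty b
    }
}
```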

Transformation (XSLT)
XPath is used to select the nodes that are to be processed.

   <xsl:value-of select="./c"/>

The selection part of XML query languages (XQuery)
Some XML modeling languages (Schematron, XML Schema)
…

XPath - terms paths and locations

A path describes (i.e. "navigates to") a location in an XML document. Path syntax is similar to paths in file systems; a path is either
- relative - evaluated from the context node (CN), see below, or
- absolute - evaluated from the document root, but predicates are still evaluated relative to the CN.

XPath - syntactic rules

[20] PathExpr ::= AbsolutePathExpr | RelativePathExpr
[22] AbsolutePathExpr ::= ("/" RelativePathExpr?) | ("//" RelativePathExpr)
[23] RelativePathExpr ::= StepExpr (("/" | "//") StepExpr)*
[24] StepExpr ::= AxisStep | GeneralStep
[25] AxisStep ::= (Axis? NodeTest StepQualifiers) | AbbreviatedStep

XPath - axes

Axes (singular axis, plural axes) are sets of document nodes defined relative to the context.
The context is formed by a document and the current (context) node (CN).
The axes are:
- child - contains the direct child nodes of CN
- descendant - contains all descendants of CN (attributes are not included)
- parent - contains the parent node of CN (if it exists)
- ancestor - contains all ancestors of CN, i.e. parent, grandparent, etc. up to the root (empty if CN is the root itself)
- following-sibling - contains all following siblings of CN (the axis is empty for namespace and attribute nodes)
- preceding-sibling - ditto, but contains the preceding siblings
- following - contains all nodes that follow CN in document order (except descendants, attributes and namespace nodes)
- preceding - ditto, but contains the preceding nodes (except ancestors, attributes and namespace nodes)
- attribute - contains the attributes of CN (for elements only)
- namespace - contains all namespace nodes of CN (for elements only)
- self - the CN itself
- descendant-or-self - contains the union of the descendant and self axes
- ancestor-or-self - contains the union of the ancestor and self axes
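
The axes listed above can be tried out with the XPath 1.0 engine in the JDK. The following is a minimal sketch; the class name AxesDemo, the helper select and the one-line SAMPLE document (an a element with three b children, in the style of the small documents used in this section) are assumptions made for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class AxesDemo {
    // a has three b children; the second b contains c (with a child d) and e;
    // the third b contains another c.
    static final String SAMPLE = "<a><b/><b><c><d/></c><e/></b><b><c/></b></a>";

    // Returns the names of the nodes selected by the expression.
    static List<String> select(String expression) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, new InputSource(new StringReader(SAMPLE)),
                            XPathConstants.NODESET);
            List<String> names = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                names.add(nodes.item(i).getNodeName());
            }
            return names;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("cannot evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        String[] expressions = {
            "//b/child::*",             // c, e and the second c
            "//b/descendant::*",        // as above plus d
            "//d/parent::*",            // the first c
            "//d/ancestor::*",          // c, its b, and a
            "//b/following-sibling::*", // the second and third b
            "//b/preceding-sibling::*", // the first and second b
            "/a/b/c/following::*",      // e, the third b and its c (d is a descendant)
            "/a/b/e/preceding::*"       // the first b, c and d (ancestors are excluded)
        };
        for (String expression : expressions) {
            System.out.println(expression + " -> " + select(expression));
        }
    }
}
```

Note that a node reachable from several context nodes appears only once in the result, because the result of a path in XPath 1.0 is a node set.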

Figure 1. //b/child::*

<?xml version="1.0"?>
<a>
 <b/>
 <b>
     <c/>
 </b>
 <b>
     <c/>
 </b>
</a>

Example 3. //b/descendant::*

<?xml version="1.0"?>
<a>
 <b/>
 <b>
      <c>
          <d/>
      </c>
 </b>
 <b>
      <c/>
 </b>
</a>

Example 4. //d/parent::*

<?xml version="1.0"?>
<a>
 <b/>
 <b>
        <c>
           <d/>
        </c>
 </b>
 <b>
        <c/>
 </b>
</a>

Example 5. //d/ancestor::*

<?xml version="1.0"?>
<a>
  <b/>
  <b>
      <c>
          <d/>
      </c>
  </b>
  <b>
       <c/>
  </b>
</a>

Example 6. //b/following-sibling::*

<?xml version="1.0"?>
<a>
 <b/>
 <b>
    <c>
       <d/>
    </c>
 </b>
 <b>
     <c/>
 </b>
</a>

Example 7. //b/preceding-sibling::*

<?xml version="1.0"?>
<a>
 <b/>
 <b>
      <c>
          <d/>
      </c>
 </b>
 <b>
      <c/>
 </b>
</a>

Example 8. /a/b/c/following::*

<?xml version="1.0"?>
<a>
    <b/>
    <b>
        <c>
             <d/>
        </c>
        <e/>
    </b>
    <b>
        <c/>
    </b>
</a>

Example 9. /a/b/e/preceding::*

<?xml version="1.0"?>
<a>
    <b/>
    <b>
        <c>
            <d/>
        </c>
    </b>
    <b>
        <d/>
        <e/>
    </b>
</a>

XPath - predicates

Predicates select from a node set given, for example, by a path.
Example: /article/para[3] - selects the 3rd paragraph (element para) of the article (element article)
The simplest predicate expression is a proximity position, as in the example above.
- Beware of reverse axes (ancestor, preceding, …): position is always counted from the CN, i.e. against document order.
- The position specification 3 can be replaced by the expression position()=3.
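
The position rules above, including the reverse-axis case, can be checked with a minimal sketch. The class name PredicatePosition, the helper eval and both sample documents are hypothetical; the XPath 1.0 engine of the JDK is assumed.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class PredicatePosition {
    // Evaluates the expression against the XML text and returns the result as a string.
    static String eval(String xml, String expression) {
        try {
            return XPathFactory.newInstance().newXPath()
                    .evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("cannot evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        String article = "<article><para>one</para><para>two</para><para>three</para></article>";
        // [3] and [position()=3] select the same paragraph.
        System.out.println(eval(article, "/article/para[3]"));            // three
        System.out.println(eval(article, "/article/para[position()=3]")); // three

        String nested = "<a><b><c><d/></c></b></a>";
        // ancestor is a reverse axis: position 1 is the nearest ancestor of d.
        System.out.println(eval(nested, "name(//d/ancestor::*[1])"));   // c
        System.out.println(eval(nested, "name(//d/ancestor::*[3])"));   // a
        // With parentheses the predicate applies to the finished node set,
        // which is ordered in document order, so position 1 is a.
        System.out.println(eval(nested, "name((//d/ancestor::*)[1])")); // a
    }
}
```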

XPath - expressions

Used in predicates, for computations, etc. They may contain XPath functions.
Expressions may operate on:
- text strings
- numbers (floating-point numbers)
- logical values (boolean)
- nodes
- sequences.
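
The value types listed above map onto return types of the Java XPath API. The sketch below is an illustration under stated assumptions: the class name ExpressionTypes, the helper eval and the order/item sample are hypothetical, and since the engine bundled with the JDK implements XPath 1.0, it returns node sets rather than XPath 2.0 sequences.

```java
import java.io.StringReader;
import javax.xml.namespace.QName;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ExpressionTypes {
    // Hypothetical sample data.
    static final String SAMPLE = "<order><item price='2.5'/><item price='4'/></order>";

    // Evaluates the expression and converts the result to the requested type.
    static Object eval(String expression, QName returnType) {
        try {
            return XPathFactory.newInstance().newXPath().evaluate(
                    expression, new InputSource(new StringReader(SAMPLE)), returnType);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("cannot evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // number (double-precision floating point)
        Double total = (Double) eval("sum(//item/@price)", XPathConstants.NUMBER);
        // logical value
        Boolean several = (Boolean) eval("count(//item) > 1", XPathConstants.BOOLEAN);
        // text string
        String label = (String) eval("concat('items: ', count(//item))", XPathConstants.STRING);
        // nodes
        NodeList items = (NodeList) eval("//item", XPathConstants.NODESET);

        System.out.println(total);             // 6.5
        System.out.println(several);           // true
        System.out.println(label);             // items: 2
        System.out.println(items.getLength()); // 2
    }
}
```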

XPath - short notation - Examples

para selects the para element children of the context node
* selects all element children of the context node
text() selects all text node children of the context node
@name selects the name attribute of the context node
@* selects all the attributes of the context node
para[1] selects the first para child of the context node
para[last()] selects the last para child of the context node
*/para selects all para grandchildren of the context node
/doc/chapter[5]/section[2] selects the second section of the fifth chapter of the doc
chapter//para selects all para descendants of the chapter element children of the context node
//para selects all para elements in the document
//olist/item selects all item elements that have an olist parent
.//para selects all para descendants of the CN
.. selects the parent node of the CN
../@lang selects the lang attribute of the parent of the CN

XPath - short notation (2)

The most commonly used short notation is for the child axis:

we use article/para instead of child::article/child::para.
For the attribute axis: we use para[@type="warning"] instead of child::para[attribute::type="warning"].
Another common short notation is // instead of /descendant-or-self::node()/
and of course the shortcuts . and ..

For clarity, we sometimes keep the longer form; do not fight it at all costs!
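
That each short form is only an abbreviation of the long one can be checked by evaluating both and comparing the results. This is a minimal sketch; the class name ShortNotation, the helper count and the article/para/section sample are hypothetical, and the XPath 1.0 engine of the JDK is assumed.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ShortNotation {
    // Hypothetical sample data.
    static final String SAMPLE = "<article><para type='warning'>w</para><para>p</para>"
            + "<section><para type='warning'>x</para></section></article>";

    // Returns the number of nodes selected by the expression;
    // the context node is the document root.
    static int count(String expression) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, new InputSource(new StringReader(SAMPLE)),
                            XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("cannot evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        String[][] pairs = {
            {"article/para", "child::article/child::para"},
            {"article/para[@type='warning']",
             "child::article/child::para[attribute::type='warning']"},
            {"//para", "/descendant-or-self::node()/child::para"},
            {"//section/..", "//section/parent::node()"},
            {"//para/.", "//para/self::node()"}
        };
        for (String[] pair : pairs) {
            // Both numbers on a line should be equal.
            System.out.println(pair[0] + " = " + count(pair[0])
                    + ", long form = " + count(pair[1]));
        }
    }
}
```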

Further Information on XPath

XPath on W3C: http://www.w3.org/TR/xpath
Zvon XPath Tutorial: http://zvon.org/xxl/XPathTutorial/Output/index.html
XPath Tutorial on W3Schools: http://www.w3schools.com/xpath/xpath_intro.asp

XPath 2.0

Final specification available at http://www.w3.org/TR/xpath20/
A different view of the return values of XPath expressions: everything is a sequence (even if it contains a single item)
→ removes the problems with node-set ordering
Introduces conditional expressions and loops (for expressions).
Introduces user-defined functions (dynamically evaluate XPath expressions)
Users can use universal and existential quantifiers (every ... satisfies, some ... satisfies), for example some $s in student satisfies $s/name = "Fred", every $s in student satisfies $s/@id
For more details see http://www.saxonica.com/ (the site also offers the XPath/XSLT/XQuery processor Saxon).

XPath 2.0 - examples

String functions - http://www.fi.muni.cz/~tomp/xml03/xpath20/string.html
Numeric functions - http://www.fi.muni.cz/~tomp/xml03/xpath20/numeric.html
Sequence functions - http://www.fi.muni.cz/~tomp/xml03/xpath20/sequence.html
Boolean functions - http://www.fi.muni.cz/~tomp/xml03/xpath20/boolean.html