Standards, specifications, XML processing APIs

Specifications and validity of XML

Which XML version to use?

In new applications - see W3C XML Core Working Group (http://www.w3.org/XML/Core/#Publications) for the answer:

Validity of XML documents

Document Type Definition (DTD)

Motivation for DTD, comparison

Problems with DTD?

Why use DTD?

DTD tutorials

ZVON

http://www.zvon.org/xxl/DTDTutorial/General/contents.html (including a Czech version)

DTD Tutorial

http://edutechwiki.unige.ch/en/DTD_tutorial (not just DTD but much more)

W3Schools DTD Tutorial

http://www.w3schools.com

DTDeclaration

DTDeclaration is placed immediately before the root element!

<!DOCTYPE root-elt-name External-ID [ internal part of DTD ]>

Internal or external part (internal or external subset) might or might not be present, or both can be present.

Identifiers in DTDeclaration

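External identifier can be either a system identifier (SYSTEM "URI") or a public identifier followed by a system identifier (PUBLIC "public-ID" "URI").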

DTD - conditional sections

For "commenting out" portions of DTDs e.g. for experimenting:

<![IGNORE[ this will be ignored ]]>
<![INCLUDE[ this will be included into DTD (i.e. not ignored)]]>

DTD - element type definition

Describes allowed content of the element, in form of <!ELEMENT element-name … >, where … can be

EMPTY

for empty element which may be represented as <element/> or <element></element> with the same logical meaning

ANY

any content allowed, i.e. text nodes, child elements, …

child elements

may contain child elements only - <!ELEMENT element-name (specification of child elements)>

mixed

containing both text and child elements given by enumeration <!ELEMENT element-name (#PCDATA | specification of child elements)*>

For mixed content, neither the order nor the cardinality of individual child elements can be specified. The star (*) is required, so any number of occurrences is always allowed.
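A minimal sketch of how such element type definitions are enforced, assuming the JAXP parser bundled with the JDK. The class name DtdValidationSketch and the note/to/body/br vocabulary are made up for illustration only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class DtdValidationSketch {
    // Hypothetical internal DTD subset: <note> has element content,
    // <body> has mixed content and <br> is EMPTY.
    static final String DTD =
        "<!DOCTYPE note [\n"
      + "  <!ELEMENT note (to, body)>\n"
      + "  <!ELEMENT to (#PCDATA)>\n"
      + "  <!ELEMENT body (#PCDATA | br)*>\n"
      + "  <!ELEMENT br EMPTY>\n"
      + "]>\n";

    /** Returns the validity errors reported for the given root element. */
    static List<String> validate(String rootElement) {
        List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // switch on DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                // Validity violations are reported as (recoverable) errors.
                public void error(SAXParseException e) { errors.add(e.getMessage()); }
                // Well-formedness violations are fatal.
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(DTD + rootElement)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return errors;
    }

    public static void main(String[] args) {
        // Valid: <to> followed by <body> with mixed content.
        System.out.println(validate("<note><to>Ann</to><body>Hi<br/>there</body></note>"));
        // Invalid: the required <to> child is missing.
        System.out.println(validate("<note><body>Hi</body></note>"));
    }
}
```

The first call should report no errors, while the second one reports that the content of note does not match (to, body); the exact wording of the message depends on the parser.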

DTD - element type definition - child elements

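For specifying the child elements, we use a sequence (a, b), a choice (a | b), grouping with parentheses, and the occurrence indicators ? (optional), * (zero or more) and + (one or more).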

DTD - attribute definition

Describes (data) type and/or implicit attribute values for the respective element.

<!ATTLIST element-name attribute-name attribute-value-type implicit-value>

DTD - definition of attribute value type

Allowed value types are as follows:

CDATA
NMTOKEN
NMTOKENS
ID
IDREF
IDREFS
ENTITY
ENTITIES

DTD - cardinality of attributes

Attributes may have obligatory presence:

#REQUIRED

attribute is required

#IMPLIED

attribute is optional

#FIXED "fixed-value"

attribute may be omitted, but if it is present it must have the value fixed-value; if it is omitted, the parser supplies fixed-value

DTD - implicit attribute value

An attribute (including an optional one) might have an implicit (default) value: the attribute is then optional, and if it is not present, the implicit value is used instead.
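The effect of implicit and fixed values can be observed from a program. The following is a minimal sketch, assuming the JDK's built-in DOM parser; the class name DefaultAttributeSketch and the doc/lang/version/id names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class DefaultAttributeSketch {
    // Hypothetical document: no attribute is written on <doc>,
    // all values come from the ATTLIST declaration.
    static final String XML =
        "<!DOCTYPE doc [\n"
      + "  <!ELEMENT doc (#PCDATA)>\n"
      + "  <!ATTLIST doc lang    CDATA \"en\"\n"
      + "                version CDATA #FIXED \"1.0\"\n"
      + "                id      ID    #IMPLIED>\n"
      + "]>\n"
      + "<doc>text</doc>";

    /** Returns the value of the named attribute of the root element. */
    static String attribute(String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            // getAttribute returns "" when the attribute is absent.
            return doc.getDocumentElement().getAttribute(name);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("lang    = " + attribute("lang"));    // implicit value
        System.out.println("version = " + attribute("version")); // fixed value
        System.out.println("id      = " + attribute("id"));      // #IMPLIED, absent
    }
}
```

Note that the parser does not need to validate to supply these values; declarations in the internal subset are read even by a non-validating parser.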

Physical Structure (Entities)

Entity - declaration and usage

We distinguish:

General entities

parsed

files with a (well formed) markup,

unparsed

e.g. binary files,

character entities

e.g. &gt; refers to a character entity.

Parametric entities
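A minimal sketch of how general entity references look to an application, assuming the JDK's built-in DOM parser with its default settings. The class name EntitySketch and the entity name greeting are made up for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class EntitySketch {
    // Hypothetical document using an internal parsed entity (&greeting;),
    // a predefined entity (&lt;) and a character reference (&#62;).
    static final String XML =
        "<!DOCTYPE p [\n"
      + "  <!ENTITY greeting \"Hello, world!\">\n"
      + "]>\n"
      + "<p>&greeting; 1 &lt; 2 and 3 &#62; 2</p>";

    /** Returns the text content of the root element after entity expansion. */
    static String text() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The references are replaced by their replacement text.
        System.out.println(text());
    }
}
```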

XML Base

XML Base - example

Example from XML Base specification http://www.w3.org/TR/xmlbase/

<?xml version="1.0"?>
<e1 xml:base="http://example.org/wine/">
  <e2 xml:base="rosé"/>
</e1>

In the example above, the base URI of element e2 should be returned as "http://example.org/wine/rosé".
(Note the use of the reserved prefix xml.)
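The resolution step itself can be sketched with java.net.URI. This only illustrates how a relative xml:base value is combined with the base URI inherited from the parent; it is not a full XML Base processor, and the class name XmlBaseSketch is hypothetical.

```java
import java.net.URI;

class XmlBaseSketch {
    /** Resolves a relative xml:base value against the inherited base URI. */
    static String resolve(String inheritedBase, String xmlBaseValue) {
        return URI.create(inheritedBase).resolve(xmlBaseValue).toString();
    }

    public static void main(String[] args) {
        // e1 declares the base, e2 declares a relative xml:base ("ros\u00e9" is rosé).
        String e1 = "http://example.org/wine/";
        String e2 = resolve(e1, "ros\u00e9");
        System.out.println(e2);
    }
}
```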

XML Namespaces

XML Namespaces 1.0

W3C Recommendation, currently Namespaces in XML 1.0 (Third Edition) W3C Recommendation 8 Dec 2009: http://www.w3.org/TR/REC-xml-names

Namespaces in XML 1.1

W3C Recommendation (http://www.w3.org/TR/xml-names11/) (Second Edition) 16 August 2006. Andrew Layman, Richard Tobin, Tim Bray, Dave Hollander

XML Namespaces

Prefixes and Equivalence of NS

Default NS – example

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
   <body>
      <h1>Hurááááá</h1>
   </body>
</html>

Explicit (prefixed) NS – example

<xhtml:html xmlns:xhtml="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
   <xhtml:body>
      <xhtml:h1>Huráááááá</xhtml:h1>
   </xhtml:body>
</xhtml:html>
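A minimal sketch, assuming the JDK's DOM parser in namespace-aware mode, showing that a default-namespace document and a prefixed one (shortened versions of the two kinds of examples above) yield the same namespace name and local name for the root element. The class name NamespaceEquivalenceSketch is hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class NamespaceEquivalenceSketch {
    /** Returns the root element name in the form {namespace-URI}local-name. */
    static String expandedName(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default in JAXP
            Element root = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            return "{" + root.getNamespaceURI() + "}" + root.getLocalName();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String byDefault = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body/></html>";
        String byPrefix = "<xhtml:html xmlns:xhtml=\"http://www.w3.org/1999/xhtml\">"
                        + "<xhtml:body/></xhtml:html>";
        // Both lines print the same expanded name; the prefix is not part of it.
        System.out.println(expandedName(byDefault));
        System.out.println(expandedName(byPrefix));
    }
}
```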

Issues related to NS

XML Information Set

XML Infoset - structure

Canonical Form

Canonical Form - principles

Main principles for constructing the canonical form of an XML document:

Canonical Form - principles (contd)

Issues with Canonical Form

Certain information loss (mostly info from DTD):

API for XML Processing

XML APIs Fundamental Types

Tree-based API

Programming Language Specific Models

Event-based API

Event Examples

SAX - Document Analysis Example

<?xml version="1.0"?>
<doc>
  <para>Hello, world!</para>
  <!-- that’s all folks -->
  <hr/>
</doc>

The document above generates the following events:

start document
start element: doc
list of attributes: empty
start element: para
list of attributes: empty
characters: Hello, world!
end element: para
comment: that’s all folks
start element: hr
end element: hr
end element: doc
end document
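A minimal sketch of a handler that records such events, assuming the SAX parser bundled with the JDK. Comments are not delivered through the basic ContentHandler; the sketch registers the optional LexicalHandler extension for them. Whitespace-only character events are dropped to keep the listing short, and the class name SaxEventsSketch is hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.ext.DefaultHandler2;

class SaxEventsSketch {
    /** Parses the document and returns a readable list of the SAX events. */
    static List<String> events(String xml) {
        List<String> events = new ArrayList<>();
        // DefaultHandler2 implements both ContentHandler and LexicalHandler.
        DefaultHandler2 handler = new DefaultHandler2() {
            @Override public void startDocument() { events.add("start document"); }
            @Override public void endDocument() { events.add("end document"); }
            @Override public void startElement(String uri, String local,
                                               String qName, Attributes atts) {
                events.add("start element: " + qName
                        + " (" + atts.getLength() + " attributes)");
            }
            @Override public void endElement(String uri, String local, String qName) {
                events.add("end element: " + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) events.add("characters: " + text);
            }
            @Override public void comment(char[] ch, int start, int length) {
                events.add("comment: " + new String(ch, start, length).trim());
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            // Comments are reported only to a registered LexicalHandler.
            parser.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
            parser.parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return events;
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?>\n<doc>\n  <para>Hello, world!</para>\n"
                   + "  <!-- that's all folks -->\n  <hr/>\n</doc>";
        events(xml).forEach(System.out::println);
    }
}
```

A parser is free to deliver one run of text as several characters events, so an application should not rely on receiving the text in a single call.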

When to use event-based API?

Optional SAX Parser Features

SAX filters

Additional SAX References

Pull-based APIs

Streaming API for XML (StAX)

StAX - an Iterator Example

StAX - source XML document

<?xml version="1.0" encoding="UTF-8"?>
<BookCatalogue xmlns="http://www.publishing.org">
<Book>
    <Title>Yogasana Vijnana: the Science of Yoga</Title>
    <Author>Dhirendra Brahmachari</Author>
    <Date>1966</Date>
    <ISBN>81-40-34319-4</ISBN>
    <Publisher>Dhirendra Yoga Publications</Publisher>
    <Cost currency="INR">11.50</Cost>
</Book>
<Book>
    <Title>The First and Last Freedom</Title>
    <Author>J. Krishnamurti</Author>
    <Date>1954</Date>
    <ISBN>0-06-064831-7</ISBN>
    <Publisher>Harper &amp; Row</Publisher>
    <Cost currency="USD">2.95</Cost>
</Book>
</BookCatalogue>

In this example, the client application pulls the next event in the XML stream by calling the next method on the parser; for example:

// Fragment: xmlif (an XMLInputFactory), count, filename and the
// print* helper methods are defined elsewhere in the application.
try {
    for (int i = 0; i < count; i++) {
        // Pass the file name; all relative entity references
        // will be resolved against it as the base URI.
        XMLStreamReader xmlr = xmlif.createXMLStreamReader(filename,
                                   new FileInputStream(filename));
        // When the XMLStreamReader is created,
        // it is positioned at the START_DOCUMENT event.
        int eventType = xmlr.getEventType();
        printEventType(eventType);
        printStartDocument(xmlr);
        // Check whether there are more events in the input stream.
        while (xmlr.hasNext()) {
            eventType = xmlr.next();
            printEventType(eventType);
            // These functions print the information
            // about the particular event by calling
            // the relevant function.
            printStartElement(xmlr);
            printEndElement(xmlr);
            printText(xmlr);
            printPIData(xmlr);
            printComment(xmlr);
        }
    }
} catch (XMLStreamException | IOException e) {
    // Malformed input or an unreadable file ends up here.
    e.printStackTrace();
}
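For comparison, a self-contained sketch of the same pull style, assuming the StAX implementation bundled with the JDK. It pulls events from a shortened version of the book catalogue and collects the text of every Title element; the class name StaxTitlesSketch is hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxTitlesSketch {
    /** Returns the text of all Title elements, in document order. */
    static List<String> titles(String xml) {
        List<String> titles = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            // The application, not the parser, decides when to advance.
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && "Title".equals(reader.getLocalName())) {
                    // Reads the text and moves to the matching end tag.
                    titles.add(reader.getElementText());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return titles;
    }

    public static void main(String[] args) {
        String xml = "<BookCatalogue xmlns=\"http://www.publishing.org\">"
                   + "<Book><Title>Yogasana Vijnana: the Science of Yoga</Title>"
                   + "<Date>1966</Date></Book>"
                   + "<Book><Title>The First and Last Freedom</Title>"
                   + "<Date>1954</Date></Book>"
                   + "</BookCatalogue>";
        titles(xml).forEach(System.out::println);
    }
}
```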

Document Object Model (DOM)

HTML Documents Specific DOM

DOM references

DOM Implementation

Using DOM in Java

What will we need often?

Most often used interfaces are:

Example 1 - creating DOM tree from file

import java.io.IOException;
import java.net.URL;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;
public class Uloha1 {
        /**
        * Constructor creating new instance of Uloha1 class by reading XML document
        * on the given URL.
        */
        private Uloha1(URL url) throws SAXException, ParserConfigurationException, IOException {
                // We create new instance of factory class
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                // We get new instance of DocumentBuilder using the factory class.
                DocumentBuilder builder = factory.newDocumentBuilder();
                // We utilize the DocumentBuilder to process an XML document
                // and we get document model in form of W3C DOM
                Document doc = builder.parse(url.toString());
        }
}

Example 2 - DOM tree modification

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
public class Uloha1 {
        Document doc;
        /**
        * Method for a salary modification. If the person's salary is less than
        * <code>minimum</code>, the salary will be increased to
        * <code>minimum</code>.
        * No action is performed for the rest of the persons.
        */
        public void adjustSalary(double minimum) {
        // get the list of salaries
                NodeList salaries = doc.getElementsByTagName("salary");
                for (int i = 0; i < salaries.getLength(); i++) {
                        // get the salary element
                        Element salaryElement = (Element) salaries.item(i);
                        // get payment
                        double salary = Double.parseDouble(salaryElement.getTextContent());
                        if (salary < minimum) {
                                // modify the text node/content of element
                                salaryElement.setTextContent(String.valueOf(minimum));
                        }
                }
        }
}

Example 3 - storing a DOM tree into an XML file

Example of a method storing a DOM tree into a file (see Homework 1). The procedure uses a transformation that we have not covered yet; let us use it as a black box.

import java.io.File;
import java.io.IOException;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
public class Uloha1 {
   Document doc;
   /**
    * Stores the DOM tree held in <code>doc</code> into the given file.
    * A transformer created without a stylesheet copies its input
    * to the output unchanged.
    */
   public void serializetoXML(File output) throws IOException,
         TransformerException {
      // We create a new instance of the factory class.
      TransformerFactory factory = TransformerFactory.newInstance();
      Transformer transformer = factory.newTransformer();
      // The input is the document held in memory.
      DOMSource source = new DOMSource(doc);
      // The transformation output is the output file.
      StreamResult result = new StreamResult(output);
      // Let's run the transformation; transform() may throw TransformerException.
      transformer.transform(source, result);
   }
}

Alternative tree-based models

XML Object Model (XOM)

DOM4J - practically usable tree-based model

Tree and event-based access combinations

Events → tree
- Event monitoring lets us skip or filter out the "uninteresting" parts of a document, and then
- build an in-memory tree only from the "interesting" part and process just that part.

Tree → events
- We build the entire document tree (and process it), and
- then we walk the tree and generate events as if we were reading the XML file.
- This allows easy integration of both processing types in a single application.
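The tree → events direction can be sketched with the standard JAXP transformation API: a transformer created without a stylesheet walks a DOM tree and reports it as SAX events to any ContentHandler. This is one possible approach, assuming the JDK's built-in implementations; the class name TreeToEventsSketch is hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.sax.SAXResult;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class TreeToEventsSketch {
    /** Builds a DOM tree, then replays it as SAX events and records them. */
    static List<String> replay(String xml) {
        List<String> events = new ArrayList<>();
        try {
            // Step 1: build the whole tree in memory.
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // Step 2: an ordinary SAX handler, unaware that the source is a tree.
            DefaultHandler handler = new DefaultHandler() {
                @Override public void startElement(String uri, String local,
                                                   String qName, Attributes atts) {
                    events.add("start: " + qName);
                }
                @Override public void endElement(String uri, String local, String qName) {
                    events.add("end: " + qName);
                }
            };

            // Step 3: the transformer walks the tree and fires the events.
            TransformerFactory.newInstance().newTransformer()
                    .transform(new DOMSource(doc), new SAXResult(handler));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return events;
    }

    public static void main(String[] args) {
        replay("<doc><para>Hello, world!</para><hr/></doc>")
                .forEach(System.out::println);
    }
}
```

The same handler class could also be passed directly to a SAX parser, which is what makes it easy to mix both processing types in one application.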


Virtual object models