
Practical XML: Parsing

So you've built some XML, now what do you do? Well, parse it, of course! After all, it doesn't do you much good as a bunch of bytes on a disk.

In my experience, a DOM document is the most usable form for parsed XML, because it can be accessed multiple times once parsed. By comparison, with a SAX parser you have to know exactly what you're looking for at the time you parse. Useful if you're unmarshalling JAXB objects or running an XSLT transform, not so useful if you're exploring a data structure. The downside of DOM, of course, is that it's all in memory, and the DOM implementation adds quite a bit to the memory footprint of the data.

Basic Parsing

Let's dive right into code:

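import java.io.Reader;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

Reader xml = new StringReader("<foo><bar>baz</bar></foo>");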

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document dom = db.parse(new InputSource(xml));

System.out.println("root element name = " + dom.getDocumentElement().getNodeName());

And that's all you need to parse a simple XML string. However, chances are good that you're not parsing simple literal strings, so read on …

Input Sources and Encodings

The XML specification requires that an XML document either have a prologue that specifies its encoding, or be encoded in UTF-8 (or in UTF-16 with a byte order mark). But in this example I used a Java String, which is UTF-16 encoded, without a prologue or byte order mark. So how did it get parsed?

The answer is that the parser did not read the string directly. Instead it read from an InputSource, which knew that the source data was encoded as UTF-16 because it came from a Reader. In the real world, you're probably not parsing literal strings. Instead, you're parsing the contents of a file or socket. In that case, you should always construct your InputSource using an InputStream, and let the parser determine the encoding on its own — with one exception.

If you get XML from someone who doesn't know the rules, it may not be UTF-8 encoded. For example, XML generated using simple string output in a Windows environment will probably be encoded in windows-1252. Your first response to such data should be to go to the person responsible and say “fix it.” If that's impossible, a work-around is to use a Reader, and specify the correct encoding when you construct it.
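Here is a minimal sketch of the two approaches, assuming the data arrives as an InputStream. The class and method names (EncodingExample, parseBytes, parseWithCharset) are my own inventions for illustration, and the in-memory byte arrays stand in for a real file or socket.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class EncodingExample
{
    /** Normal case: hand the parser raw bytes and let it determine the encoding. */
    static Document parseBytes(InputStream in) throws Exception
    {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return db.parse(new InputSource(in));
    }

    /** Work-around: decode the bytes ourselves, using an encoding that we were told about. */
    static Document parseWithCharset(InputStream in, Charset charset) throws Exception
    {
        Reader reader = new InputStreamReader(in, charset);
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return db.parse(new InputSource(reader));
    }

    public static void main(String[] argv) throws Exception
    {
        String xml = "<foo>caf\u00e9</foo>";

        // no prologue, so the parser assumes UTF-8, which is what these bytes are
        byte[] utf8 = xml.getBytes(StandardCharsets.UTF_8);
        System.out.println(parseBytes(new ByteArrayInputStream(utf8))
                           .getDocumentElement().getTextContent());

        // the same text as a Windows program might write it: we must supply the encoding
        Charset win1252 = Charset.forName("windows-1252");
        byte[] legacy = xml.getBytes(win1252);
        System.out.println(parseWithCharset(new ByteArrayInputStream(legacy), win1252)
                           .getDocumentElement().getTextContent());
    }
}
```

Passing the windows-1252 bytes to parseBytes() would fail, because the accented character is not a legal UTF-8 sequence.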

Namespaces

Because XML is a simple text format, the meaning of each element is defined by the program writing or reading that element. Normally, this isn't an issue, especially if the XML is both produced and processed within the same organization. A problem occurs when the program has to process documents from multiple sources, which may apply different meaning to elements with the same name: an <invoice> element from one vendor will be very different from the same-named element from another. Even within an organization, XML data formats can undergo revision, and you may need to handle “version 1” data very differently than “version 2.”

Namespaces are one part of the solution to that problem: elements retain short, readable names like “invoice,” but also have an associated namespace URI like http://example.com/receivablesV1. While namespaces have their own quirks, most of them turn up when you're using parsed data, not while parsing. The parser recognizes the xmlns attributes and applies them to elements appropriately.

Except for one small problem: the Namespace spec was introduced in 1999, while the DOM level 1 spec was released in 1998. The JDK requires its parser implementations to provide backwards compatibility, so you must explicitly tell the implementation to produce namespace-aware parsers:

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);
DocumentBuilder db = dbf.newDocumentBuilder();

Note that you set this flag on the factory, not on the parser itself. I recommend that you always set it, as there's no benefit to not doing so (in legacy code, you'll often find this step omitted, but be careful about changing it: there may be XPath or XSLT transforms that use the data, and they'll have to be changed as well).
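To see what the flag changes, here is a small sketch that parses the same document both ways and prints what the DOM reports for the root element. The class name and the sample document are assumptions made up for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class NamespaceExample
{
    static final String XML =
        "<r:invoice xmlns:r='http://example.com/receivablesV1'/>";

    /** Parses the sample document and returns its root element. */
    static Element parseRoot(boolean namespaceAware) throws Exception
    {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(namespaceAware);
        return dbf.newDocumentBuilder()
                  .parse(new InputSource(new StringReader(XML)))
                  .getDocumentElement();
    }

    public static void main(String[] argv) throws Exception
    {
        for (boolean aware : new boolean[] { false, true })
        {
            Element root = parseRoot(aware);
            System.out.println("namespaceAware = " + aware
                               + ", nodeName = " + root.getNodeName()
                               + ", namespaceURI = " + root.getNamespaceURI()
                               + ", localName = " + root.getLocalName());
        }
    }
}
```

Without the flag, the element has no namespace URI and the prefix is simply part of its name, which is why namespace-aware XPath expressions stop matching.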

Validation

There are two terms that are applied to XML documents, that sound the same but have very different meanings: “well-formed” and “valid.” A document is well-formed if it can be parsed by a parser: in other words, all the opening elements have corresponding closing elements, text content has been properly escaped, the encoding is correct, and so on. A “valid” document, by comparison, is one where the structure corresponds to some specification. A document may be well-formed (and usable) but not valid.

There are multiple ways to validate input, and this article will look at two of them: Document Type Definitions (DTD) and XML Schema (XSD). A third option is Relax NG, which tries to find a middle ground between DTD's lack of expressiveness and XSD's Byzantine structure. It is supported by the JDK's parser, but I haven't used it and cannot in good conscience talk about how to use it.

Before I continue, I want to add a third term to describe XML documents: “correct.” A validator can only check the existence, ordering, and general content of an XML file. Whether that content is actually usable by your application is another question. I once saw an attempt to use XSD to validate relationships in a complex object graph; the developer gave up after creating a 60-page schema document. Let validation do what it can, but ultimately your program must explicitly verify that an XML file contains the correct data.

DTD Validation

The Document Type Definition is part of the XML specification. A DTD describes the organization and content of an XML document in a form similar to Backus-Naur notation: a tree structure in which each element specifies the elements that it may contain (potentially none), and the order in which they must appear.

A full description of the DTD is beyond the scope of this article but the following example, for a music collection, should give its flavor. There's an element named “<music>”, which can hold zero or more children named “<artist>”; these in turn may hold one or more “<album>” elements; the “<album>” element contains text. Finally, the “<artist>” element requires the attribute “name”, which contains character data.

<!ELEMENT music (artist*)>
<!ELEMENT artist (album+)>
<!ELEMENT album (#PCDATA)>
<!ATTLIST artist name CDATA #REQUIRED>

An XML document identifies its DTD in a “DOCTYPE” declaration, which must appear before the first element in the document. The DOCTYPE may specify an embedded DTD, as in the example below, or it may reference an external DTD, as we'll see later. The DOCTYPE itself is not the DTD.

So, here's our music collection with DOCTYPE and embedded DTD:

<!DOCTYPE music [
<!ELEMENT music (artist*)>
<!ELEMENT artist (album+)>
<!ELEMENT album (#PCDATA)>
<!ATTLIST artist name CDATA #REQUIRED>
]>
<music>
    <artist name="Anderson, Laurie">
        <album>Big Science</album>
        <album>Strange Angels</album>
    </artist>
    <artist name="Fine Young Cannibals">
        <album>The Raw &amp; The Cooked</album>
    </artist>
</music>

To tell the parser to pay attention to this DOCTYPE, we need to call setValidating(true) on the factory. All parsers created by that factory will look for a DOCTYPE in their source document, and will complain if they don't find one.

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);
dbf.setValidating(true);
DocumentBuilder db = dbf.newDocumentBuilder();

Error Handlers

So what happens when the document content doesn't match the DTD? By default, the parser logs the first few errors to the console, along with a message suggesting that you use your own ErrorHandler. So, taking that advice, here's an error handler that will write all messages (although in the real world that's not too useful: generally the first few messages will tell you why the parser got off the rails).

ErrorHandler myErrorHandler = new ErrorHandler()
{
    public void fatalError(SAXParseException exception)
    throws SAXException
    {
        System.err.println("fatalError: " + exception);
    }
    
    public void error(SAXParseException exception)
    throws SAXException
    {
        System.err.println("error: " + exception);
    }

    public void warning(SAXParseException exception)
    throws SAXException
    {
        System.err.println("warning: " + exception);
    }
};

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);
dbf.setValidating(true);
DocumentBuilder db = dbf.newDocumentBuilder();
db.setErrorHandler(myErrorHandler);

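The first thing that you probably noticed is that there's not just one error method, but three: warning(), for conditions that aren't errors as far as the XML specification is concerned; error(), for recoverable errors such as validity violations; and fatalError(), for non-recoverable errors such as a document that isn't well-formed.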

Each of the handler methods is passed a SAXParseException, which gives you methods to find the location of the error within your XML: getLineNumber() and getColumnNumber(). Be aware that these methods return the position where the parser decided it had a problem; the actual error is at or before the reported position. Note also that column numbers start at 1, not 0.
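Putting the pieces together, here is a sketch that validates the music collection against its embedded DTD and collects the reported problems in a list rather than printing them. The class name, the validate() method, and the message format are assumptions of mine, not anything defined by the JDK.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class DtdValidationExample
{
    static final String DOCTYPE =
          "<!DOCTYPE music [\n"
        + "<!ELEMENT music (artist*)>\n"
        + "<!ELEMENT artist (album+)>\n"
        + "<!ELEMENT album (#PCDATA)>\n"
        + "<!ATTLIST artist name CDATA #REQUIRED>\n"
        + "]>\n";

    /** Parses with DTD validation enabled, returning one message per reported problem. */
    static List<String> validate(String xml) throws Exception
    {
        final List<String> problems = new ArrayList<String>();
        ErrorHandler collector = new ErrorHandler()
        {
            public void warning(SAXParseException e)    { record("warning", e); }
            public void error(SAXParseException e)      { record("error", e); }
            public void fatalError(SAXParseException e) { record("fatalError", e); }

            private void record(String level, SAXParseException e)
            {
                problems.add(level + " at line " + e.getLineNumber()
                             + ", column " + e.getColumnNumber()
                             + ": " + e.getMessage());
            }
        };

        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        dbf.setValidating(true);
        DocumentBuilder db = dbf.newDocumentBuilder();
        db.setErrorHandler(collector);

        // a document that isn't well-formed will still cause parse() to throw
        db.parse(new InputSource(new StringReader(xml)));
        return problems;
    }

    public static void main(String[] argv) throws Exception
    {
        // the artist is missing its required "name" attribute
        String xml = DOCTYPE + "<music><artist><album>Big Science</album></artist></music>";
        for (String problem : validate(xml))
            System.out.println(problem);
    }
}
```

Because validity violations are reported through error() and are recoverable, the parser keeps going and you still get a DOM back; it's up to the program to decide whether to use it.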

More about the DOCTYPE

In the real world an XML document isn't going to contain its own DTD. Instead, it uses a DOCTYPE that references an external DTD, via the “SYSTEM” or “PUBLIC” keywords. The J2EE 1.3 web application deployment descriptor, web.xml, is an example that uses PUBLIC:

<?xml version="1.0"?>
<!DOCTYPE web-app
          PUBLIC "-//Sun Microsystems, Inc.//DTD Web Application 2.3//EN"
                 "http://java.sun.com/dtd/web-app_2_3.dtd">
<web-app>
    ...

The SYSTEM and PUBLIC keywords both indicate that the DOCTYPE references an external DTD; the one you pick depends on how much information you want to provide. Both require a URI: the system identifier that uniquely identifies the DTD. This is normally an HTTP URL that points to the DTD (although it can be any unique URI), and by default the parser will attempt to read that document.

The difference between SYSTEM and PUBLIC is that the latter provides an additional piece of data, the application-specific public identifier. In practice, the public identifier is superfluous: it provides a second unique identifier for the DTD, and one is enough (note that Tim Bray, one of the original XML spec editors, shows a strong preference for public identifiers and wishes that system identifiers were optional). What this means is that you could write your web.xml using just the SYSTEM identifier, and a validating parser will accept it.

<?xml version="1.0"?>
<!DOCTYPE web-app SYSTEM "http://java.sun.com/dtd/web-app_2_3.dtd">
<web-app>
    ...
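Once a document has been parsed, both identifiers are available from the DOM's DocumentType node. The following is a sketch only: the class name is made up, and it uses an EntityResolver (covered in the next section) that hands back an empty DTD so that the parser doesn't try to fetch anything over the network.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DocumentType;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

class DoctypeExample
{
    static final String XML =
          "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE web-app\n"
        + "          PUBLIC \"-//Sun Microsystems, Inc.//DTD Web Application 2.3//EN\"\n"
        + "                 \"http://java.sun.com/dtd/web-app_2_3.dtd\">\n"
        + "<web-app/>";

    /** Parses without validation and returns the node that describes the DOCTYPE. */
    static DocumentType parseDoctype(String xml) throws Exception
    {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        DocumentBuilder db = dbf.newDocumentBuilder();

        // substitute an empty DTD for whatever the document references
        db.setEntityResolver(new EntityResolver()
        {
            public InputSource resolveEntity(String publicId, String systemId)
            {
                return new InputSource(new StringReader(""));
            }
        });

        return db.parse(new InputSource(new StringReader(xml))).getDoctype();
    }

    public static void main(String[] argv) throws Exception
    {
        DocumentType doctype = parseDoctype(XML);
        System.out.println("public ID = " + doctype.getPublicId());
        System.out.println("system ID = " + doctype.getSystemId());
    }
}
```

For a DOCTYPE that uses SYSTEM, getPublicId() has nothing to report.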

Entity Resolvers

The nice thing about system identifiers is that the parser will try to read them automatically. The bad thing about system identifiers is that the parser will try to resolve them automatically, and will fail if, for example, you're not connected to the Internet. Or if the DOCTYPE uses a URI that isn't a URL, such as “urn:uuid:d3964f17-0526-4f46-bd3a-70e03f70931b”. To handle these situations, you attach an EntityResolver to your parser:

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);
dbf.setValidating(true);
DocumentBuilder db = dbf.newDocumentBuilder();
db.setEntityResolver(new EntityResolver()
{
    public InputSource resolveEntity(String publicId, String systemId)
        throws SAXException, IOException
    {
        // "dtd" is a String, defined elsewhere, that holds the text of the DTD
        return new InputSource(new StringReader(dtd));
    }
});

This example is about as simple as you can get: whatever the parser asks for, it gets back a DTD defined by the application. If your application must support multiple DTDs, it can use the passed system and public identifiers to pick the appropriate one.

When writing an entity resolver, always remember that the system identifier is a URI, and the parser is allowed to manipulate the literal value given in the document. For example, the parser will look at the following document, assume that “foo” is a file on the local computer, and convert it to a “file:” URL before calling the resolver. If your resolver expects the exact value “foo”, it won't match the request.

<!DOCTYPE music SYSTEM "foo">
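One way to cope with that rewriting is to match on the tail end of the system identifier rather than the whole thing. The sketch below takes that approach; the class name, the DTDS lookup table, and the decision to throw for unknown identifiers (rather than return null and let the parser go to the network) are all my own assumptions.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class EntityResolverExample
{
    static final String MUSIC_DTD =
          "<!ELEMENT music (artist*)>\n"
        + "<!ELEMENT artist (album+)>\n"
        + "<!ELEMENT album (#PCDATA)>\n"
        + "<!ATTLIST artist name CDATA #REQUIRED>\n";

    /** Maps the tail end of a system identifier to the text of a DTD. */
    static final Map<String,String> DTDS = new HashMap<String,String>();
    static
    {
        DTDS.put("/foo", MUSIC_DTD);
    }

    static Document parse(String xml) throws Exception
    {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        dbf.setValidating(true);
        DocumentBuilder db = dbf.newDocumentBuilder();

        db.setEntityResolver(new EntityResolver()
        {
            public InputSource resolveEntity(String publicId, String systemId)
            throws SAXException
            {
                for (Map.Entry<String,String> entry : DTDS.entrySet())
                {
                    // the parser may have turned "foo" into "file:///some/dir/foo"
                    if ((systemId != null) && systemId.endsWith(entry.getKey()))
                        return new InputSource(new StringReader(entry.getValue()));
                }
                throw new SAXException("no DTD registered for: " + systemId);
            }
        });

        // treat validation errors as failures, so the caller knows the DTD was applied
        db.setErrorHandler(new ErrorHandler()
        {
            public void warning(SAXParseException e)    { /* ignored */ }
            public void error(SAXParseException e)      throws SAXException { throw e; }
            public void fatalError(SAXParseException e) throws SAXException { throw e; }
        });

        return db.parse(new InputSource(new StringReader(xml)));
    }
}
```

A call such as EntityResolverExample.parse("<!DOCTYPE music SYSTEM \"foo\"><music/>") should then validate against the in-memory DTD, regardless of the directory that the parser decides “foo” lives in.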

Pitfalls of DTD Validation

Although DTD validation is simple, there are several holes that you can stumble into. Foremost is that the document must have a DOCTYPE, otherwise the parser will have no idea how to validate it. While it's possible to insert a DOCTYPE where missing, doing so can be extremely difficult: consider the case of parsing the body of an HTTP POST, where you get the document as an InputStream from the HttpServletRequest.

A second pitfall is that, even when a DOCTYPE is present, you have no control over the DTD that it specifies. If the document uses an internal DTD, the parser won't care that you're providing a different DTD via your entity resolver.

And finally, DTD validation will only look at the containment hierarchy, not the actual data. You can specify that element “<order>” must contain an element “<orderId>”, and that the latter must contain text data, but you can't specify that that text is a ten-digit number.

Schema Validation

This is where XML Schema comes in: it associates a type definition with each of the document's elements and attributes.

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <xsd:element name="purchaseOrder" type="PurchaseOrderType"/>

  <xsd:complexType name="PurchaseOrderType">
    <xsd:sequence>
      <xsd:element name="shipTo" type="USAddress"/>
      <xsd:element name="billTo" type="USAddress"/>
      <xsd:element name="items"  type="Items"/>
    </xsd:sequence>
    <xsd:attribute name="orderDate" type="xsd:date"/>
  </xsd:complexType>

  <xsd:complexType name="USAddress">
    <xsd:sequence>
      <xsd:element name="name"   type="xsd:string"/>
      <xsd:element name="street" type="xsd:string"/>
      <xsd:element name="city"   type="xsd:string"/>
      <xsd:element name="state"  type="xsd:string"/>
      <xsd:element name="zip"    type="xsd:decimal"/>
    </xsd:sequence>
    <xsd:attribute name="country" type="xsd:NMTOKEN" fixed="US"/>
  </xsd:complexType>

  ...

The above example was adapted from the introductory schema in the W3C Schema Primer. It contains two type definitions: PurchaseOrderType and USAddress, along with a reference to a third type, Items. There's also a top-level “<xsd:element>” definition, which indicates that this schema may be used to validate documents with a root element named “<purchaseOrder>”.

Earlier I noted that the structure of a DTD was beyond the scope of this article. The structure of an XML schema is even more so: while basic schemas are simple, the details can be mind-boggling. The W3C documentation spans three relatively long documents.

On the other hand, parsing an XML document with schema validation is very simple: you compile the XSD document into a Schema object, and pass the compiled schema to the parser:

InputStream xml = // ...
InputStream xsd = // ...

SchemaFactory xsFact = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
Schema schema = xsFact.newSchema(new StreamSource(xsd));

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);
dbf.setValidating(false);
dbf.setSchema(schema);
DocumentBuilder db = dbf.newDocumentBuilder();

Document dom = db.parse(new InputSource(xml));

There are a few important points: first is that you configure the parser as non-validating. The term “valid,” when applied to XML documents, implies the use of a DTD. You could simply omit the call to DocumentBuilderFactory.setValidating(); by default it creates non-validating parsers.

Second is that you want your parser to be namespace aware. Although a simple schema may not require namespaces, you have no guarantee that an instance document won't use them. And if it does, and you haven't processed that document with a namespace-aware parser, then it will fail validation — or worse, incorrectly pass.

Finally, note that the schema instance is attached to the DocumentBuilderFactory: all parsers created by that factory will share the same schema. This is driven by practicality: since a Schema object is relatively expensive to construct, you want to reuse it. And since your application probably doesn't need to validate different documents against different schemas, it's simplest to attach it to the factory.
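As a complete illustration, here is a sketch that compiles a small schema once and uses it to check two documents. The schema is a made-up one (an order whose orderId must be exactly ten digits, the case that a DTD couldn't express), and the class and method names are likewise assumptions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class SchemaValidationExample
{
    static final String XSD =
          "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>"
        + "  <xsd:element name='order'>"
        + "    <xsd:complexType><xsd:sequence>"
        + "      <xsd:element name='orderId'>"
        + "        <xsd:simpleType>"
        + "          <xsd:restriction base='xsd:string'>"
        + "            <xsd:pattern value='[0-9]{10}'/>"
        + "          </xsd:restriction>"
        + "        </xsd:simpleType>"
        + "      </xsd:element>"
        + "    </xsd:sequence></xsd:complexType>"
        + "  </xsd:element>"
        + "</xsd:schema>";

    /** Compiles the schema; this is the expensive step, so do it once and reuse the result. */
    static Schema compile(String xsd) throws Exception
    {
        SchemaFactory xsFact = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        return xsFact.newSchema(new StreamSource(new StringReader(xsd)));
    }

    /** Parses with schema validation, returning the messages for any reported problems. */
    static List<String> validate(Schema schema, String xml) throws Exception
    {
        final List<String> problems = new ArrayList<String>();

        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        dbf.setValidating(false);       // "validating" means DTD validation
        dbf.setSchema(schema);
        DocumentBuilder db = dbf.newDocumentBuilder();
        db.setErrorHandler(new ErrorHandler()
        {
            public void warning(SAXParseException e)    { problems.add(e.getMessage()); }
            public void error(SAXParseException e)      { problems.add(e.getMessage()); }
            public void fatalError(SAXParseException e) { problems.add(e.getMessage()); }
        });

        db.parse(new InputSource(new StringReader(xml)));
        return problems;
    }

    public static void main(String[] argv) throws Exception
    {
        Schema schema = compile(XSD);
        System.out.println(validate(schema, "<order><orderId>1234567890</orderId></order>"));
        System.out.println(validate(schema, "<order><orderId>12345</orderId></order>"));
    }
}
```

The first document should produce an empty list, while the second should produce at least one message complaining about the pattern.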

Multiple schemas for a single document

Given that schema documents tend to be large and complex, you may want to modularize them: for example, maintain each class as its own XSD file. Or, you may be receiving data from multiple sources, and want to keep the schema for each source separate.

This appears to be easy to do: SchemaFactory.newSchema() has a variant that takes an array of source documents, and combines them into a single Schema instance. However, this method is documented to behave as if creating a new schema document with import directives for each of the source documents. As a result, you can't combine source files with the same target namespace (there is a bug for this, but reading the schema docs leads me to believe that it's intentional), and you must order the source documents so that definitions in one namespace are available when requested from another (I haven't found anything in the docs to justify this).

Provided that your schema documents are accessible via URL or filesystem, the best solution is to create a single top-level schema document, and use explicit import and include directives to reference your definitions. The Practical XML library also provides the method SchemaUtil.combineSchemas() to combine schema documents that are not loadable via URL.
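For the simple case, where each source document has its own target namespace and none depends on another, the array variant of newSchema() can be used directly. This is a sketch under exactly those assumptions; the namespaces, element names, and class name are invented for the example, and it doesn't attempt to show the ordering problem described above.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Source;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class MultiSchemaExample
{
    /** Builds a trivial schema: one string-valued root element in the given namespace. */
    static String schemaFor(String namespace, String rootName)
    {
        return "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'"
             + " targetNamespace='" + namespace + "'>"
             + "<xsd:element name='" + rootName + "' type='xsd:string'/>"
             + "</xsd:schema>";
    }

    /** Combines the passed schema documents into a single Schema object. */
    static Schema combine(String... xsds) throws SAXException
    {
        Source[] sources = new Source[xsds.length];
        for (int ii = 0 ; ii < xsds.length ; ii++)
            sources[ii] = new StreamSource(new StringReader(xsds[ii]));

        SchemaFactory xsFact = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        return xsFact.newSchema(sources);
    }

    /** Returns false if the document is invalid (or not well-formed). */
    static boolean isValid(Schema schema, String xml) throws Exception
    {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        dbf.setSchema(schema);
        DocumentBuilder db = dbf.newDocumentBuilder();
        db.setErrorHandler(new ErrorHandler()
        {
            public void warning(SAXParseException e)    { /* ignored */ }
            public void error(SAXParseException e)      throws SAXException { throw e; }
            public void fatalError(SAXParseException e) throws SAXException { throw e; }
        });

        try
        {
            db.parse(new InputSource(new StringReader(xml)));
            return true;
        }
        catch (SAXParseException e)
        {
            return false;
        }
    }

    public static void main(String[] argv) throws Exception
    {
        Schema schema = combine(schemaFor("urn:example:a", "thing"),
                                schemaFor("urn:example:b", "widget"));
        System.out.println(isValid(schema, "<thing xmlns='urn:example:a'>x</thing>"));
        System.out.println(isValid(schema, "<widget xmlns='urn:example:b'>y</widget>"));
    }
}
```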

Documents that specify their own Schema

The Schema specification allows an instance document to use the attributes xsi:schemaLocation and xsi:noNamespaceSchemaLocation to specify the schema to be used for validation (the xsi prefix refers to the Schema Instance namespace, “http://www.w3.org/2001/XMLSchema-instance”). The first attribute is used for documents with a specific target namespace, the second for documents without a namespace. The following example is from the W3C Schema Primer, redacted and reformatted to highlight attribute usage:

<purchaseReport
    xmlns="http://www.example.com/Report"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.example.com/Report http://www.example.com/Report.xsd"
    ...

Unlike DOCTYPE, the schemaLocation attribute is just a hint, and the parser is free to silently ignore it. And that's just what the JDK 1.4 and 1.5 parsers do: as far as they're concerned, it's just another attribute, used by the program, without special meaning to the parser. The JDK 1.6 parser does support this attribute, provided that you set a parser feature that's only documented on the Xerces website:

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);
dbf.setValidating(false);
dbf.setFeature("http://apache.org/xml/features/validation/schema", true);

If you try to set this feature on JDK 1.5, the factory will throw a ParserConfigurationException. On JDK 1.6, with the feature enabled, the parser will also ignore any schema set explicitly on the DocumentBuilderFactory, in preference to the document's requested location. Personally, I don't believe in letting a document control its own validation, so I would not use this feature even if it were universally supported.

Validating after Parsing

All examples so far have been validated at the time they are parsed. The Schema object also supports validation after parsing, by creating a Validator:

InputStream xml = // however you get the instance document
InputStream xsd = // however you get the schema definition

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);
dbf.setValidating(false);
DocumentBuilder db = dbf.newDocumentBuilder();
Document dom = db.parse(new InputSource(xml));

SchemaFactory xsFact = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
Schema schema = xsFact.newSchema(new StreamSource(xsd));
Validator validator = schema.newValidator();
validator.validate(new DOMSource(dom));

I don't like this approach. First off, there's no benefit in terms of performance: both the schema and instance document still have to be parsed, and the validator must still examine the entire instance document. Nor does it add convenience — instead, it adds more lines of code to maintain.

But the most important reason that I don't do this is: you lose source line numbers! Once the document has been parsed, it's a tree of DOM nodes. While the validator may tell you that node “<foo>” has a problem, there may be dozens — or hundreds — of nodes with that name.
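You can see the problem for yourself by attaching an ErrorHandler to the Validator and printing the location that it reports. In this sketch the schema, the class name, and the message format are my own assumptions; I would expect the reported line number to be a placeholder such as -1, but I've left that for you to confirm on your own JDK.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.w3c.dom.Document;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class PostParseValidationExample
{
    static final String XSD =
          "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>"
        + "<xsd:element name='foo' type='xsd:int'/>"
        + "</xsd:schema>";

    /** Parses without validation, then validates the resulting DOM tree. */
    static List<String> validate(String xsd, String xml) throws Exception
    {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        dbf.setValidating(false);
        DocumentBuilder db = dbf.newDocumentBuilder();
        Document dom = db.parse(new InputSource(new StringReader(xml)));

        SchemaFactory xsFact = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = xsFact.newSchema(new StreamSource(new StringReader(xsd)));
        Validator validator = schema.newValidator();

        final List<String> problems = new ArrayList<String>();
        validator.setErrorHandler(new ErrorHandler()
        {
            public void warning(SAXParseException e)    { record(e); }
            public void error(SAXParseException e)      { record(e); }
            public void fatalError(SAXParseException e) { record(e); }

            private void record(SAXParseException e)
            {
                // the validator is walking DOM nodes, so there is no source text to point at
                problems.add("line " + e.getLineNumber() + ": " + e.getMessage());
            }
        });

        validator.validate(new DOMSource(dom));
        return problems;
    }

    public static void main(String[] argv) throws Exception
    {
        // the document spans several lines, but the report can't tell you which one is bad
        for (String problem : validate(XSD, "<foo>\n\n\nabc</foo>"))
            System.out.println(problem);
    }
}
```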

A Few Last Things

Illegal Characters

One of the nastier surprises when parsing XML is realizing that you have illegal characters in your input. For some unknown reason, the XML 1.0 spec explicitly omitted most of the non-printing characters in the range 0x00 to 0x1F; the only allowed non-printing ASCII characters are tab, newline, and carriage return. Making this restriction even more incomprehensible, almost every other Unicode character is permitted (although control characters in the range 0x7F to 0x9F are “discouraged,” as are unassigned Unicode codepoints).

The result of this specification is that the following document cannot be parsed by the JDK's parser — or by any other parser that strictly conforms to the XML 1.0 specification.

<myData>
    This is a BEL: &#7;
</myData>

The solution is to parse the document as XML 1.1. This allows all ASCII control characters other than NUL (which is used to terminate C strings; prohibiting it was a matter of practicality). The one caveat is that all control characters other than tab, newline, and carriage return (including most of those in the range 0x7F to 0x9F) now must be specified as numeric character references such as “&#7;”; the parser will reject any that appear unescaped.

To switch the parser into “1.1 mode,” simply attach a prologue to the document:

<?xml version="1.1"?>
<myData>
    This is a BEL: &#7;
</myData>
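A quick way to convince yourself of the difference is to run both documents through the parser. This is a sketch; the class and method names are made up for the illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class Xml11Example
{
    /** Parses the passed document and returns the text content of its root element. */
    static String parseText(String xml) throws Exception
    {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        DocumentBuilder db = dbf.newDocumentBuilder();
        Document dom = db.parse(new InputSource(new StringReader(xml)));
        return dom.getDocumentElement().getTextContent();
    }

    public static void main(String[] argv) throws Exception
    {
        String body = "<myData>This is a BEL: &#7;</myData>";

        try
        {
            parseText(body);
            System.out.println("XML 1.0: parsed");
        }
        catch (SAXParseException e)
        {
            System.out.println("XML 1.0: rejected: " + e.getMessage());
        }

        // same content, but the prologue switches the parser to the 1.1 rules
        String text = parseText("<?xml version=\"1.1\"?>" + body);
        System.out.println("XML 1.1: last character = " + (int)text.charAt(text.length() - 1));
    }
}
```

Note that the prologue must be the very first thing in the document; there can't be any whitespace before it.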

The jaxp.debug System Property

Each of the JDK's XML factory classes has a rather involved procedure for finding the actual implementation class. Part of this process uses system properties, and you may someday find yourself in the position of tracking down an error because someone has thoughtlessly changed the value of one of these properties in a shared environment (it happened to me). To see exactly what classes are being loaded, you can set another system property: jaxp.debug. Once this property is set, the factory methods will start logging their operation:

JAXP: find factoryId =javax.xml.parsers.DocumentBuilderFactory
JAXP: loaded from fallback value: com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl
JAXP: created new instance of class com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl
using ClassLoader: sun.misc.Launcher$AppClassLoader@133056f

Of course, you should only set this while you're trying to track down a classloading problem. Leaving it set will cause people to come looking for you, to explain why their logs are filled with JAXP messages.

For More Information

The example code from this article is in a single class. I've also created an example of parsing a Schema with imports and includes.

The Practical XML library provides utilities for working with XML in many different ways. Of particular interest for this article are the ParseUtil and SchemaUtil classes.

The Annotated XML Specification is my preferred reference for XML 1.0. Tim Bray, one of the original spec editors, has added hyperlinked commentary and historical information to the source specification. Unfortunately, he has not done the same for XML 1.1.

As noted above, the official XML Schema documentation has three parts: Primer, Structures, and Datatypes. The first of these is the closest to a tutorial. The latter two are of interest primarily to language lawyers and those writing how-to articles.

Also as noted above, the Sun JDK uses the Xerces parser. This parser supports several configuration options that are not documented as part of the official Java API. Use only if you don't care about compatibility between different JREs, and also be aware that the JDK and Xerces development threads are not in lock-step, so some Xerces features may not apply to your JDK.

Copyright © Keith D Gregory, all rights reserved