Practical XML: Parsing

Originally published: 2009-01-04

Last updated: 2015-04-27

So you've built some XML, now what do you do? Well, parse it, of course! After all, it doesn't do you much good as a bunch of bytes on a disk.

In my experience, a DOM document is the most usable form for parsed XML, because it can be accessed multiple times once parsed. By comparison, with a SAX parser you have to know exactly what you're looking for at the time you parse. Useful if you're unmarshalling JAXB objects or extracting data from a large file, not so useful if you're exploring a data structure. The downside of DOM, of course, is that it's all in memory, and the DOM implementation adds quite a bit to the memory footprint of the data.

Basic Parsing

Let's dive right into code:

Reader xml = new StringReader("<foo><bar>baz</bar></foo>");

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();

DocumentBuilder db = dbf.newDocumentBuilder();

Document dom = db.parse(new InputSource(xml));

System.out.println("root element name = " + dom.getDocumentElement().getNodeName());

And that's all you need to parse a simple XML string. However, chances are good that you're not parsing simple literal strings, so read on …

DocumentBuilder and DocumentBuilderFactory

The DOM API is filled with design patterns, especially creational patterns: DocumentBuilderFactory is an Abstract Factory that creates DocumentBuilder instances, which are a Factory that creates Document nodes, which in turn are a factory for the other DOM nodes. You create the same factory objects when building a DOM from scratch; the sole change for parsing is that you call DocumentBuilder.parse() rather than newDocument().

I suppose that the reason for this complexity is flexibility, but it results in more mental effort for the developer. There are separate configuration settings for DocumentBuilder and DocumentBuilderFactory, and settings on the factory propagate to the builders that it produces. There are also some hidden gotchas in the way that DocumentBuilderFactory finds the DOM implementation (note that the org.w3c.dom package consists solely of interfaces), which can let a misbehaved program wreak havoc in a shared environment such as an app-server.
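To make the factory-to-builder propagation concrete, here is a minimal sketch. The class name FactoryPropagation and its helper method are mine, purely for illustration; the only API calls are the standard JAXP ones.

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

class FactoryPropagation
{
    /** Configures a fresh factory, then asks it for a builder. */
    static DocumentBuilder newBuilder(boolean namespaceAware)
    throws ParserConfigurationException
    {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(namespaceAware);

        // the builder inherits whatever was set on the factory that created it
        return dbf.newDocumentBuilder();
    }

    public static void main(String[] argv) throws Exception
    {
        System.out.println("flag left off: " + newBuilder(false).isNamespaceAware());
        System.out.println("flag set:      " + newBuilder(true).isNamespaceAware());
    }
}
```

Note that the builder only reports the setting; there is no corresponding setter on DocumentBuilder, so the factory is the one place to change it.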

Input Sources and Encodings

The XML specification requires that an XML document either have a prologue that specifies its encoding, or be encoded in UTF-8 or UTF-16. But in this example I used a Java String, which is UTF-16 encoded, without a prologue. So how did it get parsed?

The answer is that the parser did not read the string directly. Instead it read from an InputSource wrapped around a Reader, and a Reader hands the parser characters that have already been decoded, so there was no byte encoding left to detect. In the real world, you're probably not parsing literal strings. Instead, you're parsing the contents of a file or socket. In that case, you should always construct your InputSource using an InputStream, and let the parser determine the encoding on its own, with two exceptions.

The first is if the encoding was specified by some external mechanism, such as HTTP's Content-Type header. In my opinion, this is the only time that you should use a Reader. Unfortunately, sometimes the Content-Type doesn't match the actual content.

And that brings up the second case: if you get XML from someone who doesn't know the rules. For example, XML generated using simple string output in a Windows environment will probably be encoded in windows-1252. Your first response to such data should be to go to the person responsible and say “fix it.” If that's impossible, a work-around is to use a Reader and specify the correct encoding.
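The two approaches can be sketched side by side. This is an illustration under my own assumptions: the class and method names are hypothetical, and byte arrays stand in for the file or socket that you'd have in real code.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class EncodingExample
{
    private static DocumentBuilder newBuilder() throws Exception
    {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder();
    }

    /** Preferred: hand the parser raw bytes and let it work out the encoding. */
    static String textFromBytes(byte[] data) throws Exception
    {
        Document dom = newBuilder().parse(new InputSource(new ByteArrayInputStream(data)));
        return dom.getDocumentElement().getTextContent();
    }

    /** Work-around: the caller knows the encoding and decodes the bytes itself. */
    static String textFromBytes(byte[] data, String encoding) throws Exception
    {
        Reader reader = new InputStreamReader(new ByteArrayInputStream(data), encoding);
        Document dom = newBuilder().parse(new InputSource(reader));
        return dom.getDocumentElement().getTextContent();
    }

    public static void main(String[] argv) throws Exception
    {
        // a well-behaved document: the prologue declares its encoding
        byte[] declared = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><foo>caf\u00e9</foo>"
                          .getBytes("ISO-8859-1");
        System.out.println(textFromBytes(declared));

        // a badly-behaved document: windows-1252 bytes, no prologue
        byte[] undeclared = "<foo>caf\u00e9</foo>".getBytes("windows-1252");
        System.out.println(textFromBytes(undeclared, "windows-1252"));
    }
}
```

Passing the second byte array to the one-argument method would leave the parser to assume UTF-8, which is exactly the failure that the Reader work-around avoids.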

Namespaces

Unlike HTML, the meaning of each element in an XML document is defined by the program writing or reading that element. Normally, this isn't an issue, especially if the XML is both produced and processed within the same organization. A problem occurs when the program has to process documents from multiple sources, which may apply different meaning to elements with the same name: an <invoice> element from one vendor will be very different from the like-named element from another. Even within an organization, XML data formats can undergo revision, and you may need to handle “version 1” data differently than “version 2.”

Namespaces are one part of the solution to that problem: elements retain short, readable names like “invoice,” but also have an associated namespace URI like http://example.com/receivablesV1. While namespaces do have their own quirks, most of them turn up when you're using the parsed data, not while parsing. The parser recognizes the xmlns attributes and does the right thing.

Except for one small problem: the Namespace spec was introduced in 1999, while the DOM level 1 spec was released in 1998 and knew nothing of namespaces. The JDK's XML API predated namespaces, and due to backwards compatibility you must explicitly tell it that you want namespace-aware parsing:

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);

DocumentBuilder db = dbf.newDocumentBuilder();

Note that you set this flag on the factory, not the parser. I recommend always parsing with namespaces enabled, with one exception: in legacy code that uses XPath or XSLT. As I describe elsewhere, XPath has its own hoops with regard to namespaces. But for everything else (including new code using XPath), namespace-aware parsing should be your default.
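To see what the flag changes, the following sketch parses the same one-element document both ways and prints what the DOM reports. The class name and the sample document are my own inventions; the namespace URI is the one used earlier in this section.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class NamespaceExample
{
    static final String XML =
        "<inv:invoice xmlns:inv=\"http://example.com/receivablesV1\"/>";

    /** Parses the sample document and returns its root element. */
    static Element parseRoot(boolean namespaceAware) throws Exception
    {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(namespaceAware);
        return dbf.newDocumentBuilder()
                  .parse(new InputSource(new StringReader(XML)))
                  .getDocumentElement();
    }

    public static void main(String[] argv) throws Exception
    {
        for (boolean aware : new boolean[] { false, true })
        {
            Element root = parseRoot(aware);
            System.out.println("namespace aware = " + aware
                               + ", node name = " + root.getNodeName()
                               + ", local name = " + root.getLocalName()
                               + ", namespace = " + root.getNamespaceURI());
        }
    }
}
```

Without the flag, the element has no namespace URI at all, and the prefix is simply part of its name; any code that later looks up elements by namespace will come up empty.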

Validation

There are two terms applied to XML documents that sound the same but have very different meanings: “well-formed” and “valid.” A document is well-formed if it can be parsed by a parser: all the opening elements have corresponding closing elements, text content has been properly escaped, the encoding is correct, and so on. A “valid” document, by comparison, is one where the document's structure corresponds to some specification. A document may be well-formed — and usable — but not valid.

There are multiple ways to validate input, and this article will look at two of them: Document Type Definitions (DTD) and XML Schema (XSD). A third option is RELAX NG, which tries to find a middle ground between DTD's lack of expressiveness and XSD's Byzantine structure. In my experience, it's rarely used.

Before continuing, I want to add a third, non-standard term to describe XML documents: “correct.” A validator can only check the existence, ordering, and general content of an XML file; it's equivalent to the syntax check of a Java compiler. Whether that content is actually usable by your application is another question — just like a syntactically correct program may be full of logic bugs. I once saw an attempt to use XSD to validate relationships in a complex object graph; the developer gave up after creating a 60 page schema document. Let validation do what it can, but ultimately your program must explicitly verify that an XML file contains the correct data.

DTD Validation

The Document Type Definition is part of the XML specification. A DTD describes the organization and content of an XML document in a form similar to Backus-Naur notation: a tree structure in which each element specifies the elements that it may contain (potentially none), and the order in which they must appear.

A full description of the DTD is beyond the scope of this article but the following example, for a music collection, should give its flavor. There's an element named <music>, which can hold zero or more child elements named <artist>; these in turn hold one or more <album> elements, which contain text. Finally, the <artist> element also has a required attribute name, which contains character data.

<!ELEMENT music (artist*)>
<!ELEMENT artist (album+)>
<!ELEMENT album (#PCDATA)>
<!ATTLIST artist name CDATA #REQUIRED>

An XML document identifies its DTD in a “DOCTYPE” declaration, which must appear before the first element in the document (but after the prologue!). The DOCTYPE may specify an embedded DTD, as in the example below, or it may reference an external DTD, as we'll see later.

<!DOCTYPE music [
<!ELEMENT music (artist*)>
<!ELEMENT artist (album+)>
<!ELEMENT album (#PCDATA)>
<!ATTLIST artist name CDATA #REQUIRED>
]>
<music>
    <artist name="Anderson, Laurie">
        <album>Big Science</album>
        <album>Strange Angels</album>
    </artist>
    <artist name="Fine Young Cannibals">
        <album>The Raw &amp; The Cooked</album>
    </artist>
</music>

To tell the parser to pay attention to this DOCTYPE, we need to call setValidating(true) on the factory. All parsers created by that factory will look for a DOCTYPE in their source document, and will complain if they don't find one.

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);
dbf.setValidating(true);

DocumentBuilder db = dbf.newDocumentBuilder();

Error Handlers

So what happens when the document content doesn't match the DTD? By default, the parser logs the first few errors to the console, along with a message suggesting that you use your own ErrorHandler.

Warning: validation was turned on but an org.xml.sax.ErrorHandler was not
set, which is probably not what is desired. Parser will use a default
ErrorHandler to print the first 10 errors. Please call the 'setErrorHandler'
method to fix this.
Error: URI=null Line=9: Element type "band" must be declared.
Error: URI=null Line=21: The content of element type "music" must match "(artist)*".
root element name = music

So, taking that advice, here's an error handler that will write all messages to the console. In the real world, this wouldn't be very useful (and gives no more information than the default); generally the first few messages will tell you why the parser got off the rails.

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);
dbf.setValidating(true);

DocumentBuilder db = dbf.newDocumentBuilder();
db.setErrorHandler(new ErrorHandler()
{
    @Override
    public void fatalError(SAXParseException exception) throws SAXException
    {
        System.err.println("fatalError: " + exception);
    }

    @Override
    public void error(SAXParseException exception) throws SAXException
    {
        System.err.println("error: " + exception);
    }

    @Override
    public void warning(SAXParseException exception) throws SAXException
    {
        System.err.println("warning: " + exception);
    }
});

Document dom = db.parse(new InputSource(xml));

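You probably noticed that there's not just one type of error, but three:

fatalError: the document is not well-formed, and the parser can't continue in any meaningful way.

error: a recoverable problem, such as a violation of a validity constraint; the parser can keep going.

warning: a condition that the XML specification classifies as neither an error nor a fatal error.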

Each of the handler methods is passed a SAXParseException, which gives you methods to find the location of the error within your XML: getLineNumber() and getColumnNumber(). Be aware that these methods return the position where the parser decided it had a problem; the actual error is before the reported position. Note also that column numbers start at 1, not 0.
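A handler that is more useful than console output might collect the problems along with their locations, and leave the decision about what to do with them to the caller. The following is a sketch of that idea, not a recommendation of any particular design: the class name, the validate() method, and the string format are all my own choices, and the DTD is the embedded one from earlier in this article.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class CollectingErrorHandlerExample
{
    static final String DTD =
          "<!DOCTYPE music [\n"
        + "<!ELEMENT music (artist*)>\n"
        + "<!ELEMENT artist (album+)>\n"
        + "<!ELEMENT album (#PCDATA)>\n"
        + "<!ATTLIST artist name CDATA #REQUIRED>\n"
        + "]>\n";

    /** Parses with DTD validation; returns one "line:column: message" entry per problem. */
    static List<String> validate(String xml) throws Exception
    {
        final List<String> problems = new ArrayList<String>();

        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        dbf.setValidating(true);

        DocumentBuilder db = dbf.newDocumentBuilder();
        db.setErrorHandler(new ErrorHandler()
        {
            private void record(SAXParseException e)
            {
                problems.add(e.getLineNumber() + ":" + e.getColumnNumber() + ": " + e.getMessage());
            }

            @Override
            public void warning(SAXParseException e)
            {
                record(e);
            }

            @Override
            public void error(SAXParseException e)
            {
                record(e);
            }

            @Override
            public void fatalError(SAXParseException e) throws SAXException
            {
                // not well-formed, so there's no point in continuing
                throw e;
            }
        });

        db.parse(new InputSource(new StringReader(xml)));
        return problems;
    }

    public static void main(String[] argv) throws Exception
    {
        String bad = DTD + "<music>\n    <band/>\n</music>";
        for (String problem : validate(bad))
        {
            System.out.println(problem);
        }
    }
}
```

An empty list means that the document matched its DTD; anything else is up to the caller to report or reject.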

More about the DOCTYPE

In the real world an XML document isn't going to contain its own DTD. Instead, it uses a DOCTYPE that references an external DTD, via the “SYSTEM” or “PUBLIC” keywords. The J2EE 1.3 web application deployment descriptor, web.xml, is an example that uses PUBLIC:

<?xml version="1.0"?>
<!DOCTYPE web-app
          PUBLIC "-//Sun Microsystems, Inc.//DTD Web Application 2.3//EN"
	             "http://java.sun.com/dtd/web-app_2_3.dtd">
<web-app>
    ...

The “SYSTEM” and “PUBLIC” keywords both indicate that the DOCTYPE references an external DTD; the one you pick depends on how much information you want to provide. Both also require a URI: the system identifier that uniquely identifies the DTD. This is normally an HTTP URL that points to the DTD (although it can be any unique URI), and by default the parser will attempt to read that document.

The difference between “SYSTEM” and “PUBLIC” is that the latter provides an additional piece of data, the application-specific public identifier. In practice, the public identifier is superfluous: it provides a second unique identifier for the DTD, and one is enough. What this means is that you could write your web.xml using just the “SYSTEM” identifier, and a validating parser will accept it.

<?xml version="1.0"?>
<!DOCTYPE web-app SYSTEM "http://java.sun.com/dtd/web-app_2_3.dtd">
<web-app>
    ...

Entity Resolvers

The nice thing about system identifiers is that the parser will try to read them automatically. The bad thing about system identifiers is that the parser will try to resolve them automatically — and will fail if, for example, you're not connected to the Internet. Or if the DOCTYPE uses a URI that isn't a URL. To handle these situations, you attach an EntityResolver to your parser:

DocumentBuilder db = dbf.newDocumentBuilder();
// ...
db.setEntityResolver(new EntityResolver()
{
    @Override
    public InputSource resolveEntity(String publicId, String systemId)
    throws SAXException, IOException
    {
        if (systemId.equals("urn:x-com.kdgregory.example.xml.parsing"))
        {
            // normally you open the DTD file/resource here
            return new InputSource(dtd);
        }
        else 
        {
            throw new SAXException("unable to resolve entity; "
                        + "public = \"" + publicId + "\", "
                        + "system = \"" + systemId + "\"");
        }
    }
});

Document dom = db.parse(new InputSource(xml));

There are a few things that I want to call out. First, I'm using a URN as the system ID, not a URL. This makes it very clear that the ID is just an identifier, and that you need an entity resolver to resolve it. If you have control over your domain, and want to store the DTD at an HTTP URL, then feel free to do so, although then you'd have no need for an entity resolver.

The second item is basic defensive programming: it's possible that you'll try to parse a file that references a nonexistent doctype. In that case, you want to give some information about the problem, rather than just fail with a cryptic “this document doesn't match its DOCTYPE” message.

Next, be aware that the parser might modify the system ID, to turn it into a “proper” URI. The JDK parser does just this: if you give it a system identifier “foo”, it assumes that you're referencing a file in the current directory, and will create a URI like “file:///home/me/dir/foo”.

Finally, note that the function returns an InputSource, and I comment that you'd normally open it just before use. That might cause you some concern: if you create a FileInputStream, for example, who's going to close it? And can you return the same stream every time? The answer, found in the InputSource JavaDoc, is that the parser is expected to close the stream. Once you return it, you don't have to think about it again.

I'll finish up this section by showing the XML and the DTD file that it references:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE music SYSTEM "urn:x-com.kdgregory.example.xml.parsing">
<music>
    <artist name='Bangles'>
        <album>All Over The Place</album>
        <album>Different Light</album>
    </artist>
    <artist name='Beach Boys'>
        <album>Endless Summer</album>
    </artist>
    <artist name='Beatles (The)'>
        <album>Abbey Road</album>
        <album>Beatles For Sale</album>
        <album>Revolver</album>
    </artist>
</music>

<!ELEMENT music (artist*)>
<!ELEMENT artist (album+)>
<!ELEMENT album (#PCDATA)>
<!ATTLIST artist name CDATA #REQUIRED>

Note that the DTD file just contains the element and attribute definitions; it does not start with DOCTYPE, and is not an XML file (the default template for Eclipse puts an XML prologue at the top of the file, and you have to delete it to get a valid DTD).
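Putting the pieces of this section together, here is a self-contained sketch that validates a document against the URN-identified DTD. To keep it runnable without any files, I hold the DTD in a string constant and wrap it in a StringReader; in real code you'd open a file or classpath resource at that point. The class name and the parse() helper are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class EntityResolverExample
{
    static final String SYSTEM_ID = "urn:x-com.kdgregory.example.xml.parsing";

    static final String DTD =
          "<!ELEMENT music (artist*)>\n"
        + "<!ELEMENT artist (album+)>\n"
        + "<!ELEMENT album (#PCDATA)>\n"
        + "<!ATTLIST artist name CDATA #REQUIRED>\n";

    /** Parses and validates; throws SAXException if the document is invalid. */
    static Document parse(String xml) throws Exception
    {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        dbf.setValidating(true);

        DocumentBuilder db = dbf.newDocumentBuilder();
        db.setEntityResolver(new EntityResolver()
        {
            @Override
            public InputSource resolveEntity(String publicId, String systemId)
            throws SAXException, IOException
            {
                if (SYSTEM_ID.equals(systemId))
                {
                    // a new Reader for every request; the parser closes it
                    return new InputSource(new StringReader(DTD));
                }
                throw new SAXException("unable to resolve entity; "
                            + "public = \"" + publicId + "\", "
                            + "system = \"" + systemId + "\"");
            }
        });
        db.setErrorHandler(new ErrorHandler()
        {
            @Override
            public void warning(SAXParseException e)
            {
                // ignored for this example
            }

            @Override
            public void error(SAXParseException e) throws SAXException
            {
                throw e;
            }

            @Override
            public void fatalError(SAXParseException e) throws SAXException
            {
                throw e;
            }
        });

        return db.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] argv) throws Exception
    {
        String xml = "<!DOCTYPE music SYSTEM \"" + SYSTEM_ID + "\">"
                   + "<music><artist name='Bangles'><album>Different Light</album></artist></music>";
        System.out.println("root element name = " + parse(xml).getDocumentElement().getNodeName());
    }
}
```

Writing the comparison as SYSTEM_ID.equals(systemId), rather than the other way around, means that a null system ID falls through to the exception instead of causing a NullPointerException.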

Pitfalls of DTD Validation

Although DTD validation is simple, there are several holes that you can stumble into. Foremost is that the document must have a DOCTYPE, otherwise the parser will have no idea how to validate it. While it's possible to insert a DOCTYPE where missing, doing so can be extremely difficult: consider the case of parsing the body of an HTTP POST, where you get the document as an InputStream from the HttpServletRequest.

A second pitfall is that, even when a DOCTYPE is present, you have no control over the DTD that it specifies. If the document provides an internal DTD, the parser won't look to your entity resolver. In which case the document may be completely incorrect as far as your application is concerned, and you'll never know until you try to access its data.

And finally, DTD validation looks only at the containment hierarchy, not the actual data. You can specify that element “<order>” must contain an element “<orderId>”, and that the latter must contain text data, but you can't specify that that text is a ten-digit number — or that the number matches anything in your database.

Schema Validation

For more complex validation you can use an XML schema, which allows you to specify additional constraints on document content. A simple schema for our album database might look like this:

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

    <xsd:element name="music" type="MusicCollectionType" />

    <xsd:complexType name="MusicCollectionType">
        <xsd:sequence>
            <xsd:element name="artist" type="ArtistType" minOccurs="0" maxOccurs="unbounded" />
        </xsd:sequence>
    </xsd:complexType>

    <xsd:complexType name="ArtistType">
        <xsd:sequence>
            <xsd:element name="album" type="xsd:string" minOccurs="0" maxOccurs="unbounded" />
        </xsd:sequence>
        <xsd:attribute name="name" type="xsd:string" />
    </xsd:complexType>

</xsd:schema>

Earlier I said that the structure and definition of a DTD were beyond the scope of this document. The structure of an XML schema is more so: while basic schemas are simple, the details can be mind-boggling. At the end of this article you'll find links to three documents provided by the W3C that tell you almost everything you'll need to know.

That said, this example should give you the flavor of a schema: you define types for your elements and attributes, and those types can reference other types. Which leaves the question: how do you use it?

InputStream xml = // ...
InputStream xsd = // ...

SchemaFactory xsf = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
Schema xs = xsf.newSchema(new StreamSource(xsd)); 

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);
dbf.setValidating(false);
dbf.setSchema(xs);

DocumentBuilder db = dbf.newDocumentBuilder();

Document dom = db.parse(new InputSource(xml));

There are a few important points: first is that you configure the factory as non-validating. In the JAXP API, setValidating() controls DTD validation only; schema validation is enabled by attaching a Schema to the factory. You could simply skip the call to DocumentBuilderFactory.setValidating(), since by default it creates non-validating parsers. As a how-to, however, I wanted to make this point explicit.

Second is that you want your parser to be namespace aware. Although a simple schema may not require namespaces, you have no guarantee that an instance document won't use them. And if it does, and you haven't processed that document with a namespace-aware parser, then it will fail validation — or worse, incorrectly pass.

Finally, note that the schema instance is attached to the DocumentBuilderFactory: all parsers created by that factory will share the same schema. This is driven by practicality: since a Schema object is relatively expensive to construct, you want to reuse it. And since your application probably doesn't use a different schema for each document that it reads, it's simplest to attach it to the factory.
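Here is how those points might look in a complete program: the schema is compiled once, attached to one factory, and every builder from that factory validates against it. This is a sketch under my own assumptions. The class and method names are hypothetical, the schema is the one shown above (compressed into a string so that the example needs no files), and the error handler simply throws on the first error.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class SchemaValidationExample
{
    static final String XSD =
          "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>"
        + "<xsd:element name='music' type='MusicCollectionType'/>"
        + "<xsd:complexType name='MusicCollectionType'><xsd:sequence>"
        + "<xsd:element name='artist' type='ArtistType' minOccurs='0' maxOccurs='unbounded'/>"
        + "</xsd:sequence></xsd:complexType>"
        + "<xsd:complexType name='ArtistType'><xsd:sequence>"
        + "<xsd:element name='album' type='xsd:string' minOccurs='0' maxOccurs='unbounded'/>"
        + "</xsd:sequence><xsd:attribute name='name' type='xsd:string'/></xsd:complexType>"
        + "</xsd:schema>";

    /** Compiles the schema once and attaches it to a reusable factory. */
    static DocumentBuilderFactory newFactory() throws SAXException
    {
        SchemaFactory xsf = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema xs = xsf.newSchema(new StreamSource(new StringReader(XSD)));

        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        dbf.setSchema(xs);
        return dbf;
    }

    /** Returns false if parsing or validation reports any error. */
    static boolean isValid(DocumentBuilderFactory dbf, String xml)
    throws ParserConfigurationException, IOException
    {
        DocumentBuilder db = dbf.newDocumentBuilder();
        db.setErrorHandler(new ErrorHandler()
        {
            @Override
            public void warning(SAXParseException e)
            {
                // ignored for this example
            }

            @Override
            public void error(SAXParseException e) throws SAXException
            {
                throw e;
            }

            @Override
            public void fatalError(SAXParseException e) throws SAXException
            {
                throw e;
            }
        });

        try
        {
            db.parse(new InputSource(new StringReader(xml)));
            return true;
        }
        catch (SAXException e)
        {
            return false;
        }
    }

    public static void main(String[] argv) throws Exception
    {
        DocumentBuilderFactory dbf = newFactory();
        System.out.println(isValid(dbf, "<music><artist name='Bangles'/></music>"));
        System.out.println(isValid(dbf, "<music><band/></music>"));
    }
}
```

In a long-running application the factory (or at least the Schema) would live in a field rather than being rebuilt for each document.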

Multiple schemas for a single document

Given that schema documents tend to be large and complex, you may want to modularize them: for example, maintain each class as its own XSD file. Or, you may be receiving data from multiple sources, and want to keep the schema for each source separate.

This appears to be easy to do: SchemaFactory.newSchema() has a variant that takes an array of source documents, and combines them into a single Schema instance. However, this method is documented to behave as if creating a new schema document with import directives for each of the source documents. As a result, you can't combine source files with the same target namespace (there is a bug for this, but reading the schema docs leads me to believe that it's intentional), and you must order the source documents so that definitions in one namespace are available when requested from another (I haven't found anything in the docs to justify this).

Provided that your schema documents are accessible via URL or filesystem, the best solution is to create a single top-level schema document, and use explicit import and include directives to load your definitions. The Practical XML library also provides the method SchemaUtil.combineSchemas() to combine schema documents that are not loadable via URL.
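For the simple case where the array variant of newSchema() does work, here is a sketch. It deliberately sidesteps both restrictions described above: the two schema documents have different target namespaces, and neither refers to the other. Everything here is invented for illustration, including the class name, the two tiny schemas, and their urn:x-example namespaces. For brevity it validates with a Validator over a stream rather than attaching the schema to a parser.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.Source;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class CombinedSchemaExample
{
    static final String ARTIST_XSD =
          "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'"
        + " targetNamespace='urn:x-example:artist' elementFormDefault='qualified'>"
        + "<xsd:element name='artist' type='xsd:string'/>"
        + "</xsd:schema>";

    static final String ALBUM_XSD =
          "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'"
        + " targetNamespace='urn:x-example:album' elementFormDefault='qualified'>"
        + "<xsd:element name='trackCount' type='xsd:int'/>"
        + "</xsd:schema>";

    /** Combines the two independent schema documents into one Schema object. */
    static Schema combinedSchema() throws SAXException
    {
        SchemaFactory xsf = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        return xsf.newSchema(new Source[]
        {
            new StreamSource(new StringReader(ARTIST_XSD)),
            new StreamSource(new StringReader(ALBUM_XSD))
        });
    }

    /** With no ErrorHandler set, the validator throws on the first error. */
    static boolean isValid(Schema schema, String xml) throws IOException
    {
        try
        {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        }
        catch (SAXException e)
        {
            return false;
        }
    }

    public static void main(String[] argv) throws Exception
    {
        Schema schema = combinedSchema();
        System.out.println(isValid(schema, "<artist xmlns='urn:x-example:artist'>Bangles</artist>"));
        System.out.println(isValid(schema, "<trackCount xmlns='urn:x-example:album'>twelve</trackCount>"));
    }
}
```

Once the schema documents start referring to each other, or share a namespace, I'd fall back to the single top-level document described above.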

Documents that specify their own Schema

The Schema specification allows an instance document to use the attributes xsi:schemaLocation and xsi:noNamespaceSchemaLocation to specify the schema to be used for its own validation (the xsi prefix refers to the Schema Instance namespace, “http://www.w3.org/2001/XMLSchema-instance”). The first attribute is used for documents with a specific target namespace, the second for documents without a namespace. The following example is from the W3C Schema Primer, redacted and reformatted to highlight attribute usage:

<purchaseReport
    xmlns="http://www.example.com/Report"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.example.com/Report http://www.example.com/Report.xsd"
    ...

Unlike DOCTYPE, the schemaLocation attribute is just a hint, and the parser is free to silently ignore it. And that's just what the JDK 1.4 and 1.5 parsers do: as far as they're concerned, it's just another attribute used by the program, without special meaning to the parser. The JDK 1.6 parser does support this attribute, provided that you set a parser feature that's only documented on the Xerces website:

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);
dbf.setValidating(false);
dbf.setFeature("http://apache.org/xml/features/validation/schema", true);

If you do use this feature, be aware that the document's self-specified schema will take precedence over any schemas that you set manually (at least with 1.6; I haven't tried later JDKs). Personally, I don't believe in letting a document control its own validation, so I would not use this feature even if it were universally supported.

Validating after Parsing

All examples so far have been validated at the time they are parsed. The Schema object also supports validation after parsing, by creating a Validator:

InputStream xml = // ...
InputStream xsd = // ...

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);
dbf.setValidating(false);

DocumentBuilder db = dbf.newDocumentBuilder();

Document dom = db.parse(new InputSource(xml));

SchemaFactory xsf = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
Schema xs = xsf.newSchema(new StreamSource(xsd));
Validator validator = xs.newValidator();
validator.validate(new DOMSource(dom));

I don't like this approach. First off, there's no benefit in terms of performance: both the schema and instance document still have to be parsed, and the validator must still examine the entire instance document. Nor does it add convenience — instead, it adds more lines of code to maintain.

But the most important reason that I don't do this is: you lose source line numbers! Once the document has been parsed, it's a tree of DOM nodes. While the validator may tell you that node “<foo>” has a problem, there may be dozens — or hundreds — of nodes with that name.

A Few Last Things

Illegal Characters

One of the nastier surprises when parsing XML is realizing that you have illegal characters in your input. For some unknown reason, the XML 1.0 spec explicitly omitted most of the non-printing characters in the range 0x00 to 0x1F; the only allowed non-printing ASCII characters are tab, newline, and carriage return. Making this restriction even more incomprehensible, almost every other Unicode character is permitted (although control characters in the range 0x7F to 0x9F are “discouraged,” as are unassigned Unicode codepoints).

The result of this specification is that the following document cannot be parsed by the JDK's parser — or by any other parser that strictly conforms to the XML 1.0 specification.

<myData>
    This is a BEL: &#7;
</myData>

The solution is to parse the document as XML 1.1. This allows all ASCII control characters other than NUL (which is used to terminate C strings, so prohibiting it was a matter of practicality). The one caveat is that all control characters (including those in the range 0x7F to 0x9F) now must be written as numeric character references, like the &#7; above; the parser will reject any that appear unescaped.

To switch the parser into “1.1 mode,” simply attach a prologue to the document:

<?xml version="1.1"?>
<myData>
    This is a BEL: &#7;
</myData>
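If you want to check how your own parser treats the two versions, a sketch like the following will do it. The class and method names are mine; the document body is the example from above, with and without the 1.1 prologue.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class ControlCharacterExample
{
    static final String BODY = "<myData>This is a BEL: &#7;</myData>";

    /** Returns true if the parser accepts the document, false if it rejects it. */
    static boolean isParseable(String xml)
    throws ParserConfigurationException, IOException
    {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();

        // replaces the default handler so that failures aren't also logged to the console
        db.setErrorHandler(new ErrorHandler()
        {
            @Override
            public void warning(SAXParseException e)
            {
                // ignored for this example
            }

            @Override
            public void error(SAXParseException e) throws SAXException
            {
                throw e;
            }

            @Override
            public void fatalError(SAXParseException e) throws SAXException
            {
                throw e;
            }
        });

        try
        {
            db.parse(new InputSource(new StringReader(xml)));
            return true;
        }
        catch (SAXException e)
        {
            return false;
        }
    }

    public static void main(String[] argv) throws Exception
    {
        System.out.println("no prologue:  " + isParseable(BODY));
        System.out.println("1.1 prologue: " + isParseable("<?xml version=\"1.1\"?>" + BODY));
    }
}
```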

The jaxp.debug System Property

Each of the JDK's XML factory classes has a rather involved procedure for finding the actual implementation class. Part of this process uses system properties, and you may someday find yourself in the position of tracking down an error because someone has thoughtlessly changed the value of one of these properties in a shared environment (it happened to me). To see exactly what classes are being loaded, you can set another system property: jaxp.debug. Once this property is set, the factory methods will start logging their operation:

JAXP: find factoryId =javax.xml.parsers.DocumentBuilderFactory
JAXP: loaded from fallback value: com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl
JAXP: created new instance of class com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl
using ClassLoader: sun.misc.Launcher$AppClassLoader@133056f

Of course, you should only set this while you're trying to track down a classloading problem. Leaving it set will cause people to come looking for you, to explain why their logs are filled with JAXP messages.
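For a quick check, you can also just ask the factory what class it is. The sketch below does that, and sets jaxp.debug programmatically first. The class name is hypothetical, and I'm assuming that the property is read on the first factory lookup and that the value "1" turns the logging on; setting it on the command line with -Djaxp.debug=1 avoids any question of timing.

```java
import javax.xml.parsers.DocumentBuilderFactory;

class JaxpDebugExample
{
    /** Returns the name of the class that the JAXP lookup actually selected. */
    static String factoryImplementation()
    {
        return DocumentBuilderFactory.newInstance().getClass().getName();
    }

    public static void main(String[] argv)
    {
        // assumption: this must happen before the first JAXP factory lookup
        System.setProperty("jaxp.debug", "1");

        System.out.println("factory implementation: " + factoryImplementation());
    }
}
```

The class name alone is often enough to tell whether you have the JDK's built-in parser or one that somebody dropped onto the classpath.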

For More Information

The Practical XML library provides utilities for working with XML in many different ways. Of particular interest for this article are the ParseUtil and SchemaUtil classes.

The Annotated XML Specification is my preferred reference for XML 1.0. Tim Bray, one of the original spec editors, has hyperlinked commentary and historical information to the source specification. Unfortunately, he has not done the same for XML 1.1.

As noted above, the official XML Schema documentation has three parts: Primer, Structures, and Datatypes. The first of these is the closest to a tutorial. The latter two are of interest primarily to language lawyers and those writing how-to articles.

Also as noted above, the Sun JDK uses the Xerces parser. This parser supports several configuration options that are not documented as part of the official Java API. Use only if you don't care about compatibility between different JREs, and also be aware that the JDK and Xerces development threads are not in lock-step, so some Xerces features may not apply to your JDK.

You can download all examples for this article as a Maven project in a JAR file.

Copyright © Keith D Gregory, all rights reserved
