kdgregory.com
Practical XML: Parsing

So you've built some XML, now what do you do? Well, parse it, of course! After all, it doesn't do you much good as a bunch of bytes on a disk.

In my experience, a DOM document is the most usable form for parsed XML, because it can be accessed multiple times once parsed. By comparison, with a SAX parser you have to know exactly what you're looking for at the time you parse. Useful if you're unmarshalling JAXB objects or running an XSLT transform, not so useful if you're exploring a data structure. The downside of DOM, of course, is that it's all in memory, and the DOM implementation adds quite a bit to the memory footprint of the data.

Basic Parsing

Let's dive right into code:
Reader xml = new StringReader("<foo><bar>baz</bar></foo>");
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document dom = db.parse(new InputSource(xml));
System.out.println("root element name = " + dom.getDocumentElement().getNodeName());
And that's all you need to parse a simple XML string. However, chances are good that you're not parsing simple literal strings, so read on …

Input Sources and Encodings

The XML specification requires that an XML document either have a prologue that specifies its encoding, or be encoded in UTF-8. But in this example I used a Java String, which is UTF-16 encoded, without a prologue. So how did it get parsed? The answer is that the parser did not read the string directly. Instead it read from an InputSource wrapped around a Reader, which hands the parser characters that have already been decoded.

If you get XML from someone who doesn't know the rules, it may not be UTF-8 encoded. For example, XML generated using simple string output in a Windows environment will probably be encoded in […]

Namespaces

Because XML is a simple text format, the meaning of each element is defined by the program writing or reading that element. Normally, this isn't an issue, especially if the XML is both produced and processed within the same organization. A problem occurs when the program has to process documents from multiple sources, which may apply different meaning to elements with the same name: an […]

Namespaces are one part of the solution to that problem: elements retain short, readable names like “invoice,” but also have an associated namespace URI like […]

Except for one small problem: the Namespace spec was introduced in 1999, while the DOM level 1 spec was released in 1998. The JDK requires its parser implementations to provide backwards compatibility, so you must explicitly tell the implementation to produce namespace-aware parsers:

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);
DocumentBuilder db = dbf.newDocumentBuilder();

Note that you set this flag on the factory, not on the parser itself. I recommend that you always set it, as there's no benefit to not doing so. In legacy code you'll often find this step omitted, but be careful about changing it: there may be XPath or XSLT transforms that use the data, and they'll have to be changed as well.

Validation

There are two terms applied to XML documents that sound the same but have very different meanings: “well-formed” and “valid.” A document is well-formed if it can be parsed by a parser: in other words, all the opening tags have corresponding closing tags, text content has been properly escaped, the encoding is correct, and so on. A “valid” document, by comparison, is one where the structure corresponds to some specification. A document may be well-formed, and usable, but not valid.

There are multiple ways to validate input, and this article will look at two of them: Document Type Definitions (DTD) and XML Schema (XSD). A third option is Relax NG, which tries to find a middle ground between DTD's lack of expressiveness and XSD's Byzantine structure. It is supported by the JDK's parser, but I haven't used it and cannot in good conscience talk about how to use it.

Before I continue, I want to add a third term to describe XML documents: “correct.” A validator can only check the existence, ordering, and general content of an XML file. Whether that content is actually usable by your application is another question.
I once saw an attempt to use XSD to validate relationships in a complex object graph; the developer gave up after creating a 60-page schema document. Let validation do what it can, but ultimately your program must explicitly verify that an XML file contains the correct data.

DTD Validation

The Document Type Definition is part of the XML specification. A DTD describes the organization and content of an XML document in a form similar to Backus-Naur notation: a tree structure in which each element specifies the elements that it may contain (potentially none), and the order in which they must appear.

A full description of the DTD is beyond the scope of this article, but the following example, for a music collection, should give its flavor. There's an element named “music” that holds zero or more “artist” elements; each artist has a required “name” attribute and holds one or more “album” elements, which contain text.

<!ELEMENT music (artist*)>
<!ELEMENT artist (album+)>
<!ELEMENT album (#PCDATA)>
<!ATTLIST artist name CDATA #REQUIRED>

An XML document identifies its DTD in a “DOCTYPE” declaration, which can either embed the DTD or reference an external one.

So, here's our music collection with DOCTYPE and embedded DTD:
<!DOCTYPE music [
<!ELEMENT music (artist*)>
<!ELEMENT artist (album+)>
<!ELEMENT album (#PCDATA)>
<!ATTLIST artist name CDATA #REQUIRED>
]>
<music>
<artist name="Anderson, Laurie">
<album>Big Science</album>
<album>Strange Angels</album>
</artist>
<artist name="Fine Young Cannibals">
<album>The Raw &amp; The Cooked</album>
</artist>
</music>
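Before turning validation on, it's worth seeing what a default parser does with a document that breaks these rules. The sketch below is my own illustration, not part of the collection above: it deliberately leaves out the required name attribute, and the class and method names are made up for the example. A non-validating parser reads the embedded DTD but does not enforce it, so the parse should succeed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class: a document can be well-formed without being valid.
final class WellFormedNotValid {

    // Same DTD as above, but the artist element lacks its required "name" attribute.
    static final String INVALID_MUSIC =
          "<!DOCTYPE music [\n"
        + "<!ELEMENT music (artist*)>\n"
        + "<!ELEMENT artist (album+)>\n"
        + "<!ELEMENT album (#PCDATA)>\n"
        + "<!ATTLIST artist name CDATA #REQUIRED>\n"
        + "]>\n"
        + "<music><artist><album>Big Science</album></artist></music>";

    static int countArtists(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        // setValidating() stays at its default of false, so the DTD's rules are not enforced
        Document dom = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        return dom.getElementsByTagName("artist").getLength();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("artists found = " + countArtists(INVALID_MUSIC));
    }
}
```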
To tell the parser to pay attention to this DOCTYPE, we need to call setValidating(true) on the factory:

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);
dbf.setValidating(true);
DocumentBuilder db = dbf.newDocumentBuilder();

Error Handlers

So what happens when the document content doesn't match the DTD? By default, the parser logs the first few errors to the console, along with a message suggesting that you use your own ErrorHandler:
ErrorHandler myErrorHandler = new ErrorHandler()
{
    public void fatalError(SAXParseException exception)
    throws SAXException
    {
        System.err.println("fatalError: " + exception);
    }

    public void error(SAXParseException exception)
    throws SAXException
    {
        System.err.println("error: " + exception);
    }

    public void warning(SAXParseException exception)
    throws SAXException
    {
        System.err.println("warning: " + exception);
    }
};

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);
dbf.setValidating(true);
DocumentBuilder db = dbf.newDocumentBuilder();
db.setErrorHandler(myErrorHandler);
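To watch a handler actually receive something, here's a hedged sketch that collects problems in a list instead of printing them. The class name, method name, and sample documents are my own inventions for illustration; the DTD is the music DTD from earlier, and the invalid sample leaves out the required name attribute.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical demo class: gather validation problems rather than logging them.
final class CollectingHandlerSketch {

    static final String DTD =
          "<!DOCTYPE music [\n"
        + "<!ELEMENT music (artist*)>\n"
        + "<!ELEMENT artist (album+)>\n"
        + "<!ELEMENT album (#PCDATA)>\n"
        + "<!ATTLIST artist name CDATA #REQUIRED>\n"
        + "]>\n";

    static final String VALID =
        DTD + "<music><artist name=\"Anderson, Laurie\"><album>Big Science</album></artist></music>";
    static final String INVALID =
        DTD + "<music><artist><album>Big Science</album></artist></music>";

    static List<String> validate(String xml) throws Exception {
        final List<String> problems = new ArrayList<>();
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        dbf.setValidating(true);
        DocumentBuilder db = dbf.newDocumentBuilder();
        db.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) {
                problems.add("warning at line " + e.getLineNumber() + ": " + e.getMessage());
            }
            public void error(SAXParseException e) {
                problems.add("error at line " + e.getLineNumber() + ": " + e.getMessage());
            }
            public void fatalError(SAXParseException e) throws SAXException {
                throw e;    // the document is not well-formed, so stop parsing
            }
        });
        db.parse(new InputSource(new StringReader(xml)));
        return problems;
    }

    public static void main(String[] args) throws Exception {
        for (String problem : validate(INVALID)) {
            System.out.println(problem);
        }
    }
}
```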
The first thing that you probably noticed is that there's not just one error, but three: warning, error, and fatal error.

Each of the handler methods is passed a SAXParseException describing the problem.

More about the DOCTYPE

In the real world an XML document isn't going to contain its own DTD. Instead, it uses a DOCTYPE that references an external DTD, via the “SYSTEM” or “PUBLIC” keyword:
<?xml version="1.0"?>
<!DOCTYPE web-app
PUBLIC "-//Sun Microsystems, Inc.//DTD Web Application 2.3//EN"
"http://java.sun.com/dtd/web-app_2_3.dtd">
<web-app>
...
The SYSTEM and PUBLIC keywords both indicate that the DOCTYPE references an external DTD; the one you pick depends on how much information you want to provide. Both require a URI: the system identifier that uniquely identifies the DTD. This is normally an HTTP URL that points to the DTD (although it can be any unique URI), and by default the parser will attempt to read that document.

The difference between SYSTEM and PUBLIC is that the latter provides an additional piece of data, the application-specific public identifier. In practice, the public identifier is superfluous: it provides a second unique identifier for the DTD, and one is enough (note that Tim Bray, one of the original XML spec editors, shows a strong preference for public identifiers and wishes that system identifiers were optional). What this means is that you could write your web.xml DOCTYPE like this:
<?xml version="1.0"?>
<!DOCTYPE web-app SYSTEM "http://java.sun.com/dtd/web-app_2_3.dtd">
<web-app>
...
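To see what the parser does with those two identifiers, you can have it report what it asks for. This is a minimal sketch under my own assumptions: the class and method names are hypothetical, and it uses an EntityResolver (the subject of the next section) that hands back an empty stand-in DTD so that nothing is fetched from the network.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical demo class: record the identifiers the parser passes along for the DOCTYPE.
final class DoctypeIdsSketch {

    static final String PUBLIC_DOC =
          "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE web-app\n"
        + "    PUBLIC \"-//Sun Microsystems, Inc.//DTD Web Application 2.3//EN\"\n"
        + "    \"http://java.sun.com/dtd/web-app_2_3.dtd\">\n"
        + "<web-app/>";

    static final String SYSTEM_DOC =
          "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE web-app SYSTEM \"http://java.sun.com/dtd/web-app_2_3.dtd\">\n"
        + "<web-app/>";

    // Returns { publicId, systemId } as requested by the parser.
    static String[] requestedIds(String xml) throws Exception {
        final String[] ids = new String[2];
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        db.setEntityResolver(new EntityResolver() {
            public InputSource resolveEntity(String publicId, String systemId) {
                ids[0] = publicId;
                ids[1] = systemId;
                // empty stand-in DTD: keeps the parser away from the network
                return new InputSource(new StringReader(""));
            }
        });
        db.parse(new InputSource(new StringReader(xml)));
        return ids;
    }

    public static void main(String[] args) throws Exception {
        for (String doc : new String[] { PUBLIC_DOC, SYSTEM_DOC }) {
            String[] ids = requestedIds(doc);
            System.out.println("public = " + ids[0] + ", system = " + ids[1]);
        }
    }
}
```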
Entity Resolvers

The nice thing about system identifiers is that the parser will try to read them automatically. The bad thing about system identifiers is that the parser will try to resolve them automatically, and will fail if, for example, you're not connected to the Internet. Or if the DOCTYPE uses a URI that isn't a URL, such as “[…]”. The way around this is to give the parser your own EntityResolver:
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);
dbf.setValidating(true);
DocumentBuilder db = dbf.newDocumentBuilder();
db.setEntityResolver(new EntityResolver()
{
public InputSource resolveEntity(String publicId, String systemId)
throws SAXException, IOException
{
return new InputSource(new StringReader(dtd));
}
});
This example is about as simple as you can get: whatever the parser asks for, it gets back a DTD defined by the application. If your application must support multiple DTDs, it can use the passed system and public identifiers to pick the appropriate one.

When writing an entity resolver, always remember that the system identifier is a URI, and the parser is allowed to manipulate the literal value given in the document. For example, the parser will look at the following document, assume that “foo” is a relative URI, and hand the resolver an expanded version of it:

<!DOCTYPE music SYSTEM "foo">

Pitfalls of DTD Validation

Although DTD validation is simple, there are several holes that you can stumble into. Foremost is that the document must have a DOCTYPE […]

A second pitfall is that, even when a […]

And finally, DTD validation will only look at the containment hierarchy, not the actual data. You can specify that element “[…]

Schema Validation

This is where XML Schema comes in: it associates a type definition with each of the document's elements and attributes.
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:element name="purchaseOrder" type="PurchaseOrderType"/>
<xsd:complexType name="PurchaseOrderType">
<xsd:sequence>
<xsd:element name="shipTo" type="USAddress"/>
<xsd:element name="billTo" type="USAddress"/>
<xsd:element name="items" type="Items"/>
</xsd:sequence>
<xsd:attribute name="orderDate" type="xsd:date"/>
</xsd:complexType>
<xsd:complexType name="USAddress">
<xsd:sequence>
<xsd:element name="name" type="xsd:string"/>
<xsd:element name="street" type="xsd:string"/>
<xsd:element name="city" type="xsd:string"/>
<xsd:element name="state" type="xsd:string"/>
<xsd:element name="zip" type="xsd:decimal"/>
</xsd:sequence>
<xsd:attribute name="country" type="xsd:NMTOKEN" fixed="US"/>
</xsd:complexType>
...
The above example was adapted from the introductory schema in the W3C Schema Primer. It contains two type definitions: PurchaseOrderType and USAddress.

Earlier I noted that the structure of a DTD was beyond the scope of this article. The structure of an XML schema is more so: while basic schemas are simple, the details can be mind-boggling. The W3C documentation spans three relatively long files.

On the other hand, parsing an XML document with schema validation is very simple: you compile the XSD document into a Schema object, and attach it to the DocumentBuilderFactory:
InputStream xml = // ...
InputStream xsd = // ...

SchemaFactory xsFact = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
Schema schema = xsFact.newSchema(new StreamSource(xsd));

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);
dbf.setValidating(false);
dbf.setSchema(schema);
DocumentBuilder db = dbf.newDocumentBuilder();
Document dom = db.parse(new InputSource(xml));

There are a few important points: first is that you configure the parser as non-validating. The term “valid,” when applied to XML documents, implies the use of a DTD. You could simply omit the call to setValidating(false), because that is the factory's default.

Second is that you want your parser to be namespace aware. Although a simple schema may not require namespaces, you have no guarantee that an instance document won't use them. And if it does, and you haven't processed that document with a namespace-aware parser, then it will fail validation (or worse, incorrectly pass).

Finally, note that the schema instance is attached to the DocumentBuilderFactory, so every parser created by that factory will use it.
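Here is the same recipe as a complete program, offered as a sketch. The one-element schema, the sample documents, and the class and method names are all made up for illustration; none of them come from the Schema Primer. The error handler throws on any validation error, so a failed parse means the document is either invalid or not well-formed.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical demo class: parse with schema validation attached to the factory.
final class SchemaValidationSketch {

    // Made-up schema: a single <zip> element holding a decimal number.
    static final String XSD =
          "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>"
        + "<xsd:element name='zip' type='xsd:decimal'/>"
        + "</xsd:schema>";

    static boolean isValid(String xml) throws Exception {
        SchemaFactory xsFact = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = xsFact.newSchema(new StreamSource(new StringReader(XSD)));

        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        dbf.setValidating(false);   // "validating" means DTD validation, which we don't want
        dbf.setSchema(schema);

        DocumentBuilder db = dbf.newDocumentBuilder();
        db.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { /* ignored in this sketch */ }
            public void error(SAXParseException e) throws SAXException { throw e; }
            public void fatalError(SAXParseException e) throws SAXException { throw e; }
        });
        try {
            db.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;           // schema violation, or the document was not well-formed
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("numeric zip valid? " + isValid("<zip>12345</zip>"));
        System.out.println("text zip valid?    " + isValid("<zip>abc</zip>"));
    }
}
```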
Multiple schemas for a single document

Given that schema documents tend to be large and complex, you may want to modularize them: for example, maintain each class as its own XSD file. Or, you may be receiving data from multiple sources, and want to keep the schema for each source separate. This appears to be easy to do: […]

Provided that your schema documents are accessible via URL or filesystem, the best solution is to create a single top-level schema document, and use explicit […]

Documents that specify their own Schema

The Schema specification allows an instance document to use the attributes xsi:schemaLocation and xsi:noNamespaceSchemaLocation to point to its schema:
<purchaseReport
xmlns="http://www.example.com/Report"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.example.com/Report http://www.example.com/Report.xsd"
...
Unlike
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);
dbf.setValidating(false);
dbf.setFeature("http://apache.org/xml/features/validation/schema", true);
If you try to set this feature value on JDK 1.5, it will throw a […]

Validating after Parsing

All examples so far have been validated at the time they are parsed. The Validator class can also check a document that has already been parsed:
InputStream xml = // however you get the instance document
InputStream xsd = // however you get the schema definition

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);
dbf.setValidating(false);
DocumentBuilder db = dbf.newDocumentBuilder();
Document dom = db.parse(new InputSource(xml));

SchemaFactory xsFact = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
Schema schema = xsFact.newSchema(new StreamSource(xsd));
Validator validator = schema.newValidator();
validator.validate(new DOMSource(dom));

I don't like this approach. First off, there's no benefit in terms of performance: both the schema and instance document still have to be parsed, and the validator must still examine the entire instance document. Nor does it add convenience; instead, it adds more lines of code to maintain.

But the most important reason that I don't do this is: you lose source line numbers! Once the document has been parsed, it's a tree of DOM nodes. While the validator may tell you that node “[…]

A Few Last Things

Illegal Characters

One of the nastier surprises when parsing XML is realizing that you have illegal characters in your input. For some unknown reason, the XML 1.0 spec explicitly omitted most of the non-printing characters in the range 0x00 to 0x1F; the only allowed non-printing ASCII characters are tab, newline, and carriage return. Making this restriction even more incomprehensible, almost every other Unicode character is permitted (although control characters in the range 0x7F to 0x9F are “discouraged,” as are unassigned Unicode codepoints).

The result of this specification is that the following document cannot be parsed by the JDK's parser, or by any other parser that strictly conforms to the XML 1.0 specification.
<myData>
This is a BEL: 
</myData>
The solution is to parse the document as XML 1.1. This allows all ASCII control characters other than NUL.

To switch the parser into “1.1 mode,” simply attach a prologue to the document:
<?xml version="1.1"?>
<myData>
This is a BEL: 
</myData>
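A small sketch to check this behavior for yourself. It assumes the BEL is written as the character reference &#7; (XML 1.1 permits the extra control characters only as references, not as literal bytes), and the class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class: the same content under an XML 1.0 and an XML 1.1 prologue.
final class Xml11Sketch {

    static final String AS_XML_10 = "<?xml version=\"1.0\"?><myData>&#7;</myData>";
    static final String AS_XML_11 = "<?xml version=\"1.1\"?><myData>&#7;</myData>";

    // Parses the document and returns the text content of its root element.
    static String textOf(String xml) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document dom = db.parse(new InputSource(new StringReader(xml)));
        return dom.getDocumentElement().getTextContent();
    }

    // True if the parser accepts the document, false if it rejects it.
    static boolean parses(String xml) {
        try {
            textOf(xml);
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("parses as 1.0? " + parses(AS_XML_10));
        System.out.println("parses as 1.1? " + parses(AS_XML_11));
    }
}
```

When a parse is rejected, the JDK's default error handler will also write a "[Fatal Error]" line to standard error, so expect some noise alongside the output.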