Why XML is bad for representing arbitrary data

For almost a decade, XML has been recommended by many parties as a universal format for data representation. This fact is so widely known that it hardly merits mention. A far from exhaustive list of proponents would include the following.

The XML-RPC group
The XML Protocol Working Group (and the entire Web Services community)
GNOME
Jabber

Even Microsoft® Corporation, renowned far and wide for their own proprietary protocols and file formats, have begun adopting the use of XML for various purposes, including the communication protocol for Microsoft BizTalk and the storage formats for Microsoft Office.

However, I believe that XML is extremely poorly suited to this kind of work. Furthermore, I believe that I have very good reasons for believing so. I would even go as far as saying that the usage of XML is directly harmful to the computing community in general. The purpose of this document is for me to voice my opinion about this, along with my reasons for this opinion. If you believe that I am wrong about anything in this document, please let me know about it (my e-mail address is at the bottom of the page).

While I am on a larger crusade against the misuse of web standards, including HTTP, the XML issue is the one that I feel is of greatest consequence, and I will therefore dedicate this document specifically to it.

A brief introduction to XML

XML was invented in the mid-1990s as a stricter and more flexible successor to SGML. It was originally designed to meet the challenges of large-scale electronic publishing. In other words, it was designed as an abstract language for assigning meaning to chunks of text in documents. More particularly, it was designed to be the abstract syntax notation of XHTML, the successor to HTML. The perceived advantage was that the web would rid itself of the disadvantages of HTML, chief among them that different HTML parsers would interpret the same HTML document in different ways.

As a language, XML is a byte-oriented serialized ASCII representation of the DOM. The DOM is the abstract data structure that is intended by the W3C to represent documents. As the DOM is intended to represent documents, it has a lot of structure that is very useful for documents. The DOM has more to it than is described here, but its basic function is to represent elements. An element has a name, can contain an ordered collection of text and other elements, and can have an unordered collection of attributes. Attributes are key-value pairs. For example, one DOM element could be a paragraph in a document. It could contain some text to begin with, then a sub-element, which contains some more text, followed by more text. The name of the sub-element would specify that the text contained within it is intended as a hyperlink. An attribute of the sub-element would specify what resource the text should link to.
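The paragraph-with-hyperlink element described above can be made concrete with a minimal Python sketch using the standard library's `xml.dom.minidom`. The element names `p` and `a`, the `href` attribute and the example URL are assumptions chosen for illustration; they are not taken from any particular schema.

```python
from xml.dom.minidom import parseString

# Hypothetical paragraph: text, then a hyperlink sub-element, then more text.
SOURCE = '<p>See <a href="http://example.org/">this page</a> for details.</p>'

def describe_children(xml_text):
    """Return one tuple per child of the root element, in document order."""
    root = parseString(xml_text).documentElement
    described = []
    for node in root.childNodes:
        if node.nodeType == node.TEXT_NODE:
            described.append(("text", node.data))
        elif node.nodeType == node.ELEMENT_NODE:
            # Attributes form an unordered collection of key-value pairs.
            attrs = dict(node.attributes.items())
            described.append(("element", node.tagName, attrs))
    return described

children = describe_children(SOURCE)
for child in children:
    print(child)
```

The root element ends up with three ordered children: a text node, the hyperlink element carrying its one attribute, and another text node.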

The fundamental problems with XML

Now, let me make one thing clear: Neither XML nor the DOM is a bad thing. Intended and designed for the representation of documents, they are very good at that. Not optimal, but certainly very good. As you will notice by looking at the source of this very document, I have used XHTML, which is an XML dialect, to write this document. Thus, I have nothing against the use of XML as a language for document representation.

Likewise, a pneumatic drill is very good for making holes in the ground. However, although it is possible, most people tend to avoid using it for pounding nails into walls – they use a hammer for that. The consensus is, of course, that one should use the right tool for the right job. The point that I will try to make is that XML is not the right tool for representing arbitrary data. I will now argue, point by point, why.

The DOM is too specialized

As mentioned, the basic node of information in the DOM – the element – has three parts of finer structure:

A name, represented by a string
An ordered set of text nodes or other elements
An unordered set of key-value pairs (attributes)

While this is well suited for document representation, it is simply too much structure for more generic data structures. This does not restrict the kinds of data structures that the DOM can represent, but it does mean that many data structures do not fit cleanly into the DOM, or fit into it in multiple ways with no clear reason to choose one way over another.

To give an example, I will use the textbook example for XML data storage – a book inventory. Books have some pieces of information which are useful to describe them, such as the title, author and ISBN. Using a hypothetical book tag for storage of a book, there are two distinct possibilities for storing a book, and there is no obvious reason why one should be preferred over the other:

Using subelements for each attribute:

```
<book>
    <title>The Return of the King</title>
    <author>J.R.R. Tolkien</author>
    <isbn>0 261 10237 0</isbn>
</book>
```

Using element attributes for each attribute:

```
<book title="The Return of the King"
      author="J.R.R. Tolkien"
      isbn="0 261 10237 0" />
```
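One practical consequence of having two equally plausible layouts is that code written for one cannot read the other. The following is a minimal Python sketch using the standard library's `xml.etree.ElementTree`; the function names are hypothetical and exist only for this illustration.

```python
import xml.etree.ElementTree as ET

AS_SUBELEMENTS = """<book>
    <title>The Return of the King</title>
    <author>J.R.R. Tolkien</author>
    <isbn>0 261 10237 0</isbn>
</book>"""

AS_ATTRIBUTES = """<book title="The Return of the King"
      author="J.R.R. Tolkien"
      isbn="0 261 10237 0" />"""

FIELDS = ("title", "author", "isbn")

def read_from_subelements(xml_text):
    """Read the fields from child elements of <book>."""
    book = ET.fromstring(xml_text)
    return {name: book.findtext(name) for name in FIELDS}

def read_from_attributes(xml_text):
    """Read the fields from attributes of <book>."""
    book = ET.fromstring(xml_text)
    return {name: book.get(name) for name in FIELDS}

first = read_from_subelements(AS_SUBELEMENTS)
second = read_from_attributes(AS_ATTRIBUTES)
print(first == second)  # the same information, reached by different code

# The reader for one layout finds nothing in the other layout.
mismatch = read_from_attributes(AS_SUBELEMENTS)
print(mismatch["title"])
```

Both documents carry the same three values, yet each needs its own reader, and nothing in XML itself says which layout a given producer will have picked.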

Furthermore, even though the DOM supports ordered sets of information, XML offers no way to write an ordered set of text nodes on its own: in XML, text nodes must be separated by sub-elements. This creates problems when attempting to describe lists in XML. I have seen several solutions to this problem. To list a few:

Using subelements. An example from the Tomcat configuration file:
```
<welcome-file-list>
    <welcome-file>index.html</welcome-file>
    <welcome-file>index.htm</welcome-file>
    <welcome-file>index.jsp</welcome-file>
</welcome-file-list>
```
The problem with this solution ought to be apparent: There is an enormous amount of redundant information. There is no obvious reason why there should have to be a name for each subelement storing a welcome file. In this particular case, more than half of the unparsed data (87 out of 171 bytes, counting the fewest bytes possible) has no value at all.
Using parsed attributes. An example from HTML(!):
```
<li class="seen flagged answered">
```
While this solution removes the redundant information from the example above, its problem is arguably even greater – it forces the application to implement yet another parser/printer, apart from that required for XML itself. While the parser may, in this case, be very simple, it is inelegant at best to use more than one format to structure the data.
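The byte counts quoted for the welcome-file-list example can be reproduced with a short Python sketch. It assumes the listing is counted exactly as printed, with four-space indentation and a newline after every line; that assumption is mine, chosen because it yields the total of 171. The sketch also computes the total with all of that whitespace stripped, in which case the share taken up by the item tags only grows.

```python
WELCOME_FILES = (
    "<welcome-file-list>\n"
    "    <welcome-file>index.html</welcome-file>\n"
    "    <welcome-file>index.htm</welcome-file>\n"
    "    <welcome-file>index.jsp</welcome-file>\n"
    "</welcome-file-list>\n"
)

# Total size of the listing as printed (assumed layout, see above).
total_bytes = len(WELCOME_FILES.encode("ascii"))

# Bytes spent on the start and end tags of the individual list items.
item_tag_bytes = sum(
    WELCOME_FILES.count(tag) * len(tag)
    for tag in ("<welcome-file>", "</welcome-file>")
)

# The same document with indentation and line breaks removed.
minimal = "".join(line.strip() for line in WELCOME_FILES.splitlines())
minimal_bytes = len(minimal.encode("ascii"))

print(item_tag_bytes, total_bytes, minimal_bytes)  # 87 171 154
```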

The DOM also lacks certain primitive data types, such as numbers, since they are seldom needed for documents. Therefore, programs using XML often store numbers as strings, which are then parsed and unparsed when XML is read and printed, respectively.
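A minimal Python sketch of that round trip, using the standard library's `xml.etree.ElementTree`. The `pages` element and its value are made up for the illustration and are not part of the book example above.

```python
import xml.etree.ElementTree as ET

# Hypothetical record; the <pages> element and the number are invented.
record = ET.fromstring("<book><pages>416</pages></book>")

raw = record.findtext("pages")  # the tree only ever hands back a string
pages = int(raw)                # the application has to parse the number

# Writing a number back means turning it into a string again.
record.find("pages").text = str(pages + 1)
serialized = ET.tostring(record, encoding="unicode")
print(serialized)
```

The tree never holds anything but text, so every numeric field costs the application one conversion on the way in and another on the way out.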

XML is too cumbersome

A commonly heralded feature of ASCII representations of data is that ASCII representation makes the data easy to view and edit using standard ASCII tools – text editors, printers, terminals, etc.

While this is certainly true for some ASCII file formats, and is also true for XML when compared to most, if not all, binary file formats, it is not true for XML when compared to many other ASCII file formats. The biggest problem with XML for this purpose (and this is also true when XML is used for representing documents) is the amount of redundant information. In XML, a DOM element is represented by a tag, followed by the element's contents, followed by an end tag. A tag is written as the element's name, surrounded by angle brackets (for example, <paragraph>). The end tag is represented in the same way, except for a slash preceding the element's name (for example, </paragraph>). Now, there is a valid reason for this, namely the intent of keeping backwards compatibility with SGML. However, unlike SGML, XML describes a pure tree structure, which means that when an end tag is read, there is only one single element that it can end. Therefore, there is no technical merit in reprinting the tag name.

The reprinting of the tag name in the end tag results in a waste of screen space when viewing an XML file, a waste of time when hand-editing XML files, and a waste of bandwidth when using XML as the basis of a network protocol. The worst aspect of this is the fact that the end tags are redundant even in the most general case – even though there are special cases involving particular schemas that involve even more redundancy, the end tags are, as described, logically and undeniably redundant in every imaginable XML schema that does not require backward compatibility with SGML.
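The claim that an end tag can only ever close one element can be illustrated with a small Python sketch. It is a deliberately simplified tag scanner, not a real XML parser: the regular expression is an assumption that ignores comments, CDATA sections, processing instructions and attribute values containing angle brackets. It keeps a stack of open elements and checks that each end tag names the element on top of the stack, which is the only name a well-formed document can put there.

```python
import re

# Simplified pattern for start tags, end tags and empty-element tags.
TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)[^<>]*?(/?)>")

def end_tags_are_predictable(xml_text):
    """Return True if every end tag names the innermost open element."""
    open_elements = []
    for closing, name, self_closing in TAG.findall(xml_text):
        if closing:
            # The scanner already knows which element must end here.
            if not open_elements or open_elements.pop() != name:
                return False
        elif not self_closing:
            open_elements.append(name)
    return not open_elements

nested_ok = end_tags_are_predictable("<a><b>text</b><c/></a>")
crossed = end_tags_are_predictable("<a><b></a></b>")
print(nested_ok, crossed)
```

Since the name in the end tag always equals the top of the stack in a well-formed document, it tells the reader of the document nothing that the nesting has not already told it.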

In addition, certain XML schemas are even more cumbersome than the aforementioned general case, mainly because they use very long tag names, even though the names themselves are unnecessary and therefore redundant.

For an example of a cumbersome XML schema, see the "welcome-file-list" example from the previous section. One thing to note in particular with that example is that the tag name "welcome-file" is completely redundant. The tags named by it exist only in order to be able to construct a list in the DOM, and thus have no actual need for a name, except that XML and the DOM require it.

A unified data structure is unnecessary

All programs already have their own internal data structures to represent the data they are working with. There is no obvious reason why a program should have to abandon its native data structures and stuff its data into a unified structure – such as the DOM – just to communicate it to another process or store it on disk. The only obvious reason that the DOM should be used for this is, of course, if the program already uses the DOM internally, as document editors would have good reason to do.

While the real reasons that XML has become as popular as it has for this purpose are close to unfathomable to me, I can only imagine that one of the foremost reasons would be that many modern development environments ship with a library containing an XML parser. Java certainly has one, C and C++ have many, Perl has at least one, Python likewise, and the same goes for almost every language worthy of mention. For that reason, some programmers use XML so that they do not have to write parsers and printers. To be honest, I do not know, but I would imagine that to be a prime reason for the usage of XML.

However, let us face the fact: It is not hard to write a parser. Even if it were, there are tools such as flex and bison readily available for the task. Either way, even when using XML, one is still forced to write another type of parser: One that "parses" the already parsed DOM into the program's own data structures.
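As an illustration of that second "parser", here is a minimal Python sketch that turns an already parsed `<book>` element into a native data structure. The `Book` class and the `book_from_element` function are hypothetical names introduced for this example, and the layout with one sub-element per field is assumed.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET

@dataclass
class Book:
    title: str
    author: str
    isbn: str

def book_from_element(element):
    """Walk the parsed tree by hand and build the program's own structure."""
    values = {}
    for field in ("title", "author", "isbn"):
        text = element.findtext(field)
        if text is None:
            # The XML parser accepted the document; validation is still our job.
            raise ValueError(f"<book> is missing <{field}>")
        values[field] = text.strip()
    return Book(**values)

tree = ET.fromstring(
    "<book>"
    "<title>The Return of the King</title>"
    "<author>J.R.R. Tolkien</author>"
    "<isbn>0 261 10237 0</isbn>"
    "</book>"
)
book = book_from_element(tree)
print(book)
```

The XML library has done its part before `book_from_element` even starts, yet the field lookup, the error handling and the conversion into `Book` still have to be written by hand.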

There are those who tout the usage of XML because XSLT can be used to easily convert one XML schema to another. As I have understood it, this is one of the cornerstones of Microsoft's BizTalk software. Any way I look at it, this point is completely bogus. Writing an XSLT schema adapter is just as much work as writing it in C, LISP, Java, C#, Prolog, or any other language. Arguably, it is even more work to write it in XSLT because XSLT itself is an XML dialect – compare with the point of XML being too cumbersome. As for the point of there being GUI tools available for creating XSLT adapters, there is no reason that there could not be a similar tool for creating adapters in any other language.


Author: Fredrik Tolf <fredrik@dolda2000.com>
Last changed: Tue Jun 20 19:35:55 2006