For almost a decade, XML has been recommended by many parties as a universal format for data representation. This fact is so widely known that it hardly merits mention. A far from exhaustive list of proponents would include the following.
Even Microsoft® Corporation, renowned far and wide for its own proprietary protocols and file formats, has begun adopting XML for various purposes, including the communication protocol for Microsoft BizTalk and the storage formats for Microsoft Office.
However, I believe that XML is extremely poorly suited to this kind of work, and I believe that I have very good reasons for thinking so. I would even go as far as to say that the use of XML for this purpose is directly harmful to the computing community in general. The purpose of this document is to voice that opinion, along with my reasons for it. If you believe that I am wrong about anything in this document, please let me know (my e-mail address is at the bottom of the page).
While I am on a larger crusade against the misuse of web standards, including HTTP, the XML issue is the one that I feel is of greatest consequence, and I will therefore dedicate this document specifically to it.
XML was invented in the mid-1990s as a more strict and flexible successor to SGML. It was originally designed to meet the challenges of large-scale electronic publishing. In other words, it was designed as an abstract language for assigning meaning to chunks of text in documents. More particularly, it was designed to be the abstract syntax notation of XHTML, the successor to HTML. The perceived advantage was that the web would rid itself of the disadvantages of HTML, primarily that different HTML parsers would interpret the same HTML document in different ways.
As a language, XML is a byte-oriented serialized ASCII representation of the DOM. The DOM is the abstract data structure that is intended by the W3C to represent documents, and it therefore has a lot of structure that is very useful for documents. The full DOM is somewhat more elaborate than described here, but its basic function is to represent elements. An element has a name, can contain an ordered collection of text and other elements, and can have an unordered collection of attributes. Attributes are key-value pairs. For example, one DOM element could be a paragraph in a document. It could contain some text to begin with, then a sub-element containing some more text, followed by more text. The name of the sub-element would specify that the text contained within it is intended as a hyperlink. An attribute of the sub-element would specify what resource the text should link to.
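The paragraph-with-a-hyperlink example can be sketched in code. The following is only an illustration, assuming Python's standard `xml.dom.minidom` module; the markup, the URL and the helper `describe_children` are made up for the purpose.

```python
from xml.dom import minidom

# Hypothetical paragraph: text, then a hyperlink sub-element, then more text.
SOURCE = '<p>See the <a href="http://example.com/">example site</a> for details.</p>'


def describe_children(element):
    """Return the element's children, in document order, as (kind, value) pairs."""
    described = []
    for node in element.childNodes:
        if node.nodeType == node.TEXT_NODE:
            described.append(("text", node.data))
        elif node.nodeType == node.ELEMENT_NODE:
            described.append(("element", node.tagName))
    return described


paragraph = minidom.parseString(SOURCE).documentElement
link = paragraph.getElementsByTagName("a")[0]

print(paragraph.tagName)             # the element's name
print(describe_children(paragraph))  # ordered mix of text and sub-elements
print(link.getAttribute("href"))     # an attribute: a key-value pair
```

The three printed values correspond to the three parts of an element: its name, its ordered contents, and its attributes.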
Now, let me make one thing clear: neither XML nor the DOM is a bad thing. Intended and designed for the representation of documents, they are very good at that. Not optimal, but certainly very good. As you will notice by looking at the source of this very document, I have used XHTML, which is an XML dialect, to write it. Thus, I have nothing against the use of XML as a language for document representation.
Likewise, a pneumatic drill is very good for making holes in the ground. However, although it is possible, most people tend to avoid using it for pounding nails into walls – they use a hammer for that. The consensus is, of course, that one should use the right tool for the right job. The point that I will try to make is that XML is not the right tool for the representation of arbitrary data. I will now argue why, point by point.
As mentioned, the basic node of information in the DOM – the element – has three parts of finer structure: a name, an ordered collection of text and sub-elements, and an unordered collection of attributes.
While this is well suited for document representation, it is simply too much structure for more generic data. This does not restrict the kinds of data structures that the DOM can represent, but it does mean that many data structures do not fit cleanly in the DOM, or fit in multiple ways with no clear reason to choose one way over another.
To make an example, I will use the textbook example for XML data storage – a book inventory. Books have some pieces of information which are useful to describe them, such as the title, author and ISBN number. Using a hypothetical book tag for storage of a book, there are two distinct possibilities for storing a book, and there is no obvious reason why one should be preferred over the other:
<book>
    <title>The Return of the King</title>
    <author>J.R.R. Tolkien</author>
    <isbn>0 261 10237 0</isbn>
</book>
<book title="The Return of the King" author="J.R.R. Tolkien" isbn="0 261 10237 0" />
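To illustrate why the choice matters to a program, here is a minimal sketch assuming Python's standard `xml.etree.ElementTree` module. The two helper functions are hypothetical. Each reads the title from one of the two layouts, and neither works on the other.

```python
import xml.etree.ElementTree as ET

AS_ELEMENTS = (
    "<book><title>The Return of the King</title>"
    "<author>J.R.R. Tolkien</author>"
    "<isbn>0 261 10237 0</isbn></book>"
)
AS_ATTRIBUTES = (
    '<book title="The Return of the King" '
    'author="J.R.R. Tolkien" isbn="0 261 10237 0" />'
)


def title_from_elements(book):
    # The title is the text of a child element.
    return book.findtext("title")


def title_from_attributes(book):
    # The title is an attribute value on the book element itself.
    return book.get("title")


elements_book = ET.fromstring(AS_ELEMENTS)
attributes_book = ET.fromstring(AS_ATTRIBUTES)

print(title_from_elements(elements_book))
print(title_from_attributes(attributes_book))

# A reader written for one layout finds nothing in the other.
print(title_from_elements(attributes_book))   # None
print(title_from_attributes(elements_book))   # None
```

Both documents carry the same information, yet a consumer has to know in advance which of the two conventions the producer picked.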
Furthermore – even though the DOM supports ordered sets of information, XML offers no way to parse an ordered set of text nodes. In XML, text nodes must be separated by sub-elements. This fact creates problems when attempting to describe lists in XML. I have seen several solutions to this problem. To list a few:
<welcome-file-list>
    <welcome-file>index.html</welcome-file>
    <welcome-file>index.htm</welcome-file>
    <welcome-file>index.jsp</welcome-file>
</welcome-file-list>
The problem with this solution ought to be apparent: There is an enormous amount of redundant information. There is no obvious reason why there should have to be a name for each subelement storing a welcome file. In this particular case, more than half of the unparsed data (87 out of 171 bytes, counting the fewest bytes possible) has no value at all.
<li class="seen flagged answered">
While this solution removes the redundant information from the example above, its problem is arguably even greater: it forces the application to implement yet another parser/printer, apart from that required for XML itself. While the parser may, in this case, be very simple, it is inelegant at best to use more than one format to structure the data.
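Returning to the welcome-file-list example, the byte count can be checked with a short sketch. The layout used here (one tag per line, four-space indentation, newline-terminated lines) is an assumption on my part; it is simply one layout under which the figures of 87 and 171 bytes come out. The function names are made up.

```python
# Assumed layout: one tag per line, four-space indentation, trailing newlines.
FILES = ["index.html", "index.htm", "index.jsp"]


def render(files):
    """Serialize the list of welcome files in the assumed layout."""
    lines = ["<welcome-file-list>"]
    lines += [f"    <welcome-file>{name}</welcome-file>" for name in files]
    lines.append("</welcome-file-list>")
    return "".join(line + "\n" for line in lines)


def item_tag_bytes(files):
    """Bytes spent on the <welcome-file> ... </welcome-file> pairs alone."""
    pair = len("<welcome-file>") + len("</welcome-file>")
    return pair * len(files)


document = render(FILES)
total = len(document.encode("ascii"))
overhead = item_tag_bytes(FILES)
print(overhead, total)  # 87 171
```

Only the 28 bytes of file names are actual payload; the start and end tags of the list items alone account for 87 bytes.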
The DOM also lacks certain primitive data types, such as numbers, since they are seldom needed for documents. Therefore, programs using XML often store numbers as strings, which must be parsed when the XML is read and converted back to strings when it is printed.
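A minimal sketch of that round trip, assuming Python's standard `xml.etree.ElementTree` module; the `item`, `quantity` and `price` names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical inventory record; element and attribute names are made up.
SOURCE = '<item quantity="3"><price>9.50</price></item>'

item = ET.fromstring(SOURCE)

# Everything the XML parser hands back is a string.
raw_quantity = item.get("quantity")
raw_price = item.findtext("price")

# The application has to run a second parse to obtain numbers...
quantity = int(raw_quantity)
price = float(raw_price)

# ...and convert them back to strings before the document is written out.
item.set("quantity", str(quantity + 1))
print(ET.tostring(item, encoding="unicode"))
```

The XML parser itself has no notion that `quantity` is an integer, so validating and converting it is left entirely to the application.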
A commonly heralded feature of ASCII representations of data is that ASCII representation makes the data easy to view and edit using standard ASCII tools – text editors, printers, terminals, etc.
While this is certainly true for some ASCII file formats, and is also true for XML when compared to most, if not all, binary file formats, it is not true for XML when compared to many other ASCII file formats. The biggest problem with XML for this purpose (and this is also true when XML is used for representing documents) is the amount of redundant information. In XML, a DOM element is represented by a tag, followed by the element's contents, followed by an end tag. A tag is written as the element's name, surrounded by angle brackets (for example, <paragraph>). The end tag is represented in the same way, except for a slash preceding the element's name (for example, </paragraph>). Now, there is a valid reason for this, namely the intent of keeping backwards compatibility with SGML. However, unlike SGML, XML describes a pure tree structure, which means that when an end tag is read, there is only one single element that it can end. Therefore, there is no technical merit in reprinting the tag name.
The reprinting of the tag name in the end tag results in a waste of screen space when viewing an XML file, a waste of time when hand-editing XML files, and a waste of bandwidth when using XML as the basis of a network protocol. The worst aspect of this is that the names in the end tags are redundant even in the most general case: although particular schemas may involve even more redundancy, the end tag names are, as described, logically and undeniably redundant in every imaginable XML schema that does not require backward compatibility with SGML.
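The claim that an end tag can only close one element can be sketched with a stack. The example below assumes Python's standard `xml.parsers.expat` module; the function name and the sample document are made up. It records every open element and checks that each end tag names exactly the innermost one. Note that expat itself rejects mismatched tags, so for any well-formed document the name in the end tag is fully determined by the stack.

```python
import xml.parsers.expat


def end_tag_names_are_predictable(document):
    """Return True if every end tag names the innermost open element."""
    open_elements = []
    predictable = True

    def on_start(name, attrs):
        open_elements.append(name)

    def on_end(name):
        nonlocal predictable
        # The only element an end tag can close is the innermost open one.
        expected = open_elements.pop()
        if name != expected:
            predictable = False

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.Parse(document, True)
    return predictable


result = end_tag_names_are_predictable(
    "<document><paragraph>Some <em>text</em>.</paragraph></document>"
)
print(result)  # True
```

A document whose end tags are out of order, such as `<a><b></a></b>`, never reaches the check at all: the parser raises `ExpatError` because it is not well-formed XML.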
In addition, certain XML schemas are even more cumbersome than the aforementioned general case, mainly because they use very long tag names, even though the names themselves are unnecessary and therefore redundant.
For an example of a cumbersome XML schema, see the "welcome-file-list" example from the previous section. One thing to note in particular with that example is that the tag name "welcome-file" is completely redundant. The tags named by it exist only in order to be able to construct a list in the DOM, and thus have no actual need for a name, except that XML and the DOM require it.
All programs already have their own internal data structures to represent the data they are working with. There is no obvious reason why a program should have to abandon its native data structures and stuff its data into a unified structure – such as the DOM – just to communicate it to another process or store it on disk. The only obvious case in which the DOM should be used for this is, of course, when the program already uses the DOM internally, as document editors would have good reason to do.
While the real reasons that XML has become as popular as it has for this purpose are close to unfathomable to me, I can only imagine that one of the foremost is that many modern development environments ship with a library containing an XML parser. Java certainly has one, C and C++ have many, Perl has at least one, Python likewise, and the same goes for almost every language worthy of mention. For that reason, some programmers use XML so that they do not have to write parsers and printers. To be honest, I do not know, but I would imagine that to be a prime reason for the usage of XML.
However, let us face it: It is not hard to write a parser. Even if it were, there are tools such as flex and bison readily available for the task. Either way, even when using XML, one is still forced to write another type of parser: One that "parses" the already parsed DOM into the program's own data structures.
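That second "parser" can be sketched as follows, assuming Python's standard `xml.etree.ElementTree` module. The `Book` class, the `inventory` document and the `load_books` function are hypothetical and only stand in for a program's own data structures.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Book:
    """The program's own representation of a book."""
    title: str
    author: str
    isbn: str


# Hypothetical inventory document using the element-based layout.
SOURCE = """<inventory>
  <book>
    <title>The Return of the King</title>
    <author>J.R.R. Tolkien</author>
    <isbn>0 261 10237 0</isbn>
  </book>
</inventory>"""


def load_books(text):
    root = ET.fromstring(text)  # first parse: XML text to a generic tree
    books = []
    for node in root.findall("book"):
        # Second "parse": generic tree to the program's own structure.
        books.append(Book(
            title=node.findtext("title"),
            author=node.findtext("author"),
            isbn=node.findtext("isbn"),
        ))
    return books


books = load_books(SOURCE)
print(books)
```

The library removes only the first step. The tree walk in `load_books` still has to be written, and kept in sync with the schema, by hand.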
There are those who tout the usage of XML because XSLT can be used to easily convert one XML schema to another. As I understand it, this is one of the cornerstones of Microsoft's BizTalk software. Any way I look at it, this point is completely bogus. Writing an XSLT schema adapter is just as much work as writing it in C, LISP, Java, C#, Prolog, or any other language. Arguably, it is even more work to write it in XSLT because XSLT itself is an XML dialect (compare with the point of XML being too cumbersome). As for the point of there being GUI tools available for creating XSLT adapters, there is no reason that there could not be a similar tool for creating adapters in any other language.
Author: Fredrik Tolf <fredrik@dolda2000.com>