XIRUSS-T XML Usage Approach


As a generic versioning content management system, XIRUSS is not specific to XML--it can manage documents of any type in any format given appropriate importers and exporters. However, the motivation for building the XIRUSS system is primarily to explore techniques and practicalities of managing XML hyperdocuments. Therefore, XIRUSS as delivered is presented as being an XML-aware content management system. In particular, it provides a built-in generic XML importer that implements the importing of XInclude bounded object sets.

Because XIRUSS is an XML-aware content management system, users will have certain expectations for what it will do. XIRUSS will not satisfy many of them, because I have made a conscious and considered decision not to support certain features of XML. This document explains why these features of XML have been explicitly left unsupported in XIRUSS.

The XIRUSS-T system implementation imposes the following policies on the use of entities and DOCTYPE declarations:

These restrictions reflect both a basic philosophy of how XML should be used as well as recognition of practical realities in the context of XML-aware content management.

Entities Not Supported

The entity mechanism in XML provides three features:

XML defines two distinct namespaces of entities: general entities and parameter entities. General entities are used within element markup. Parameter entities are used only within markup declarations (i.e., within the DOCTYPE declaration). Parameter entities are always parsed (there is no notion of unparsed parameter entities in XML).

By definition every XML document consists of at least a "document entity", that is, the file that contains the XML declaration and the root element of the document. XML documents that have DOCTYPE declarations may have an unnamed external DTD subset entity, additional internal or external parameter entities, any number of internal parsed entities, and any number of external parsed entities.

General entities are nominally useful as a convenience but have the effect of seriously complicating the storage and management of XML documents. Parameter entities (DTDs) have similar complications (see "DTDs Not Supported"). In addition, all of the functionality of entities can be provided through purely element-based mechanisms, including, but not limited to, XInclude.

The key aspect of entities is that they are syntactic constructs that are resolved by the XML parser and not normally exposed to processing applications. Entity references are not objects in the way that elements are in that they cannot be given any instance-specific properties (you can't put attributes on an entity reference) and no addressing specification (e.g., XPath) provides a way to address entity references. In particular, there is no way to address an entity reference by unique identifier within a given document. The best you can do is address entity references by structural position.
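As a small illustration of this point, the following Python sketch parses a document that uses an internal entity and then looks at what the application actually receives. The sample document and the entity name "prod" are made up for illustration, and Python's standard ElementTree module merely stands in for any parser-based tool; it is not the XIRUSS-T implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the entity name "prod" is an assumption.
SOURCE = (
    '<!DOCTYPE doc [<!ENTITY prod "XIRUSS-T">]>'
    "<doc><p>Welcome to &prod;.</p></doc>"
)

root = ET.fromstring(SOURCE)

# The parser has already replaced the reference with its replacement text,
# so the application sees plain character data.
paragraph_text = root.find("p").text

# The only addressable objects in the parsed tree are elements; nothing in
# the tree corresponds to the entity reference itself.
element_tags = [element.tag for element in root.iter()]

print(paragraph_text)
print(element_tags)
```

Nothing in the resulting tree records that part of the paragraph came from an entity reference, so there is nothing for an address to select.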

External Parsed Entities Are Hard To Manage

By the general concept of "compound document" used in the XIRUSS system, an XML document composed of a document entity and one or more external parsed entities is a compound document, in the sense that two or more storage objects together form a single logical unit of processing or management. However, there is a key difference between an entity-based compound document and an XInclude-based compound document: the nature of the references.

In an entity-based compound document the references to the entities are not objects, which makes them hard to manage. In particular, it is not possible to point to an entity reference using a standard addressing syntax (e.g., XPath) in order to select it.

In addition, external parsed entities are not objects because they must be parsed and validated within the context of the document entity that directly or indirectly references them. For example, an external parsed entity that itself contains an entity reference cannot be processed or validated on its own, because the declaration of the referenced entity cannot be known given only the information in the external parsed entity--the declaration can only be in the document entity that itself declares the external parsed entity.
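The dependency on the document entity can be seen in a minimal sketch like the one below. The fragment content, the entity name "prod", and the helper parses_standalone are hypothetical. For brevity the second case inlines the fragment into a document that carries the declaration, rather than referencing it as a true external entity.

```python
import xml.etree.ElementTree as ET

# Hypothetical content of an external parsed entity, stored as its own file.
FRAGMENT = "<section>Welcome to &prod;.</section>"

def parses_standalone(text):
    """Return True if the text can be parsed with no outside declarations."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# The same content placed in a document that supplies the declaration of
# "prod" (inlined here instead of being pulled in by an entity reference).
IN_CONTEXT = '<!DOCTYPE section [<!ENTITY prod "XIRUSS-T">]>' + FRAGMENT

fragment_ok = parses_standalone(FRAGMENT)
in_context_ok = parses_standalone(IN_CONTEXT)

print(fragment_ok, in_context_ok)
```

The fragment is only processable once some other storage object provides the declaration, which is exactly what prevents it from being managed as an independent object.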

Finally, the unavoidable syntactic dependencies among the external parsed entities that make up a multi-entity document impose severe constraints and can lead to unresolvable deadlocks, such as two entities using the same attribute value such that an entity used in document 1 is valid but in the context of document 2 is invalid.

The whole point of XInclude is to replace the use of external parsed entities for re-use with true objects, namely complete XML documents. Therefore the XIRUSS system makes the conscious, deliberate, and considered decision to not preserve any external parsed entities on import, normalizing all documents to be "single-entity" documents, that is, documents consisting of only a document entity.
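To make the contrast concrete, here is a minimal sketch of an XInclude-based compound document, again using Python's standard library as a stand-in rather than anything from XIRUSS-T. The in-memory STORE, the file name chapter1.xml, and the loader function are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

XI = "http://www.w3.org/2001/XInclude"

# Hypothetical repository: each entry is a complete XML document.
STORE = {"chapter1.xml": "<chapter><title>Intro</title></chapter>"}

def loader(href, parse, encoding=None):
    """Resolve an include by looking it up in STORE (parse="xml" only)."""
    return ET.fromstring(STORE[href])

book = ET.fromstring(
    f'<book xmlns:xi="{XI}"><xi:include href="chapter1.xml"/></book>'
)

# Unlike entity references, the include references are ordinary elements:
# they can be selected with a path expression and carry attributes.
references = book.findall(f".//{{{XI}}}include")
hrefs = [reference.get("href") for reference in references]

# Resolving the includes is an explicit, application-level step.
ElementInclude.include(book, loader=loader)
included_title = book.find("chapter/title").text

print(hrefs, included_title)
```

Because the reference is an element and the target is a complete document, each can be stored, addressed, and validated on its own.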

How Internal Parsed Entities Complicate Import

In theory internal parsed entities should not be an issue for content management, at least as long as documents consist of only a document entity. This is because in this case their declaration and use is entirely contained within a single storage object and as content management is fundamentally at the storage object level they should not be a problem.

However, in practice internal entities are a pain because one of the main tools used to access and process XML documents, XSLT, does not preserve entities. (Entities can be preserved using DOM level 2 or greater or the lower-level SAX API for accessing XML documents.)

The immediate implication is that import processes that are XSLT-based will not be able to preserve entities on import because the entities will be resolved before the XSLT engine sees the data. This means that such import processes cannot satisfy the basic identity test where you import a document, immediately export it, and then compare the exported document with the original import source to determine if they are identical (modulo required normalization such as conversion of newlines in attribute values into spaces or rewriting of cross-document pointers). XSLT-based processes that must rewrite the documents on import will always fail this test if the input document uses entities.
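The identity test can be sketched in a few lines. In this sketch Python's ElementTree stands in for an XSLT engine, since both see only the parsed data. The helper names round_trip and passes_identity_test are hypothetical, and the comparison is plain string equality with none of the normalization mentioned above.

```python
import xml.etree.ElementTree as ET

def round_trip(source):
    """Import (parse) a document and immediately export (serialize) it."""
    return ET.tostring(ET.fromstring(source), encoding="unicode")

def passes_identity_test(source):
    """Compare the exported document with the original import source."""
    return round_trip(source) == source

# Hypothetical sample documents; the entity name "prod" is an assumption.
WITH_ENTITY = (
    '<!DOCTYPE doc [<!ENTITY prod "XIRUSS-T">]>'
    "<doc><p>Welcome to &prod;.</p></doc>"
)
WITHOUT_ENTITY = "<doc><p>Welcome to XIRUSS-T.</p></doc>"

print(passes_identity_test(WITH_ENTITY))
print(passes_identity_test(WITHOUT_ENTITY))
print(round_trip(WITH_ENTITY))
```

The document that uses an entity comes back with the declaration gone and the reference expanded, while the entity-free document survives the round trip unchanged.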

Thus an import process that wants to preserve internal entities must do its rewriting with a DOM level 2 or SAX-based processor (which is probably a good idea anyway). Likewise, any export processor that does rewriting will need to be SAX or DOM based.

By giving up the use of internal parsed entities the system can be made much simpler and the available selection of tools becomes much wider. By resolving all entities on import the system does not set an expectation of entity preservation that may well not be met in the future.

In a production system applied to a specific business use case in which the use of internal parsed entities is required for some reason it would of course be possible to implement the preservation of internal entities but I feel that internal entities are simply not worth the effort. Most, if not all, of the requirements that internal parsed entities are used to satisfy can be satisfied through simple element-based approaches, for example providing a way to use elements to declare values that can then be referenced using other elements. With modern XML processing tools such elements are trivial to implement and make it much easier to control and manage their creation and use.
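One possible shape for such an element-based approach is sketched below. The element names define and ref, the name attribute, and the function resolve_refs are all invented for this example; they are not part of XIRUSS-T or of any standard.

```python
import xml.etree.ElementTree as ET

def resolve_refs(root):
    """Replace each hypothetical <ref name="..."/> with its declared value."""
    values = {d.get("name"): (d.text or "") for d in root.iter("define")}
    for parent in list(root.iter()):
        previous = None
        for child in list(parent):
            if child.tag != "ref":
                previous = child
                continue
            # Splice the value, plus any text that followed the reference,
            # into the text that precedes the reference.
            text = values[child.get("name")] + (child.tail or "")
            if previous is None:
                parent.text = (parent.text or "") + text
            else:
                previous.tail = (previous.tail or "") + text
            parent.remove(child)
    return root

SOURCE = (
    "<doc>"
    '<define name="prod">XIRUSS-T</define>'
    '<p>Welcome to <ref name="prod"/>, enjoy.</p>'
    "</doc>"
)

resolved = resolve_refs(ET.fromstring(SOURCE))
print(resolved.find("p").text)
```

Because the declaration and the reference are both elements, they survive any element-aware tool chain, can carry attributes, and can be selected by ordinary path expressions until the application chooses to resolve them.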

In a re-use environment there is often a requirement to have the ability to refer to values that will depend entirely on the context in which the referencing element is used. In an XInclude-based system this requirement can only be satisfied using an element-based approach. This approach is often called "reflection" and has been used for many years, for example in the semiconductor industry standard Pinnacles PCIS document type.

Reflection is essentially a special case of use-by-reference and could be implemented as a direct use of XInclude. However, because the nature of the references is usually document type and data domain specific it is usually more effective in practice to define a specialized, use-case-specific reflection mechanism. For example, the data reflected may come from an external data source, such as a parts database, rather than from other XML elements. By having specialized elements it is easier to rebind the reference to a different data source in the future without having to change the markup of the reference.
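A specialized reflection element might be processed along the following lines. This is only a sketch under assumed names: the part-value element, its part and field attributes, the fill_reflections function, and both data sources are hypothetical. The lookup is passed in as a function so that the data source can be swapped without touching the markup.

```python
import xml.etree.ElementTree as ET
from typing import Callable

def fill_reflections(root, lookup: Callable[[str, str], str]):
    """Fill each hypothetical <part-value> element from the given source."""
    for element in root.iter("part-value"):
        element.text = lookup(element.get("part"), element.get("field"))
    return root

SOURCE = '<spec><p>Supply: <part-value part="X100" field="voltage"/></p></spec>'

# First binding: values held in a simple in-memory table.
TABLE = {("X100", "voltage"): "3.3 V"}
from_table = fill_reflections(
    ET.fromstring(SOURCE), lambda part, field: TABLE[(part, field)]
)

# Second binding: a stand-in for an external parts database.
def parts_database(part, field):
    return {"X100": {"voltage": "5.0 V"}}[part][field]

from_database = fill_reflections(ET.fromstring(SOURCE), parts_database)

print(from_table.find(".//part-value").text)
print(from_database.find(".//part-value").text)
```

The source document is identical in both runs; only the function bound to the reference changes.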

Because DOM Level 2 does make it relatively easy to preserve internal entity declarations and references, XIRUSS-T does preserve them on import. However, I recommend against the use of internal entities in general, in favor of element-based reflection mechanisms.

Special Characters and Numeric Character References

Because XML requires the support for Unicode as an encoding for the characters in XML documents, there is no need, in general, to use either special character entities or numeric character references (e.g., &#x1234;) for characters--all necessary characters can simply be stored directly as characters. Of course, this requires that you use editors that support Unicode, but most, if not all, modern XML-aware editors and most text editors support Unicode.

Special characters are really a holdover from the old SGML days when character sets were much more limited. With Unicode the complications of character entities are really not justifiable. Because they are just a specialized use of internal parsed entities, all the same arguments against internal parsed entities also apply to special character entities.

The use of numeric character references is a matter of taste or practicality and should not affect the management of the data in that changing a character to or from a numeric character reference does not change the actual XML content--the character is the same and the parser will report exactly the same character in both cases. In general, it is a function of XML serializers to decide when to use numeric character references, normally as a function of the actual encoding used to store the XML file.

Within XIRUSS-T, all documents are stored in either UTF-8 or UTF-16 encoding, depending on either what the encoding of the imported document is or what options have been set on import. On export, XIRUSS-T can generate any encoding supported by the available XML serializers.
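Both halves of this claim can be checked with a short sketch. Python's ElementTree is used here only as a convenient parser and serializer, and the character U+1234 and the element name c are arbitrary choices for the example.

```python
import xml.etree.ElementTree as ET

# The same character written three ways: directly, as a hexadecimal
# reference, and as a decimal reference.
literal = ET.fromstring("<c>\u1234</c>")
hex_ref = ET.fromstring("<c>&#x1234;</c>")
dec_ref = ET.fromstring("<c>&#4660;</c>")

# The parser reports exactly the same character in all three cases.
same_character = literal.text == hex_ref.text == dec_ref.text

# The serializer decides whether a reference is needed, based on the
# target encoding: ASCII cannot hold the character, UTF-8 can.
ascii_bytes = ET.tostring(literal, encoding="us-ascii")
utf8_bytes = ET.tostring(literal, encoding="utf-8")

print(same_character)
print(ascii_bytes)
print(utf8_bytes)
```

The ASCII output falls back to a numeric character reference, while the UTF-8 output stores the character directly; the content seen by a parser is the same either way.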

DTDs Not Supported

XML DTDs are a useful mechanism for defining constraints on documents and, until very recently, were the only mechanism that could be counted on to be supported in all XML tools for which validation was a requirement (e.g., editors, validating parsers, etc.). The DTD syntax is elegant and optimally adapted to its task. It is compact and easy to learn and read. I have no quarrel with DTD syntax as a constraint specification language.

The problem is the way that DTDs are managed syntactically within XML documents. DTDs are either syntactically part of the document entity's internal DTD subset or held in one or more external parameter entities. This presents two problems.

The first problem is that external parameter entities are entities with all of the problems that entails. External parameter entities are syntactically part of the document that references them and therefore must be imported and exported with it. This is a tractable problem but runs into two practical realities.

First, most users of XML pretend as though external DTD subsets were re-usable objects (they are not, because they are entities, which, as explained above, are not reliably re-usable) and therefore expect them to occur in one place in the system. In practice, this is not warranted because there is nothing magical about the public or system identifier used in a DOCTYPE declaration to refer to the external DTD subset--it's just a file pointer and there is no reasonable expectation that two files with the same name but in different locations on the file system will in fact have the same content. This means that on import you must always create a new copy of the external DTD subset if it is not being imported from exactly the same place the second time. That is, unless you are 100% sure that the DTD you're importing is the same as one you already have in the system, you cannot safely import the new one as a new version of an existing one.
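The point that a shared file name guarantees nothing about content can be shown with a trivial sketch. The paths, the DTD fragments, and the helper names below are invented for illustration, and an in-memory dictionary stands in for the file system.

```python
import posixpath

# Hypothetical file system: two locations holding a file named "book.dtd".
FILES = {
    "/projects/a/book.dtd": b"<!ELEMENT book (chapter+)>",
    "/projects/b/book.dtd": b"<!ELEMENT book (title, chapter+)>",
}

def same_name(path_1, path_2):
    """Compare only the file names, as a naive importer might."""
    return posixpath.basename(path_1) == posixpath.basename(path_2)

def same_content(path_1, path_2):
    """Compare the actual bytes of the two files."""
    return FILES[path_1] == FILES[path_2]

path_a, path_b = "/projects/a/book.dtd", "/projects/b/book.dtd"
print(same_name(path_a, path_b), same_content(path_a, path_b))
```

An importer that treated the second file as a new version of the first, on the strength of the name alone, would silently link two unrelated sets of declarations.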

Second, DOM parsers do not provide an API for accessing the parameter entity structure of external parameter entities. This means that importing an external DTD subset, which is often composed of several external parsed entities, cannot be implemented in a DOM processor (and may not be implementable in a SAX processor). There are tools that do DTD parsing that could be used, but again this significantly complicates the import processing.

Because general support for XML schemas is widely available (e.g., in the Xerces parser package) and is now in most, if not all, commercial XML tools, the use of DTDs can be safely replaced with the use of schemas simply in order to avoid the syntactic complications that DTDs imply.

Irrespective of the difference in functionality and syntax between DTDs and schemas, schemas have the singularly compelling feature that they are never a syntactic part of the documents they govern. This means that schemas are true objects and can therefore be used and managed independently of the documents they govern. In addition, schemas may be applied to documents unilaterally, meaning that a given document can be validated on import or otherwise characterized by a governing schema without that document needing to first declare conformance to the schema.

For these reasons, the XIRUSS system uses schemas exclusively to associate imported documents with governing business rules, including declarations of element types and attributes that may have application-specific meaning that needs to be supported and accounted for during import (e.g., specialized hyperlinks, specialized XInclude elements, etc.).

On import, XIRUSS-T will use referenced external DTD subsets to parse the document but will not preserve the external DTD reference in the imported result. Any internal parsed or unparsed entity declarations will be preserved, simply because it's easy enough to do so using DOM level 2 and Xerces.